Operational Runbooks
Step-by-step procedures for common operational tasks and incident response
📋 Table of Contents​
- Daily Operations
- Data Sync Runbooks
- Incident Response Runbooks
- Database Operations
- Deployment Runbooks
- Integration Runbooks
- Recovery Runbooks
Daily Operations​
Morning Health Check​
Purpose: Verify all systems are operational at start of business day
Steps:
- Check service health dashboard
# All services should show "healthy"
- complyai-frontend: https://app.complyai.io/health
- complyai-core: https://api.complyai.io/health
- complyai-api: internal health check
- complyai-maestro: internal health check
- Review overnight alerts
- Check #alerts Slack channel
- Review CloudWatch alarms
- Check error rates in monitoring
- Verify scheduled jobs ran successfully
- Ad sync (every 15 minutes)
- Score calculation (every 30 minutes)
- Daily reports (6 AM UTC)
- Check external service status
- Meta Platform Status: https://developers.facebook.com/status/
- Stripe Status: https://status.stripe.com/
- AWS Status: https://status.aws.amazon.com/
Escalation: If any critical service is down, trigger incident response (RB-INC-001)
Ad Sync Monitoring​
Purpose: Ensure ad data is syncing from Meta correctly
Check Points:
- Last sync timestamp per organization
SELECT o.name, MAX(oa.last_sync) as last_sync
FROM organizations o
JOIN org_ad_accounts oaa ON o.id = oaa.organization_id
JOIN org_ads oa ON oaa.id = oa.org_ad_account_id
GROUP BY o.name
ORDER BY last_sync DESC;
- Sync failure count (last 24 hours)
SELECT COUNT(*) as failed_syncs
FROM activity_events
WHERE action = 'ad_sync_failed'
AND created_at > NOW() - INTERVAL '24 hours';
- Queue depth for sync tasks
- Check Celery flower dashboard
- Queue `ad_sync` should be < 100 pending
If sync is stale (> 30 min):
- Check Celery workers are running
- Check Meta API rate limits
- Check for token expiration
- Manually trigger sync if needed (see RB-SYNC-001; a quick check-and-trigger sketch follows below)
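A quick way to confirm workers are alive and to enqueue a one-off sync from a shell is sketched below. The broker URL, task name `tasks.sync_organization`, and `ad_sync` queue routing are assumptions; substitute the real broker URL and registered task name.

```python
# check_and_trigger_sync.py - minimal sketch; broker URL and task name are assumptions
from celery import Celery

celery_app = Celery(broker="redis://localhost:6379/0")  # replace with the production broker URL

# Ping workers: returns {worker_name: {"ok": "pong"}} for each responder, or nothing if none reply
replies = celery_app.control.inspect(timeout=5).ping()
if not replies:
    raise SystemExit("No Celery workers responded - check the celery-worker deployment")
print(f"{len(replies)} worker(s) alive: {sorted(replies)}")

# Manually enqueue a sync for one organization on the ad_sync queue
org_id = 1234  # the organization whose ads are stale
result = celery_app.send_task("tasks.sync_organization", args=[org_id], queue="ad_sync")
print(f"Queued sync task {result.id}")
```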
Data Sync Runbooks​
RB-SYNC-001: Manual Ad Sync Trigger​
When to use: Ad sync is stale or customer reports missing ads
Prerequisites:
- SSH access to bastion
- Valid Meta access token for organization
Steps:
- Identify the organization
SELECT id, name, meta_business_id
FROM organizations
WHERE name ILIKE '%customer_name%';
- Trigger sync via API (internal)
curl -X POST https://api-internal.complyai.io/sync/organization/{org_id} \
-H "X-Internal-Key: ${INTERNAL_API_KEY}" -
Monitor sync progress
# Watch Celery logs
kubectl logs -f deployment/celery-worker -n production | grep "{org_id}"
- Verify completion
SELECT COUNT(*) as ads_synced
FROM org_ads
WHERE org_ad_account_id IN (
SELECT id FROM org_ad_accounts WHERE organization_id = {org_id}
)
AND updated_at > NOW() - INTERVAL '5 minutes';
Rollback: N/A (sync is additive)
RB-SYNC-002: Token Refresh​
When to use: Meta API returning 401/expired token errors
Prerequisites:
- Access to AWS Secrets Manager
- Fernet encryption key
Steps:
- Identify affected accounts
SELECT oba.id, oba.business_id, o.name
FROM org_business_accounts oba
JOIN organizations o ON oba.organization_id = o.id
WHERE oba.token_expires_at < NOW() + INTERVAL '7 days';
- For User Access Tokens (60-day):
- User must re-authenticate via OAuth flow
- Send notification to organization admin
INSERT INTO notifications (user_id, type, message, created_at)
SELECT u.id, 'token_expiring', 'Your Meta connection needs refresh', NOW()
FROM users u
JOIN user_organizations uo ON u.id = uo.user_id
WHERE uo.organization_id = {org_id} AND uo.role = 'owner';
- For System User Tokens (long-lived):
- Generate new token in Meta Business Settings
- Encrypt and update in database
from cryptography.fernet import Fernet
# ENCRYPTION_KEY is the Fernet key from AWS Secrets Manager; new_token comes from Meta Business Settings
fernet = Fernet(ENCRYPTION_KEY)
encrypted_token = fernet.encrypt(new_token.encode())
# Store encrypted_token in org_business_accounts.system_user_access_token
- Verify token works
curl "https://graph.facebook.com/v19.0/me?access_token={token}"
RB-SYNC-003: Webhook Resubscription​
When to use: Webhooks not being received, subscription expired
Steps:
- Check current subscriptions
curl "https://graph.facebook.com/v19.0/{app_id}/subscriptions?access_token={app_token}" -
Verify webhook endpoint is accessible
curl -X GET "https://api.complyai.io/webhooks/meta/verify?hub.mode=subscribe&hub.verify_token={verify_token}" -
Resubscribe to webhooks
curl -X POST "https://graph.facebook.com/v19.0/{app_id}/subscriptions" \
-d "object=ad_account" \
-d "callback_url=https://api.complyai.io/webhooks/meta" \
-d "fields=ad_status,ad_creative" \
-d "verify_token={verify_token}" \
-d "access_token={app_token}" -
Test webhook delivery
- Make a test change in Meta Business Manager
- Verify event received in logs
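For reference, Meta verifies a webhook endpoint with a GET request carrying `hub.mode=subscribe`, `hub.verify_token`, and `hub.challenge`; the endpoint must return the challenge value when the token matches. A minimal Flask sketch of that handshake (the production handler lives in the API service; the route and token source here are illustrative):

```python
# Minimal sketch of Meta's webhook verification handshake - illustrative, not the production handler
from flask import Flask, request

app = Flask(__name__)
VERIFY_TOKEN = "replace-with-the-configured-verify-token"

@app.route("/webhooks/meta/verify", methods=["GET"])
def verify_meta_webhook():
    # Meta sends hub.mode, hub.verify_token, and hub.challenge as query parameters
    if (request.args.get("hub.mode") == "subscribe"
            and request.args.get("hub.verify_token") == VERIFY_TOKEN):
        return request.args.get("hub.challenge", ""), 200
    return "Verification failed", 403
```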
Incident Response Runbooks​
RB-INC-001: Service Outage​
Severity: P1/P2
Symptoms:
- Health check failures
- User reports of errors
- Spike in 5xx responses
Immediate Actions:
- Create incident channel: #incident-YYYYMMDD-brief-desc
- Assign Incident Commander (IC)
- Page on-call if not already engaged
Diagnosis Steps:
- Check which services are affected
# ECS service status
aws ecs describe-services --cluster production --services \
complyai-core complyai-api complyai-maestro complyai-frontend
- Check recent deployments
# Last 5 deployments
aws ecs describe-services --cluster production --services {service} --query 'services[].deployments[:5]'
- Check database connectivity
psql $DATABASE_URL -c "SELECT 1;"
- Check external dependencies
- Meta API status
- Stripe API status
- Auth0 status
Recovery Actions:
- If deployment issue: Roll back to previous task definition
aws ecs update-service --cluster production --service {service} \
--task-definition {previous-task-def}
- If database issue: Failover to replica if available
aws rds failover-db-cluster --db-cluster-identifier complyai-prod
- If external dependency: Enable the circuit breaker and serve cached/degraded responses (a minimal sketch follows below)
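In this context, "enable circuit breaker" means: after repeated failures, stop calling the broken dependency for a cooldown window and serve cached or degraded data instead. A minimal illustrative breaker (not the production implementation; `fetch_ads_from_meta` and `fetch_ads_from_cache` are placeholder names):

```python
# Minimal illustrative circuit breaker - placeholder names, not the production implementation
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after_s=60):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (calls allowed)

    def call(self, func, fallback, *args, **kwargs):
        # While open, short-circuit to the fallback until the cooldown elapses
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                return fallback(*args, **kwargs)
            self.opened_at = None  # half-open: allow one trial call
            self.failures = 0
        try:
            result = func(*args, **kwargs)
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback(*args, **kwargs)

# Usage sketch: serve cached ad data while the Meta API is down
# breaker = CircuitBreaker()
# ads = breaker.call(fetch_ads_from_meta, fetch_ads_from_cache, org_id)
```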
Post-Incident:
- Document timeline in incident channel
- Schedule post-mortem within 48 hours
- Create follow-up tickets for improvements
RB-INC-002: Data Breach Response​
Severity: P1
Symptoms:
- Unauthorized access detected
- Customer reports data exposure
- Security tool alerts
Immediate Actions (First 15 minutes):
- Contain - Revoke compromised credentials immediately
-- Disable affected user
UPDATE users SET is_active = false WHERE id = {user_id};
-- Invalidate sessions
DELETE FROM user_sessions WHERE user_id = {user_id};
- Preserve - Capture evidence before changes
# Snapshot affected database
aws rds create-db-snapshot --db-instance-identifier complyai-prod \
--db-snapshot-identifier incident-$(date +%Y%m%d-%H%M)
# Export relevant logs
aws logs filter-log-events --log-group-name /ecs/complyai-core \
--start-time {epoch_ms} --end-time {epoch_ms} > incident_logs.json
- Notify - Alert leadership immediately
- Page CEO, CTO, Legal
- Do NOT communicate externally yet
Investigation (First 4 hours):
- Determine scope
- Which records accessed?
- Which users affected?
- What data was exposed?
- Determine vector
- How did attacker gain access?
- Credential compromise?
- Application vulnerability?
- Document everything
- Timeline of events
- Actions taken
- Evidence collected
Notification Requirements:
| Trigger | Action | Timeline |
|---|---|---|
| EU user data | Notify supervisory authority | 72 hours |
| CA resident data | Prepare notification | "Expedient" |
| Any PII | Notify affected users | After investigation |
RB-INC-003: Database Performance Degradation​
Severity: P2/P3
Symptoms:
- Slow API response times
- High database CPU
- Lock contention
Diagnosis:
- Check current queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 seconds'
ORDER BY duration DESC;
- Check for locks
SELECT blocked_locks.pid AS blocked_pid,
blocking_locks.pid AS blocking_pid,
blocked_activity.query AS blocked_query
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
WHERE NOT blocked_locks.granted;
- Check table bloat
SELECT relname, n_dead_tup, n_live_tup,
round(n_dead_tup::numeric / nullif(n_live_tup, 0) * 100, 2) as dead_ratio
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC LIMIT 10;
Remediation:
- Kill long-running queries (if safe)
SELECT pg_terminate_backend({pid});
- Run VACUUM if bloat detected
VACUUM ANALYZE {table_name};
- Add missing index (if scan detected)
CREATE INDEX CONCURRENTLY idx_name ON table_name (column);
- Scale up RDS if resource constrained
aws rds modify-db-instance --db-instance-identifier complyai-prod \
--db-instance-class db.r5.xlarge --apply-immediately
Database Operations​
RB-DB-001: Schema Migration​
When to use: Deploying database schema changes
Pre-Migration Checklist:
- Migration tested in staging
- Rollback script prepared
- Backup taken
- Stakeholders notified
- Low-traffic window identified
Steps:
- Take backup before migration
aws rds create-db-snapshot --db-instance-identifier complyai-prod \
--db-snapshot-identifier pre-migration-$(date +%Y%m%d-%H%M)
- Run migration
# Using Alembic (Flask)
alembic upgrade head
# Or using Django
python manage.py migrate
- Verify migration
-- Check new columns/tables exist
\d+ table_name
- Monitor for issues (15-30 minutes)
- Application errors
- Performance degradation
- Data integrity issues
Rollback:
# Alembic rollback
alembic downgrade -1
# Or restore from snapshot (nuclear option)
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier complyai-prod-restored \
--db-snapshot-identifier pre-migration-{timestamp}
RB-DB-002: Data Backfill​
When to use: Populating new columns, fixing data issues
Steps:
- Create backfill script
# backfill_example.py
import psycopg2
from tqdm import tqdm
BATCH_SIZE = 1000
conn = psycopg2.connect(DATABASE_URL)  # DATABASE_URL: connection string for the target database
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM target_table WHERE new_column IS NULL")
total = cur.fetchone()[0]
for _ in tqdm(range(0, total, BATCH_SIZE)):  # one iteration per batch; rows are selected by the IS NULL filter below
cur.execute("""
UPDATE target_table
SET new_column = computed_value
WHERE id IN (
SELECT id FROM target_table
WHERE new_column IS NULL
LIMIT %s
)
""", (BATCH_SIZE,))
conn.commit()
- Test on staging first
- Run during low-traffic period
- Monitor progress and database load
Best Practices:
- Use batches (1000-5000 rows)
- Include progress logging
- Add sleep between batches if needed
- Run ANALYZE after large backfills
Deployment Runbooks​
RB-DEPLOY-001: Standard Service Deployment​
Steps:
- Merge PR to main branch
- CI/CD automatically:
- Runs tests
- Builds Docker image
- Pushes to ECR
- Updates ECS task definition
- Triggers rolling deployment
- Monitor deployment
aws ecs describe-services --cluster production --services {service} \
--query 'services[].deployments'
- Verify new version
curl https://api.complyai.io/health | jq '.version'
- Monitor for 15 minutes (a CloudWatch sketch follows these steps)
- Error rates
- Response times
- User reports
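One way to watch error rates during the 15-minute window is to pull the ALB 5xx count from CloudWatch. A boto3 sketch under stated assumptions (the region, namespace, and `LoadBalancer` dimension value all depend on the actual infrastructure):

```python
# Sketch: count 5xx responses over the last 15 minutes - region and LoadBalancer value are assumptions
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
end = datetime.now(timezone.utc)
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_Target_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/complyai-prod/abc123"}],  # placeholder value
    StartTime=end - timedelta(minutes=15),
    EndTime=end,
    Period=300,
    Statistics=["Sum"],
)
total_5xx = sum(dp["Sum"] for dp in resp["Datapoints"])
print(f"5xx responses in the last 15 minutes: {int(total_5xx)}")
```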
Rollback:
# Get previous task definition
aws ecs describe-services --cluster production --services {service} \
--query 'services[].taskDefinition'
# Roll back to previous
aws ecs update-service --cluster production --service {service} \
--task-definition {previous-revision}
RB-DEPLOY-002: Database Migration Deployment​
Steps:
- Deploy with migration flag disabled
- Verify application works with old schema
- Run migration (see RB-DB-001)
- Deploy with migration flag enabled (if feature-flagged)
- Monitor
Integration Runbooks​
RB-INT-001: Meta API Rate Limit Recovery​
When to use: Hitting Meta API rate limits (error code 4 or 17)
Symptoms:
- Error: "Application request limit reached"
- Error: "User request limit reached"
Diagnosis:
- Check current rate limit status (see the header-parsing sketch after these steps)
curl "https://graph.facebook.com/v19.0/me?access_token={token}&debug=all"
# Look at the x-business-use-case-usage response header
- Check ComplyAI request logs
SELECT COUNT(*), date_trunc('hour', created_at) as hour
FROM api_request_logs
WHERE service = 'meta'
AND created_at > NOW() - INTERVAL '24 hours'
GROUP BY hour
ORDER BY hour;
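Meta returns its rate-limit usage in JSON-encoded response headers (`x-business-use-case-usage`, `x-app-usage`, `x-ad-account-usage`). A small sketch that makes the same `/me` call as above and pretty-prints whichever usage headers are present (`ACCESS_TOKEN` is a placeholder for the token under investigation):

```python
# Sketch: inspect Meta's rate-limit usage headers - ACCESS_TOKEN is a placeholder
import json

import requests

ACCESS_TOKEN = "replace-with-the-affected-token"
resp = requests.get(
    "https://graph.facebook.com/v19.0/me",
    params={"access_token": ACCESS_TOKEN},
    timeout=30,
)
for header in ("x-business-use-case-usage", "x-app-usage", "x-ad-account-usage"):
    raw = resp.headers.get(header)
    if raw:
        print(header, json.dumps(json.loads(raw), indent=2))
```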
Remediation:
- Enable rate limit backoff (if not already)
- Reduce sync frequency temporarily
- Prioritize high-value accounts
- Wait for rate limit reset (typically 1 hour)
Prevention:
- Implement exponential backoff (see the sketch below)
- Use batch endpoints where possible
- Cache responses where appropriate
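A minimal backoff sketch for Graph API reads, assuming the rate limit surfaces as Graph error codes 4 or 17 in the response body (as in the symptoms above); the helper name and retry budget are illustrative, not the production client:

```python
# Sketch: exponential backoff with jitter for rate-limited Graph API calls (illustrative)
import random
import time

import requests

RATE_LIMIT_CODES = {4, 17}  # application / user request limit reached

def get_with_backoff(url, params, max_retries=5):
    for attempt in range(max_retries):
        resp = requests.get(url, params=params, timeout=30)
        error_code = resp.json().get("error", {}).get("code")
        if error_code in RATE_LIMIT_CODES:
            # Back off roughly 1s, 2s, 4s, 8s, 16s with up to 1s of jitter
            time.sleep((2 ** attempt) + random.random())
            continue
        resp.raise_for_status()  # surface non-rate-limit errors immediately
        return resp.json()
    raise RuntimeError(f"Still rate limited after {max_retries} attempts: {url}")
```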
RB-INT-002: Stripe Webhook Replay​
When to use: Missed or failed Stripe webhooks
Steps:
- Identify missed events in Stripe Dashboard
- Webhooks → Select endpoint → Events
- Get event details
curl https://api.stripe.com/v1/events/{event_id} \
-u sk_live_xxx:
- Replay via Stripe Dashboard
- Click "Resend" on the event
- Or process manually
import stripe
stripe.api_key = STRIPE_SECRET_KEY  # same secret key used above
# Fetch the event and run it through the normal webhook handler
event = stripe.Event.retrieve(event_id)
handle_stripe_webhook(event)
Recovery Runbooks​
RB-REC-001: Full Database Restore​
When to use: Catastrophic data loss, corruption beyond repair
Prerequisites:
- Identify target snapshot/point-in-time
- Notify all stakeholders
- Prepare for potential data loss
Steps:
- Identify recovery point
# List available snapshots
aws rds describe-db-snapshots --db-instance-identifier complyai-prod
# Or check PITR window
aws rds describe-db-instances --db-instance-identifier complyai-prod \
--query 'DBInstances[].LatestRestorableTime'
- Restore to new instance
# From snapshot
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier complyai-prod-restored \
--db-snapshot-identifier {snapshot-id}
# Or point-in-time
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier complyai-prod \
--target-db-instance-identifier complyai-prod-restored \
--restore-time {ISO-timestamp}
- Verify restored data
-- Run data integrity checks
SELECT COUNT(*) FROM users;
SELECT COUNT(*) FROM organizations;
SELECT MAX(created_at) FROM activity_events;
- Update application config to point to restored DB
- Rename instances
# Rename old
aws rds modify-db-instance --db-instance-identifier complyai-prod \
--new-db-instance-identifier complyai-prod-old
# Rename restored
aws rds modify-db-instance --db-instance-identifier complyai-prod-restored \
--new-db-instance-identifier complyai-prod
- Verify application connectivity
- Document data loss window
RB-REC-002: Service Recovery from Scratch​
When to use: Need to rebuild a service completely
Steps:
- Ensure Docker image is available in ECR
- Create new task definition (or use existing)
- Create/update ECS service
aws ecs create-service \
--cluster production \
--service-name {service} \
--task-definition {task-def} \
--desired-count 2 \
--launch-type FARGATE \
--network-configuration "awsvpcConfiguration={subnets=[...],securityGroups=[...]}"
- Update load balancer target group
- Verify health checks passing
- Update DNS if needed
Runbook Index​
| ID | Name | Severity | Category |
|---|---|---|---|
| RB-SYNC-001 | Manual Ad Sync Trigger | P3 | Data Sync |
| RB-SYNC-002 | Token Refresh | P2 | Data Sync |
| RB-SYNC-003 | Webhook Resubscription | P2 | Data Sync |
| RB-INC-001 | Service Outage | P1/P2 | Incident |
| RB-INC-002 | Data Breach Response | P1 | Incident |
| RB-INC-003 | Database Performance | P2/P3 | Incident |
| RB-DB-001 | Schema Migration | P3 | Database |
| RB-DB-002 | Data Backfill | P3 | Database |
| RB-DEPLOY-001 | Standard Deployment | P3 | Deployment |
| RB-DEPLOY-002 | Migration Deployment | P2 | Deployment |
| RB-INT-001 | Meta Rate Limit | P2 | Integration |
| RB-INT-002 | Stripe Webhook Replay | P3 | Integration |
| RB-REC-001 | Full Database Restore | P1 | Recovery |
| RB-REC-002 | Service Recovery | P1 | Recovery |
Related Documents​
- Service Architecture - System design
- Third-Party Integrations - External APIs
- Data Governance - Incident response policies
- Troubleshooting - Common issue resolution
Last Updated: December 2024