Operational Runbooks

Step-by-step procedures for common operational tasks and incident response


📋 Table of Contents

  1. Daily Operations
  2. Data Sync Runbooks
  3. Incident Response Runbooks
  4. Database Operations
  5. Deployment Runbooks
  6. Integration Runbooks
  7. Recovery Runbooks

Daily Operations

Morning Health Check

Purpose: Verify that all systems are operational at the start of the business day

Steps:

  1. Check service health dashboard

    # All services should show "healthy"
    - complyai-frontend: https://app.complyai.io/health
    - complyai-core: https://api.complyai.io/health
    - complyai-api: internal health check
    - complyai-maestro: internal health check
  2. Review overnight alerts

    • Check #alerts Slack channel
    • Review CloudWatch alarms
    • Check error rates in monitoring
  3. Verify scheduled jobs ran successfully

    • Ad sync (every 15 minutes)
    • Score calculation (every 30 minutes)
    • Daily reports (6 AM UTC)
  4. Check external service status

Escalation: If any critical service is down, trigger the incident response process (see RB-INC-001)
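
For step 1, a small script can replace clicking through each endpoint. This is a minimal sketch using the requests library, assuming the public /health endpoints return HTTP 200 when healthy; complyai-api and complyai-maestro use internal checks and are omitted here:

    # morning_health_check.py — a sketch of step 1; assumes the public
    # /health endpoints return HTTP 200 when healthy.
    import requests

    ENDPOINTS = {
        "complyai-frontend": "https://app.complyai.io/health",
        "complyai-core": "https://api.complyai.io/health",
    }

    for name, url in ENDPOINTS.items():
        try:
            resp = requests.get(url, timeout=10)
            status = "healthy" if resp.ok else f"unhealthy (HTTP {resp.status_code})"
        except requests.RequestException as exc:
            status = f"unreachable ({exc})"
        print(f"{name}: {status}")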


Ad Sync Monitoring

Purpose: Ensure ad data is syncing from Meta correctly

Check Points:

  1. Last sync timestamp per organization

    SELECT o.name, MAX(oa.last_sync) as last_sync
    FROM organizations o
    JOIN org_ad_accounts oaa ON o.id = oaa.organization_id
    JOIN org_ads oa ON oaa.id = oa.org_ad_account_id
    GROUP BY o.name
    ORDER BY last_sync DESC;
  2. Sync failure count (last 24 hours)

    SELECT COUNT(*) as failed_syncs
    FROM activity_events
    WHERE action = 'ad_sync_failed'
    AND created_at > NOW() - INTERVAL '24 hours';
  3. Queue depth for sync tasks

    • Check the Celery Flower dashboard
    • Queue ad_sync should be < 100 pending (see the sketch below for a direct broker check)
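
If Flower is unavailable, the queue depth can also be read straight from the broker. A minimal sketch, assuming Celery uses a Redis broker with its default layout (pending tasks in a Redis list named after the queue); REDIS_URL is an illustrative variable name:

    # queue_depth.py — assumes a Redis broker; REDIS_URL is illustrative
    import os

    import redis

    r = redis.Redis.from_url(os.environ.get("REDIS_URL", "redis://localhost:6379/0"))
    depth = r.llen("ad_sync")  # Celery's default broker layout: one list per queue
    print(f"ad_sync pending: {depth}")
    if depth >= 100:
        print("WARNING: queue depth above runbook threshold (100)")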

If sync is stale (> 30 min):

  1. Check Celery workers are running
  2. Check Meta API rate limits
  3. Check for token expiration
  4. Manually trigger sync if needed

Data Sync Runbooks

RB-SYNC-001: Manual Ad Sync Trigger

When to use: Ad sync is stale or a customer reports missing ads

Prerequisites:

  • SSH access to the bastion host
  • Valid Meta access token for organization

Steps:

  1. Identify the organization

    SELECT id, name, meta_business_id 
    FROM organizations
    WHERE name ILIKE '%customer_name%';
  2. Trigger sync via API (internal)

    curl -X POST https://api-internal.complyai.io/sync/organization/{org_id} \
    -H "X-Internal-Key: ${INTERNAL_API_KEY}"
  3. Monitor sync progress

    # Watch Celery logs
    kubectl logs -f deployment/celery-worker -n production | grep "{org_id}"
  4. Verify completion

    SELECT COUNT(*) as ads_synced
    FROM org_ads
    WHERE org_ad_account_id IN (
    SELECT id FROM org_ad_accounts WHERE organization_id = {org_id}
    )
    AND updated_at > NOW() - INTERVAL '5 minutes';

Rollback: N/A (sync is additive)
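
Steps 2 and 4 can be scripted together. A sketch: the endpoint and X-Internal-Key header come from this runbook, while the environment variable names and polling interval are illustrative:

    # manual_sync.py — sketch of steps 2 and 4
    import os
    import time

    import psycopg2
    import requests

    org_id = int(os.environ["ORG_ID"])  # from the step 1 lookup

    # Step 2: trigger the sync via the internal API
    resp = requests.post(
        f"https://api-internal.complyai.io/sync/organization/{org_id}",
        headers={"X-Internal-Key": os.environ["INTERNAL_API_KEY"]},
        timeout=30,
    )
    resp.raise_for_status()

    # Step 4: poll for recently updated ads (up to ~5 minutes)
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    cur = conn.cursor()
    for _ in range(10):
        time.sleep(30)
        cur.execute(
            """
            SELECT COUNT(*) FROM org_ads
            WHERE org_ad_account_id IN (
                SELECT id FROM org_ad_accounts WHERE organization_id = %s
            ) AND updated_at > NOW() - INTERVAL '5 minutes'
            """,
            (org_id,),
        )
        print("ads synced in last 5 min:", cur.fetchone()[0])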


RB-SYNC-002: Token Refresh

When to use: The Meta API is returning 401 / expired-token errors

Prerequisites:

  • Access to AWS Secrets Manager
  • Fernet encryption key

Steps:

  1. Identify affected accounts

    SELECT oba.id, oba.business_id, o.name
    FROM org_business_accounts oba
    JOIN organizations o ON oba.organization_id = o.id
    WHERE oba.token_expires_at < NOW() + INTERVAL '7 days';
  2. For User Access Tokens (60-day):

    • The user must re-authenticate via the OAuth flow
    • Send a notification to the organization admin:

    INSERT INTO notifications (user_id, type, message, created_at)
    SELECT u.id, 'token_expiring', 'Your Meta connection needs refresh', NOW()
    FROM users u
    JOIN user_organizations uo ON u.id = uo.user_id
    WHERE uo.organization_id = {org_id} AND uo.role = 'owner';
  3. For System User Tokens (long-lived):

    • Generate a new token in Meta Business Settings
    • Encrypt it and update the database:

    from cryptography.fernet import Fernet

    fernet = Fernet(ENCRYPTION_KEY)
    encrypted_token = fernet.encrypt(new_token.encode())
    # Store in org_business_accounts.system_user_access_token
  4. Verify token works

    curl "https://graph.facebook.com/v19.0/me?access_token={token}"

RB-SYNC-003: Webhook Resubscription

When to use: Webhooks are not being received or the subscription has expired

Steps:

  1. Check current subscriptions

    curl "https://graph.facebook.com/v19.0/{app_id}/subscriptions?access_token={app_token}"
  2. Verify webhook endpoint is accessible

    curl -X GET "https://api.complyai.io/webhooks/meta/verify?hub.mode=subscribe&hub.verify_token={verify_token}"
  3. Resubscribe to webhooks

    curl -X POST "https://graph.facebook.com/v19.0/{app_id}/subscriptions" \
    -d "object=ad_account" \
    -d "callback_url=https://api.complyai.io/webhooks/meta" \
    -d "fields=ad_status,ad_creative" \
    -d "verify_token={verify_token}" \
    -d "access_token={app_token}"
  4. Test webhook delivery

    • Make a test change in Meta Business Manager
    • Verify event received in logs
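
Step 2 only succeeds if the endpoint implements Meta's verification handshake: Meta sends hub.mode, hub.verify_token, and hub.challenge, and the endpoint must echo the challenge back. A minimal Flask sketch (the route path matches step 2; the environment variable name is illustrative):

    # Minimal Flask sketch of the Meta webhook verification handshake
    import os

    from flask import Flask, abort, request

    app = Flask(__name__)
    VERIFY_TOKEN = os.environ["META_VERIFY_TOKEN"]  # illustrative name

    @app.route("/webhooks/meta/verify", methods=["GET"])
    def verify_webhook():
        if (request.args.get("hub.mode") == "subscribe"
                and request.args.get("hub.verify_token") == VERIFY_TOKEN):
            # Echo the challenge so Meta accepts the subscription
            return request.args.get("hub.challenge", ""), 200
        abort(403)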

Incident Response Runbooks

RB-INC-001: Service Outage

Severity: P1/P2

Symptoms:

  • Health check failures
  • User reports of errors
  • Spike in 5xx responses

Immediate Actions:

  1. Create incident channel: #incident-YYYYMMDD-brief-desc
  2. Assign Incident Commander (IC)
  3. Page on-call if not already engaged

Diagnosis Steps:

  1. Check which services are affected

    # ECS service status
    aws ecs describe-services --cluster production --services \
    complyai-core complyai-api complyai-maestro complyai-frontend
  2. Check recent deployments

    # Last 5 deployments
    aws ecs describe-services --cluster production --query 'services[].deployments[:5]'
  3. Check database connectivity

    psql $DATABASE_URL -c "SELECT 1;"
  4. Check external dependencies

    • Meta API status
    • Stripe API status
    • Auth0 status

Recovery Actions:

  • If deployment issue: Roll back to previous task definition

    aws ecs update-service --cluster production --service {service} \
    --task-definition {previous-task-def}
  • If database issue: Failover to replica if available

    aws rds failover-db-cluster --db-cluster-identifier complyai-prod
  • If external dependency: Enable the circuit breaker and serve cached/degraded responses (see the sketch below)
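
For reference, a minimal circuit-breaker sketch; the thresholds are illustrative and this is not the production implementation:

    # Minimal circuit-breaker sketch; thresholds are illustrative
    import time

    class CircuitBreaker:
        def __init__(self, max_failures=5, reset_after=60):
            self.max_failures = max_failures
            self.reset_after = reset_after
            self.failures = 0
            self.opened_at = None

        def call(self, fn, fallback):
            # While open, short-circuit to the cached/degraded fallback
            if self.opened_at and time.time() - self.opened_at < self.reset_after:
                return fallback()
            try:
                result = fn()
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = time.time()  # open the circuit
                return fallback()
            self.failures = 0
            self.opened_at = None  # close the circuit on success
            return result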

Post-Incident:

  1. Document timeline in incident channel
  2. Schedule post-mortem within 48 hours
  3. Create follow-up tickets for improvements

RB-INC-002: Data Breach Response

Severity: P1

Symptoms:

  • Unauthorized access detected
  • Customer reports data exposure
  • Security tool alerts

Immediate Actions (First 15 minutes):

  1. Contain - Revoke compromised credentials immediately

    -- Disable affected user
    UPDATE users SET is_active = false WHERE id = {user_id};
    -- Invalidate sessions
    DELETE FROM user_sessions WHERE user_id = {user_id};
  2. Preserve - Capture evidence before changes

    # Snapshot affected database
    aws rds create-db-snapshot --db-instance-identifier complyai-prod \
    --db-snapshot-identifier incident-$(date +%Y%m%d-%H%M)

    # Export relevant logs
    aws logs filter-log-events --log-group-name /ecs/complyai-core \
    --start-time {epoch_ms} --end-time {epoch_ms} > incident_logs.json
  3. Notify - Alert leadership immediately

    • Page CEO, CTO, Legal
    • Do NOT communicate externally yet

Investigation (First 4 hours):

  1. Determine scope

    • Which records accessed?
    • Which users affected?
    • What data was exposed?
  2. Determine vector

    • How did attacker gain access?
    • Credential compromise?
    • Application vulnerability?
  3. Document everything

    • Timeline of events
    • Actions taken
    • Evidence collected

Notification Requirements:

| Trigger | Action | Timeline |
|---------|--------|----------|
| EU user data | Notify supervisory authority | 72 hours |
| CA resident data | Prepare notification | "Expedient" |
| Any PII | Notify affected users | After investigation |

RB-INC-003: Database Performance Degradation

Severity: P2/P3

Symptoms:

  • Slow API response times
  • High database CPU
  • Lock contention

Diagnosis:

  1. Check current queries

    SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
    FROM pg_stat_activity
    WHERE (now() - pg_stat_activity.query_start) > interval '5 seconds'
    ORDER BY duration DESC;
  2. Check for locks

    SELECT blocked_locks.pid AS blocked_pid,
    blocking_locks.pid AS blocking_pid,
    blocked_activity.query AS blocked_query
    FROM pg_catalog.pg_locks blocked_locks
    JOIN pg_catalog.pg_locks blocking_locks
    ON blocking_locks.locktype = blocked_locks.locktype
    AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
    AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
    AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
    AND blocking_locks.pid != blocked_locks.pid
    JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
    WHERE NOT blocked_locks.granted
    AND blocking_locks.granted;
  3. Check table bloat

    SELECT relname, n_dead_tup, n_live_tup, 
    round(n_dead_tup::numeric / nullif(n_live_tup, 0) * 100, 2) as dead_ratio
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC LIMIT 10;

Remediation:

  • Kill long-running queries (if safe)

    SELECT pg_terminate_backend({pid});
  • Run VACUUM if bloat detected

    VACUUM ANALYZE {table_name};
  • Add a missing index (if a sequential scan is detected)

    CREATE INDEX CONCURRENTLY idx_name ON table_name (column);
  • Scale up RDS if resource constrained

    aws rds modify-db-instance --db-instance-identifier complyai-prod \
    --db-instance-class db.r5.xlarge --apply-immediately

Database Operations

RB-DB-001: Schema Migration

When to use: Deploying database schema changes

Pre-Migration Checklist:

  • Migration tested in staging
  • Rollback script prepared
  • Backup taken
  • Stakeholders notified
  • Low-traffic window identified

Steps:

  1. Take backup before migration

    aws rds create-db-snapshot --db-instance-identifier complyai-prod \
    --db-snapshot-identifier pre-migration-$(date +%Y%m%d-%H%M)
  2. Run migration

    # Using Alembic (Flask)
    alembic upgrade head

    # Or using Django
    python manage.py migrate
  3. Verify migration

    -- Check new columns/tables exist
    \d+ table_name
  4. Monitor for issues (15-30 minutes)

    • Application errors
    • Performance degradation
    • Data integrity issues

Rollback:

# Alembic rollback
alembic downgrade -1

# Or restore from snapshot (nuclear option)
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier complyai-prod-restored \
--db-snapshot-identifier pre-migration-{timestamp}
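
A migration is only safely reversible if its downgrade() is written up front (the "rollback script prepared" checklist item). A minimal Alembic sketch; the table, column, and revision names are illustrative:

    # versions/20241201_add_new_column.py — illustrative Alembic migration
    import sqlalchemy as sa
    from alembic import op

    revision = "20241201_add_new_column"
    down_revision = "previous_revision_id"  # illustrative

    def upgrade():
        # Add as nullable first; backfill and tighten constraints
        # in a separate step (see RB-DB-002)
        op.add_column("target_table", sa.Column("new_column", sa.Text(), nullable=True))

    def downgrade():
        op.drop_column("target_table", "new_column")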

RB-DB-002: Data Backfill

When to use: Populating new columns, fixing data issues

Steps:

  1. Create backfill script

    # backfill_example.py
    import os
    import time

    import psycopg2
    from tqdm import tqdm

    BATCH_SIZE = 1000
    DATABASE_URL = os.environ["DATABASE_URL"]

    conn = psycopg2.connect(DATABASE_URL)
    cur = conn.cursor()

    cur.execute("SELECT COUNT(*) FROM target_table WHERE new_column IS NULL")
    total = cur.fetchone()[0]

    # The NULL filter shrinks on each pass, so range() is only a progress counter
    for _ in tqdm(range(0, total, BATCH_SIZE)):
        cur.execute("""
            UPDATE target_table
            SET new_column = computed_value  -- placeholder expression
            WHERE id IN (
                SELECT id FROM target_table
                WHERE new_column IS NULL
                LIMIT %s
            )
        """, (BATCH_SIZE,))
        conn.commit()
        time.sleep(0.1)  # brief pause between batches to limit DB load
  2. Test on staging first

  3. Run during low-traffic period

  4. Monitor progress and database load

Best Practices:

  • Use batches (1000-5000 rows)
  • Include progress logging
  • Add sleep between batches if needed
  • Run ANALYZE after large backfills

Deployment Runbooks

RB-DEPLOY-001: Standard Service Deployment

Steps:

  1. Merge PR to main branch

  2. CI/CD automatically:

    • Runs tests
    • Builds Docker image
    • Pushes to ECR
    • Updates ECS task definition
    • Triggers rolling deployment
  3. Monitor deployment

    aws ecs describe-services --cluster production --services {service} \
    --query 'services[].deployments'
  4. Verify new version

    curl https://api.complyai.io/health | jq '.version'
  5. Monitor for 15 minutes (see the polling sketch below)

    • Error rates
    • Response times
    • User reports
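
Steps 4 and 5 can be scripted as a polling loop. A sketch: the health endpoint comes from step 4, while the expected-version variable and polling interval are illustrative:

    # verify_deploy.py — sketch of steps 4-5
    import os
    import time

    import requests

    expected = os.environ["EXPECTED_VERSION"]  # supplied by the operator
    deadline = time.time() + 15 * 60  # monitor for 15 minutes

    while time.time() < deadline:
        resp = requests.get("https://api.complyai.io/health", timeout=10)
        version = resp.json().get("version")
        print(f"status={resp.status_code} version={version}")
        if not resp.ok or version != expected:
            print("WARNING: unhealthy or unexpected version; consider rollback")
        time.sleep(60)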

Rollback:

# Get previous task definition
aws ecs describe-services --cluster production --services {service} \
--query 'services[].taskDefinition'

# Roll back to previous
aws ecs update-service --cluster production --service {service} \
--task-definition {previous-revision}

RB-DEPLOY-002: Database Migration Deployment

Steps:

  1. Deploy with migration flag disabled
  2. Verify application works with old schema
  3. Run migration (see RB-DB-001)
  4. Deploy with migration flag enabled (if feature-flagged)
  5. Monitor

Integration Runbooks

RB-INT-001: Meta API Rate Limit Recovery

When to use: Hitting Meta API rate limits (error code 4 or 17)

Symptoms:

  • Error: "Application request limit reached"
  • Error: "User request limit reached"

Diagnosis:

  1. Check current rate limit status

    curl "https://graph.facebook.com/v19.0/me?access_token={token}&debug=all"
    # Look at x-business-use-case-usage header
  2. Check ComplyAI request logs

    SELECT COUNT(*), date_trunc('hour', created_at) as hour
    FROM api_request_logs
    WHERE service = 'meta'
    AND created_at > NOW() - INTERVAL '24 hours'
    GROUP BY hour
    ORDER BY hour;

Remediation:

  1. Enable rate limit backoff (if not already)
  2. Reduce sync frequency temporarily
  3. Prioritize high-value accounts
  4. Wait for rate limit reset (typically 1 hour)

Prevention:

  • Implement exponential backoff
  • Use batch endpoints where possible
  • Cache responses where appropriate
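
A minimal backoff sketch for Graph API calls; the retry count and base delay are illustrative, and the rate-limit codes are the ones named in this runbook:

    # Exponential backoff with jitter for Meta Graph API calls
    import random
    import time

    import requests

    RATE_LIMIT_CODES = {4, 17}  # application / user request limit reached

    def call_with_backoff(url, params, max_retries=5):
        for attempt in range(max_retries):
            resp = requests.get(url, params=params, timeout=30)
            error = resp.json().get("error", {}) if resp.status_code >= 400 else {}
            if error.get("code") not in RATE_LIMIT_CODES:
                return resp
            # Back off exponentially, with jitter to avoid thundering herds
            time.sleep((2 ** attempt) + random.uniform(0, 1))
        raise RuntimeError("Meta rate limit persisted after retries")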

RB-INT-002: Stripe Webhook Replay

When to use: Missed or failed Stripe webhooks

Steps:

  1. Identify missed events in Stripe Dashboard

    • Webhooks → Select endpoint → Events
  2. Get event details

    curl https://api.stripe.com/v1/events/{event_id} \
    -u sk_live_xxx:
  3. Replay via Stripe Dashboard

    • Click "Resend" on the event
  4. Or process manually

    # Fetch the event and run it through the normal webhook handler
    import stripe
    stripe.api_key = STRIPE_SECRET_KEY  # loaded from Secrets Manager

    event = stripe.Event.retrieve(event_id)
    handle_stripe_webhook(event)

Recovery Runbooks

RB-REC-001: Full Database Restore

When to use: Catastrophic data loss, corruption beyond repair

Prerequisites:

  • Identify target snapshot/point-in-time
  • Notify all stakeholders
  • Prepare for potential data loss

Steps:

  1. Identify recovery point

    # List available snapshots
    aws rds describe-db-snapshots --db-instance-identifier complyai-prod

    # Or check PITR window
    aws rds describe-db-instances --db-instance-identifier complyai-prod \
    --query 'DBInstances[].LatestRestorableTime'
  2. Restore to new instance

    # From snapshot
    aws rds restore-db-instance-from-db-snapshot \
    --db-instance-identifier complyai-prod-restored \
    --db-snapshot-identifier {snapshot-id}

    # Or point-in-time
    aws rds restore-db-instance-to-point-in-time \
    --source-db-instance-identifier complyai-prod \
    --target-db-instance-identifier complyai-prod-restored \
    --restore-time {ISO-timestamp}
  3. Verify restored data

    -- Run data integrity checks
    SELECT COUNT(*) FROM users;
    SELECT COUNT(*) FROM organizations;
    SELECT MAX(created_at) FROM activity_events;
  4. Update application config to point to restored DB

  5. Rename instances

    # Rename old
    aws rds modify-db-instance --db-instance-identifier complyai-prod \
    --new-db-instance-identifier complyai-prod-old

    # Rename restored
    aws rds modify-db-instance --db-instance-identifier complyai-prod-restored \
    --new-db-instance-identifier complyai-prod
  6. Verify application connectivity

  7. Document data loss window


RB-REC-002: Service Recovery from Scratch

When to use: A service needs to be rebuilt completely from scratch

Steps:

  1. Ensure Docker image is available in ECR

  2. Create new task definition (or use existing)

  3. Create/update ECS service

    aws ecs create-service \
    --cluster production \
    --service-name {service} \
    --task-definition {task-def} \
    --desired-count 2 \
    --launch-type FARGATE \
    --network-configuration "awsvpcConfiguration={subnets=[...],securityGroups=[...]}"
  4. Update load balancer target group

  5. Verify health checks passing

  6. Update DNS if needed


Runbook Index

| ID | Name | Severity | Category |
|----|------|----------|----------|
| RB-SYNC-001 | Manual Ad Sync Trigger | P3 | Data Sync |
| RB-SYNC-002 | Token Refresh | P2 | Data Sync |
| RB-SYNC-003 | Webhook Resubscription | P2 | Data Sync |
| RB-INC-001 | Service Outage | P1/P2 | Incident |
| RB-INC-002 | Data Breach Response | P1 | Incident |
| RB-INC-003 | Database Performance | P2/P3 | Incident |
| RB-DB-001 | Schema Migration | P3 | Database |
| RB-DB-002 | Data Backfill | P3 | Database |
| RB-DEPLOY-001 | Standard Deployment | P3 | Deployment |
| RB-DEPLOY-002 | Migration Deployment | P2 | Deployment |
| RB-INT-001 | Meta Rate Limit | P2 | Integration |
| RB-INT-002 | Stripe Webhook Replay | P3 | Integration |
| RB-REC-001 | Full Database Restore | P1 | Recovery |
| RB-REC-002 | Service Recovery | P1 | Recovery |


Last Updated: December 2024