Operational Runbooks
Step-by-step procedures for common operational tasks and incident response
📋 Table of Contents​
- Daily Operations
- Data Sync Runbooks
- Incident Response Runbooks
- Database Operations
- Deployment Runbooks
- Integration Runbooks
- Recovery Runbooks
Daily Operations​
Morning Health Check​
Purpose: Verify all systems are operational at start of business day
Steps:
- Check service health dashboard
# All services should show "healthy"
- complyai-frontend: https://app.complyai.io/health
- complyai-core: https://api.complyai.io/health
- complyai-api: internal health check
- complyai-maestro: internal health check
- Review overnight alerts
- Check #alerts Slack channel
- Review CloudWatch alarms
- Check error rates in monitoring
- Verify scheduled jobs ran successfully
- Ad sync (every 15 minutes)
- Score calculation (every 30 minutes)
- Daily reports (6 AM UTC)
- Check external service status
- Meta Platform Status: https://developers.facebook.com/status/
- Stripe Status: https://status.stripe.com/
- AWS Status: https://status.aws.amazon.com/
Escalation: If any critical service is down, trigger incident response (RB-INC-001)
Ad Sync Monitoring​
Purpose: Ensure ad data is syncing from Meta correctly
Check Points:
- Last sync timestamp per organization
SELECT o.name, MAX(oa.last_sync) as last_sync
FROM organizations o
JOIN org_ad_accounts oaa ON o.id = oaa.organization_id
JOIN org_ads oa ON oaa.id = oa.org_ad_account_id
GROUP BY o.name
ORDER BY last_sync DESC;
- Sync failure count (last 24 hours)
SELECT COUNT(*) as failed_syncs
FROM activity_events
WHERE action = 'ad_sync_failed'
AND created_at > NOW() - INTERVAL '24 hours';
- Queue depth for sync tasks
- Check Celery flower dashboard
- Queue `ad_sync` should be < 100 pending
If sync is stale (> 30 min):
- Check Celery workers are running
- Check Meta API rate limits
- Check for token expiration
- Manually trigger sync if needed (see RB-SYNC-001; a quick check-and-trigger sketch follows below)
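A quick way to confirm workers are alive and to enqueue a one-off sync from a shell is sketched below. The broker URL, task name `tasks.sync_organization`, and `ad_sync` queue routing are assumptions; substitute the real broker URL and registered task name.

```python
# check_and_trigger_sync.py - minimal sketch; broker URL and task name are assumptions
from celery import Celery

celery_app = Celery(broker="redis://localhost:6379/0")  # replace with the production broker URL

# Ping workers: returns {worker_name: {"ok": "pong"}} for each responder, or nothing if none reply
replies = celery_app.control.inspect(timeout=5).ping()
if not replies:
    raise SystemExit("No Celery workers responded - check the celery-worker deployment")
print(f"{len(replies)} worker(s) alive: {sorted(replies)}")

# Manually enqueue a sync for one organization on the ad_sync queue
org_id = 1234  # the organization whose ads are stale
result = celery_app.send_task("tasks.sync_organization", args=[org_id], queue="ad_sync")
print(f"Queued sync task {result.id}")
```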
Data Sync Runbooks​
RB-SYNC-001: Manual Ad Sync Trigger​
When to use: Ad sync is stale or customer reports missing ads
Prerequisites:
- SSH access to bastion
- Valid Meta access token for organization
Steps:
- Identify the organization
SELECT id, name, meta_business_id
FROM organizations
WHERE name ILIKE '%customer_name%';
- Trigger sync via API (internal)
curl -X POST https://api-internal.complyai.io/sync/organization/{org_id} \
-H "X-Internal-Key: ${INTERNAL_API_KEY}" -
Monitor sync progress
# Watch Celery logs
kubectl logs -f deployment/celery-worker -n production | grep "{org_id}"
- Verify completion
SELECT COUNT(*) as ads_synced
FROM org_ads
WHERE org_ad_account_id IN (
SELECT id FROM org_ad_accounts WHERE organization_id = {org_id}
)
AND updated_at > NOW() - INTERVAL '5 minutes';
Rollback: N/A (sync is additive)
RB-SYNC-002: Token Refresh​
When to use: Meta API returning 401/expired token errors
Prerequisites:
- Access to AWS Secrets Manager
- Fernet encryption key
Steps:
- Identify affected accounts
SELECT oba.id, oba.business_id, o.name
FROM org_business_accounts oba
JOIN organizations o ON oba.organization_id = o.id
WHERE oba.token_expires_at < NOW() + INTERVAL '7 days';
- For User Access Tokens (60-day):
- User must re-authenticate via OAuth flow
- Send notification to organization admin
INSERT INTO notifications (user_id, type, message, created_at)
SELECT u.id, 'token_expiring', 'Your Meta connection needs refresh', NOW()
FROM users u
JOIN user_organizations uo ON u.id = uo.user_id
WHERE uo.organization_id = {org_id} AND uo.role = 'owner';
- For System User Tokens (long-lived):
- Generate new token in Meta Business Settings
- Encrypt and update in database
from cryptography.fernet import Fernet
# ENCRYPTION_KEY is the Fernet key from AWS Secrets Manager; new_token comes from Meta Business Settings
fernet = Fernet(ENCRYPTION_KEY)
encrypted_token = fernet.encrypt(new_token.encode())
# Store encrypted_token in org_business_accounts.system_user_access_token
- Verify token works
curl "https://graph.facebook.com/v19.0/me?access_token={token}"
RB-SYNC-003: Webhook Resubscription​
When to use: Webhooks not being received, subscription expired
Steps:
- Check current subscriptions
curl "https://graph.facebook.com/v19.0/{app_id}/subscriptions?access_token={app_token}" -
Verify webhook endpoint is accessible
curl -X GET "https://api.complyai.io/webhooks/meta/verify?hub.mode=subscribe&hub.verify_token={verify_token}" -
Resubscribe to webhooks
curl -X POST "https://graph.facebook.com/v19.0/{app_id}/subscriptions" \
-d "object=ad_account" \
-d "callback_url=https://api.complyai.io/webhooks/meta" \
-d "fields=ad_status,ad_creative" \
-d "verify_token={verify_token}" \
-d "access_token={app_token}" -
Test webhook delivery
- Make a test change in Meta Business Manager
- Verify event received in logs
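For reference, Meta verifies a webhook endpoint with a GET request carrying `hub.mode=subscribe`, `hub.verify_token`, and `hub.challenge`; the endpoint must return the challenge value when the token matches. A minimal Flask sketch of that handshake (the production handler lives in the API service; the route and token source here are illustrative):

```python
# Minimal sketch of Meta's webhook verification handshake - illustrative, not the production handler
from flask import Flask, request

app = Flask(__name__)
VERIFY_TOKEN = "replace-with-the-configured-verify-token"

@app.route("/webhooks/meta/verify", methods=["GET"])
def verify_meta_webhook():
    # Meta sends hub.mode, hub.verify_token, and hub.challenge as query parameters
    if (request.args.get("hub.mode") == "subscribe"
            and request.args.get("hub.verify_token") == VERIFY_TOKEN):
        return request.args.get("hub.challenge", ""), 200
    return "Verification failed", 403
```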
Incident Response Runbooks​
RB-INC-001: Service Outage​
Severity: P1/P2
Symptoms:
- Health check failures
- User reports of errors
- Spike in 5xx responses
Immediate Actions:
- Create incident channel: #incident-YYYYMMDD-brief-desc
- Assign Incident Commander (IC)
- Page on-call if not already engaged
Diagnosis Steps:
- Check which services are affected
# ECS service status
aws ecs describe-services --cluster production --services \
complyai-core complyai-api complyai-maestro complyai-frontend
- Check recent deployments
# Last 5 deployments
aws ecs describe-services --cluster production --services {service} --query 'services[].deployments[:5]'
- Check database connectivity
psql $DATABASE_URL -c "SELECT 1;"
- Check external dependencies
- Meta API status
- Stripe API status
- Auth0 status
Recovery Actions:
- If deployment issue: Roll back to previous task definition
aws ecs update-service --cluster production --service {service} \
--task-definition {previous-task-def}
- If database issue: Failover to replica if available
aws rds failover-db-cluster --db-cluster-identifier complyai-prod
- If external dependency: Enable the circuit breaker and serve cached/degraded responses (a minimal sketch follows below)
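In this context, "enable circuit breaker" means: after repeated failures, stop calling the broken dependency for a cooldown window and serve cached or degraded data instead. A minimal illustrative breaker (not the production implementation; `fetch_ads_from_meta` and `fetch_ads_from_cache` are placeholder names):

```python
# Minimal illustrative circuit breaker - placeholder names, not the production implementation
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_after_s=60):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (calls allowed)

    def call(self, func, fallback, *args, **kwargs):
        # While open, short-circuit to the fallback until the cooldown elapses
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                return fallback(*args, **kwargs)
            self.opened_at = None  # half-open: allow one trial call
            self.failures = 0
        try:
            result = func(*args, **kwargs)
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback(*args, **kwargs)

# Usage sketch: serve cached ad data while the Meta API is down
# breaker = CircuitBreaker()
# ads = breaker.call(fetch_ads_from_meta, fetch_ads_from_cache, org_id)
```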
Post-Incident:
- Document timeline in incident channel
- Schedule post-mortem within 48 hours
- Create follow-up tickets for improvements
RB-INC-002: Data Breach Response​
Severity: P1
Symptoms:
- Unauthorized access detected
- Customer reports data exposure
- Security tool alerts
Immediate Actions (First 15 minutes):
- Contain - Revoke compromised credentials immediately
-- Disable affected user
UPDATE users SET is_active = false WHERE id = {user_id};
-- Invalidate sessions
DELETE FROM user_sessions WHERE user_id = {user_id};
- Preserve - Capture evidence before changes
# Snapshot affected database
aws rds create-db-snapshot --db-instance-identifier complyai-prod \
--db-snapshot-identifier incident-$(date +%Y%m%d-%H%M)
# Export relevant logs
aws logs filter-log-events --log-group-name /ecs/complyai-core \
--start-time {epoch_ms} --end-time {epoch_ms} > incident_logs.json
- Notify - Alert leadership immediately
- Page CEO, CTO, Legal
- Do NOT communicate externally yet
Investigation (First 4 hours):
- Determine scope
- Which records accessed?
- Which users affected?
- What data was exposed?
- Determine vector
- How did attacker gain access?
- Credential compromise?
- Application vulnerability?
- Document everything
- Timeline of events
- Actions taken
- Evidence collected
Notification Requirements:
| Trigger | Action | Timeline |
|---|---|---|
| EU user data | Notify supervisory authority | 72 hours |
| CA resident data | Prepare notification | "Expedient" |
| Any PII | Notify affected users | After investigation |
RB-INC-003: Database Performance Degradation​
Severity: P2/P3
Symptoms:
- Slow API response times
- High database CPU
- Lock contention
Diagnosis:
- Check current queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 seconds'
ORDER BY duration DESC;
- Check for locks
SELECT blocked_locks.pid AS blocked_pid,
blocking_locks.pid AS blocking_pid,
blocked_activity.query AS blocked_query
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
WHERE NOT blocked_locks.granted;
- Check table bloat
SELECT relname, n_dead_tup, n_live_tup,
round(n_dead_tup::numeric / nullif(n_live_tup, 0) * 100, 2) as dead_ratio
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC LIMIT 10;
Remediation:
- Kill long-running queries (if safe)
SELECT pg_terminate_backend({pid});
- Run VACUUM if bloat detected
VACUUM ANALYZE {table_name};
- Add missing index (if scan detected)
CREATE INDEX CONCURRENTLY idx_name ON table_name (column);
- Scale up RDS if resource constrained
aws rds modify-db-instance --db-instance-identifier complyai-prod \
--db-instance-class db.r5.xlarge --apply-immediately
Database Operations​
RB-DB-001: Schema Migration​
When to use: Deploying database schema changes
Pre-Migration Checklist:
- Migration tested in staging
- Rollback script prepared
- Backup taken
- Stakeholders notified
- Low-traffic window identified
Steps:
- Take backup before migration
aws rds create-db-snapshot --db-instance-identifier complyai-prod \
--db-snapshot-identifier pre-migration-$(date +%Y%m%d-%H%M)
- Run migration
# Using Alembic (Flask)
alembic upgrade head
# Or using Django
python manage.py migrate
- Verify migration
-- Check new columns/tables exist
\d+ table_name
- Monitor for issues (15-30 minutes)
- Application errors
- Performance degradation
- Data integrity issues
Rollback:
# Alembic rollback
alembic downgrade -1
# Or restore from snapshot (nuclear option)
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier complyai-prod-restored \
--db-snapshot-identifier pre-migration-{timestamp}
RB-DB-002: Data Backfill​
When to use: Populating new columns, fixing data issues
Steps:
- Create backfill script
# backfill_example.py
import psycopg2
from tqdm import tqdm
BATCH_SIZE = 1000
conn = psycopg2.connect(DATABASE_URL)  # DATABASE_URL: connection string for the target database
cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM target_table WHERE new_column IS NULL")
total = cur.fetchone()[0]
for _ in tqdm(range(0, total, BATCH_SIZE)):  # one iteration per batch; rows are selected by the IS NULL filter below
cur.execute("""
UPDATE target_table
SET new_column = computed_value
WHERE id IN (
SELECT id FROM target_table
WHERE new_column IS NULL
LIMIT %s
)
""", (BATCH_SIZE,))
conn.commit()
- Test on staging first
- Run during low-traffic period
- Monitor progress and database load
Best Practices:
- Use batches (1000-5000 rows)
- Include progress logging
- Add sleep between batches if needed
- Run ANALYZE after large backfills
Deployment Runbooks​
RB-DEPLOY-001: Standard Service Deployment​
Steps:
- Merge PR to main branch
- CI/CD automatically:
- Runs tests
- Builds Docker image
- Pushes to ECR
- Updates ECS task definition
- Triggers rolling deployment
- Monitor deployment
aws ecs describe-services --cluster production --services {service} \
--query 'services[].deployments'
- Verify new version
curl https://api.complyai.io/health | jq '.version'
- Monitor for 15 minutes (a CloudWatch sketch follows these steps)
- Error rates
- Response times
- User reports
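One way to watch error rates during the 15-minute window is to pull the ALB 5xx count from CloudWatch. A boto3 sketch under stated assumptions (the region, namespace, and `LoadBalancer` dimension value all depend on the actual infrastructure):

```python
# Sketch: count 5xx responses over the last 15 minutes - region and LoadBalancer value are assumptions
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
end = datetime.now(timezone.utc)
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_Target_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/complyai-prod/abc123"}],  # placeholder value
    StartTime=end - timedelta(minutes=15),
    EndTime=end,
    Period=300,
    Statistics=["Sum"],
)
total_5xx = sum(dp["Sum"] for dp in resp["Datapoints"])
print(f"5xx responses in the last 15 minutes: {int(total_5xx)}")
```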
Rollback:
# Get previous task definition
aws ecs describe-services --cluster production --services {service} \
--query 'services[].taskDefinition'
# Roll back to previous
aws ecs update-service --cluster production --service {service} \
--task-definition {previous-revision}
RB-DEPLOY-002: Database Migration Deployment​
Steps:
- Deploy with migration flag disabled
- Verify application works with old schema
- Run migration (see RB-DB-001)
- Deploy with migration flag enabled (if feature-flagged)
- Monitor
Integration Runbooks​
RB-INT-001: Meta API Rate Limit Recovery​
When to use: Hitting Meta API rate limits (error code 4 or 17)
Symptoms:
- Error: "Application request limit reached"
- Error: "User request limit reached"
Diagnosis:
- Check current rate limit status (see the header-parsing sketch after these steps)
curl "https://graph.facebook.com/v19.0/me?access_token={token}&debug=all"
# Look at the x-business-use-case-usage response header
- Check ComplyAI request logs
SELECT COUNT(*), date_trunc('hour', created_at) as hour
FROM api_request_logs
WHERE service = 'meta'
AND created_at > NOW() - INTERVAL '24 hours'
GROUP BY hour
ORDER BY hour;
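Meta returns its rate-limit usage in JSON-encoded response headers (`x-business-use-case-usage`, `x-app-usage`, `x-ad-account-usage`). A small sketch that makes the same `/me` call as above and pretty-prints whichever usage headers are present (`ACCESS_TOKEN` is a placeholder for the token under investigation):

```python
# Sketch: inspect Meta's rate-limit usage headers - ACCESS_TOKEN is a placeholder
import json

import requests

ACCESS_TOKEN = "replace-with-the-affected-token"
resp = requests.get(
    "https://graph.facebook.com/v19.0/me",
    params={"access_token": ACCESS_TOKEN},
    timeout=30,
)
for header in ("x-business-use-case-usage", "x-app-usage", "x-ad-account-usage"):
    raw = resp.headers.get(header)
    if raw:
        print(header, json.dumps(json.loads(raw), indent=2))
```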
Remediation:
- Enable rate limit backoff (if not already)
- Reduce sync frequency temporarily
- Prioritize high-value accounts
- Wait for rate limit reset (typically 1 hour)
Prevention:
- Implement exponential backoff (see the sketch below)
- Use batch endpoints where possible
- Cache responses where appropriate
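A minimal backoff sketch for Graph API reads, assuming the rate limit surfaces as Graph error codes 4 or 17 in the response body (as in the symptoms above); the helper name and retry budget are illustrative, not the production client:

```python
# Sketch: exponential backoff with jitter for rate-limited Graph API calls (illustrative)
import random
import time

import requests

RATE_LIMIT_CODES = {4, 17}  # application / user request limit reached

def get_with_backoff(url, params, max_retries=5):
    for attempt in range(max_retries):
        resp = requests.get(url, params=params, timeout=30)
        error_code = resp.json().get("error", {}).get("code")
        if error_code in RATE_LIMIT_CODES:
            # Back off roughly 1s, 2s, 4s, 8s, 16s with up to 1s of jitter
            time.sleep((2 ** attempt) + random.random())
            continue
        resp.raise_for_status()  # surface non-rate-limit errors immediately
        return resp.json()
    raise RuntimeError(f"Still rate limited after {max_retries} attempts: {url}")
```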
RB-INT-002: Stripe Webhook Replay​
When to use: Missed or failed Stripe webhooks
Steps:
- Identify missed events in Stripe Dashboard
- Webhooks → Select endpoint → Events
- Get event details
curl https://api.stripe.com/v1/events/{event_id} \
-u sk_live_xxx:
- Replay via Stripe Dashboard
- Click "Resend" on the event
- Or process manually
import stripe
stripe.api_key = STRIPE_SECRET_KEY  # same secret key used above
# Fetch the event and run it through the normal webhook handler
event = stripe.Event.retrieve(event_id)
handle_stripe_webhook(event)
Recovery Runbooks​
RB-REC-001: Full Database Restore​
When to use: Catastrophic data loss, corruption beyond repair
Prerequisites:
- Identify target snapshot/point-in-time
- Notify all stakeholders
- Prepare for potential data loss
Steps:
- Identify recovery point
# List available snapshots
aws rds describe-db-snapshots --db-instance-identifier complyai-prod
# Or check PITR window
aws rds describe-db-instances --db-instance-identifier complyai-prod \
--query 'DBInstances[].LatestRestorableTime'
- Restore to new instance
# From snapshot
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier complyai-prod-restored \
--db-snapshot-identifier {snapshot-id}
# Or point-in-time
aws rds restore-db-instance-to-point-in-time \
--source-db-instance-identifier complyai-prod \
--target-db-instance-identifier complyai-prod-restored \
--restore-time {ISO-timestamp}
- Verify restored data
-- Run data integrity checks
SELECT COUNT(*) FROM users;
SELECT COUNT(*) FROM organizations;
SELECT MAX(created_at) FROM activity_events;
- Update application config to point to restored DB
- Rename instances
# Rename old
aws rds modify-db-instance --db-instance-identifier complyai-prod \
--new-db-instance-identifier complyai-prod-old
# Rename restored
aws rds modify-db-instance --db-instance-identifier complyai-prod-restored \
--new-db-instance-identifier complyai-prod
- Verify application connectivity
- Document data loss window
RB-REC-002: Service Recovery from Scratch​
When to use: Need to rebuild a service completely
Steps:
- Ensure Docker image is available in ECR
- Create new task definition (or use existing)
- Create/update ECS service
aws ecs create-service \
--cluster production \
--service-name {service} \
--task-definition {task-def} \
--desired-count 2 \
--launch-type FARGATE \
--network-configuration "awsvpcConfiguration={subnets=[...],securityGroups=[...]}"
- Update load balancer target group
- Verify health checks passing
- Update DNS if needed
Runbook Index​
| ID | Name | Severity | Category |
|---|---|---|---|
| RB-SYNC-001 | Manual Ad Sync Trigger | P3 | Data Sync |
| RB-SYNC-002 | Token Refresh | P2 | Data Sync |
| RB-SYNC-003 | Webhook Resubscription | P2 | Data Sync |
| RB-INC-001 | Service Outage | P1/P2 | Incident |
| RB-INC-002 | Data Breach Response | P1 | Incident |
| RB-INC-003 | Database Performance | P2/P3 | Incident |
| RB-DB-001 | Schema Migration | P3 | Database |
| RB-DB-002 | Data Backfill | P3 | Database |
| RB-DEPLOY-001 | Standard Deployment | P3 | Deployment |
| RB-DEPLOY-002 | Migration Deployment | P2 | Deployment |
| RB-INT-001 | Meta Rate Limit | P2 | Integration |
| RB-INT-002 | Stripe Webhook Replay | P3 | Integration |
| RB-REC-001 | Full Database Restore | P1 | Recovery |
| RB-REC-002 | Service Recovery | P1 | Recovery |
Related Documents​
- Service Architecture - System design
- Third-Party Integrations - External APIs
- Data Governance - Incident response policies
- Troubleshooting - Common issue resolution
Last Updated: December 2024