Troubleshooting Guide

Common issues and their solutions for the ComplyAI platform

📋 Table of Contents

User-Reported Issues
Data Sync Issues
Authentication Issues
API Errors
Database Issues
External Integration Issues
Performance Issues
Deployment Issues

User-Reported Issues

"I can't see my ads"

Possible Causes & Solutions:

Cause	How to Check	Solution
Ad account not synced	Check `org_ad_accounts.last_sync`	Trigger manual sync
Sync failed	Check `activity_events` for errors	Review error, fix token if needed
Wrong organization	Check user's organization membership	Verify user_organizations table
Ad account disconnected	Check `org_ad_accounts.is_connected`	User needs to reconnect via OAuth
Recent ads (< 15 min)	Check ad creation time in Meta	Wait for next sync cycle

Diagnostic Query:

SELECT 
    oaa.id,
    oaa.name,
    oaa.is_connected,
    oaa.last_sync,
    COUNT(oa.id) as ad_count
FROM org_ad_accounts oaa
LEFT JOIN org_ads oa ON oaa.id = oa.org_ad_account_id
WHERE oaa.organization_id = {org_id}
GROUP BY oaa.id, oaa.name, oaa.is_connected, oaa.last_sync;
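
The rows returned by the diagnostic query can be mapped back to the causes in the table above. The following is a minimal sketch, not the platform's actual logic: the function name `likely_cause` is hypothetical, and the 30-minute staleness threshold is an assumption borrowed from the "Ads Not Syncing" symptoms.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a sync older than 30 minutes counts as stale.
STALE_AFTER = timedelta(minutes=30)


def likely_cause(is_connected, last_sync, ad_count, now=None):
    """Map one diagnostic-query row to the most likely cause to check first."""
    now = now or datetime.now(timezone.utc)
    if not is_connected:
        return "Ad account disconnected: user needs to reconnect via OAuth"
    if last_sync is None:
        return "Ad account not synced: trigger manual sync"
    if now - last_sync > STALE_AFTER:
        return "Sync stale or failed: check activity_events for errors"
    if ad_count == 0:
        return "Synced but no ads: ads may be too recent, wait for next sync cycle"
    return "Account looks healthy: check the user's organization membership"
```

The checks run from the most decisive signal (a disconnected account) to the least, so the first match is the one worth investigating first.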

"My score is wrong"

Possible Causes & Solutions:

Cause	How to Check	Solution
Score not yet calculated	Check `org_ads_score` exists	Wait for scoring cycle (30 min)
Stale score	Check `org_ads_score.updated_at`	Trigger re-scoring
Ad content changed	Compare ad hash with scored version	Re-sync and re-score
Model version mismatch	Check score metadata	Verify model version

Diagnostic Query:

SELECT 
    oa.id,
    oa.name,
    oa.updated_at as ad_updated,
    oas.overall_score,
    oas.text_score,
    oas.media_score,
    oas.updated_at as score_updated
FROM org_ads oa
LEFT JOIN org_ads_score oas ON oa.id = oas.org_ad_id
WHERE oa.id = {ad_id};
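
One way to read the two timestamps returned by this query is sketched below. The function name and the three state labels are hypothetical; the only assumption about the data is that `score_updated` is NULL (here `None`) when the LEFT JOIN finds no score row.

```python
from datetime import datetime, timezone


def score_state(ad_updated, score_updated):
    """Classify a row from the diagnostic query by comparing its timestamps."""
    if score_updated is None:
        return "NOT_SCORED"  # no org_ads_score row yet: wait for the scoring cycle
    if ad_updated > score_updated:
        return "STALE"       # ad changed after it was scored: trigger re-scoring
    return "CURRENT"         # score is newer than the ad: check model version next
```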

"I can't connect my Meta account"

Possible Causes & Solutions:

Cause	How to Check	Solution
OAuth popup blocked	Browser settings	Enable popups for complyai.io
Missing Meta permissions	Check OAuth scope	User must grant all requested permissions
Meta account restricted	Meta Business Settings	User must resolve in Meta
Previous connection exists	Check org_business_accounts	Disconnect old connection first
Auth0 session expired	Check user session	User should log out and back in

OAuth Flow Diagram:

User clicks Connect → Auth0 redirects → Meta OAuth → 
User grants permissions → Redirect back → Token stored

"Notifications not working"

Possible Causes & Solutions:

Cause	How to Check	Solution
Email notifications disabled	Check user preferences	Enable in Settings
In-app notifications disabled	Check notification settings	Enable in Settings
Email in spam	Check spam folder	Whitelist @complyai.io
Notification service down	Check Triangle service health	Restart service if needed

Diagnostic Query:

SELECT 
    n.id,
    n.type,
    n.status,
    n.created_at,
    n.sent_at
FROM notifications n
WHERE n.user_id = {user_id}
ORDER BY n.created_at DESC
LIMIT 20;

Data Sync Issues

Ads Not Syncing

Symptoms:

org_ad_accounts.last_sync is stale (> 30 min old)
New ads in Meta not appearing in ComplyAI

Diagnostic Steps:

Check Celery worker status

# Which tasks are workers executing right now? (no reply means no workers are reachable)
celery -A complyai inspect active

Check queue depth

# Tasks prefetched by workers but not yet started
celery -A complyai inspect reserved

Check for sync errors

SELECT * FROM activity_events
WHERE action IN ('ad_sync_started', 'ad_sync_failed', 'ad_sync_completed')
AND created_at > NOW() - INTERVAL '1 hour'
ORDER BY created_at DESC;

Check Meta API token validity

curl "https://graph.facebook.com/v19.0/me?access_token={token}"

Common Fixes:

Restart Celery workers: kubectl rollout restart deployment/celery-worker
Refresh token: See RB-SYNC-002 in Runbooks
Clear stuck tasks: celery -A complyai purge (this discards every waiting task in the queue, not only the stuck ones)

Webhook Events Not Processing

Symptoms:

Meta shows webhook delivered
Events not appearing in database
Ad status changes not reflected

Diagnostic Steps:

Verify webhook endpoint is reachable

curl -X POST https://api.complyai.io/webhooks/meta/test

Check webhook logs

SELECT * FROM webhook_events
WHERE source = 'meta'
ORDER BY received_at DESC
LIMIT 20;

Verify signature validation
- Check HMAC signature matches
- Verify app secret is correct
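
The signature check can be reproduced by hand when debugging. Meta signs each webhook payload with HMAC-SHA256 using the app secret and sends the result in the `X-Hub-Signature-256` header as `sha256=<hex digest>`. The sketch below assumes that header format; the function name is hypothetical and this is not necessarily how the ComplyAI endpoint implements it.

```python
import hashlib
import hmac


def is_valid_signature(app_secret: str, raw_body: bytes, header_value: str) -> bool:
    """Check an X-Hub-Signature-256 header value against the raw request body."""
    scheme, _, received = header_value.partition("=")
    if scheme != "sha256" or not received:
        return False
    expected = hmac.new(app_secret.encode(), raw_body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), received.encode())
```

The digest must be computed over the raw bytes exactly as received. Re-serializing parsed JSON can change whitespace or key order and make a valid signature fail.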

Common Fixes:

Resubscribe webhooks: See RB-SYNC-003 in Runbooks
Check firewall/ALB allows Meta IPs
Verify webhook secret in environment

Authentication Issues

User Can't Log In

Symptoms:

Login page shows error
User redirected back to login
"Invalid credentials" message

Diagnostic Steps:

Check Auth0 logs
- Login to Auth0 Dashboard
- View Logs for user email

Check user status

SELECT id, email, is_active, auth0_user_id, created_at
FROM users
WHERE email = '{email}';

Check if user exists in Auth0
- Auth0 Dashboard → Users → Search

Common Fixes:

Issue	Solution
User disabled	Re-enable in Auth0
Password expired	User resets password
MFA issue	Reset MFA in Auth0
User not in database	Sync from Auth0 or re-register
Auth0 rules blocking	Check Auth0 rules

Token Expired

Symptoms:

API returns 401
"Token expired" error
User forced to re-login

For User JWT Tokens:

Normal behavior - user re-authenticates
Check Auth0 token lifetimes if re-authentication is required too frequently

For Meta Access Tokens:

User tokens: long-lived tokens expire after about 60 days, then the user must re-auth
System user tokens: Should not expire, regenerate if needed

Diagnostic Query:

SELECT 
    oba.id,
    oba.business_id,
    oba.token_expires_at,
    CASE 
        WHEN oba.token_expires_at < NOW() THEN 'EXPIRED'
        WHEN oba.token_expires_at < NOW() + INTERVAL '7 days' THEN 'EXPIRING SOON'
        ELSE 'OK'
    END as status
FROM org_business_accounts oba
WHERE oba.organization_id = {org_id};

API Errors

Error Code Reference

Code	Meaning	Common Cause	Solution
400	Bad Request	Invalid parameters	Check request body/params
401	Unauthorized	Invalid/expired token	Re-authenticate
403	Forbidden	Insufficient permissions	Check user role
404	Not Found	Resource doesn't exist	Verify ID/path
409	Conflict	Duplicate resource	Check for existing record
422	Unprocessable	Validation failed	Check field requirements
429	Rate Limited	Too many requests	Implement backoff
500	Server Error	Application error	Check logs
502	Bad Gateway	Service unavailable	Check service health
503	Service Unavailable	Overloaded/maintenance	Retry later
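
"Implement backoff" for 429 responses usually means exponential backoff with jitter. The following is a minimal client-side sketch under stated assumptions: `send` is a hypothetical zero-argument callable that performs the request and returns the HTTP status code, and the set of retryable statuses and the delay parameters are illustrative choices, not ComplyAI defaults.

```python
import random
import time

# Assumption: these statuses from the table are worth retrying.
RETRYABLE_STATUSES = {429, 502, 503}


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Full-jitter delay: a random value up to min(cap, base * 2**attempt) seconds."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_backoff(send, max_attempts=5, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out."""
    for attempt in range(max_attempts):
        status = send()
        if status not in RETRYABLE_STATUSES:
            return status
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt))
    return status  # still retryable after the last attempt
```

Jitter spreads retries from many clients over time, so they do not all hit the API again at the same instant.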

Debugging API Requests

Request Tracing:

# Add request ID header for tracing
REQ_ID="debug-$(date +%s)"
curl -H "X-Request-ID: $REQ_ID" \
     -H "Authorization: Bearer {token}" \
     https://api.complyai.io/endpoint

# Find the same ID in logs
grep "$REQ_ID" /var/log/complyai/*.log

Common API Issues:

Issue	Symptom	Solution
Missing auth header	401 on all requests	Add `Authorization: Bearer {token}`
Wrong content type	400 or 415	Set `Content-Type: application/json`
Invalid JSON	400	Validate JSON syntax
Missing required field	422	Check API docs for required fields

Database Issues

Connection Pool Exhausted

Symptoms:

"Connection refused" errors
Timeouts on database operations
Application hangs

Diagnostic Steps:

Check active connections

SELECT count(*) FROM pg_stat_activity;

-- By application
SELECT application_name, count(*) 
FROM pg_stat_activity 
GROUP BY application_name;

Check for idle connections

SELECT pid, usename, application_name, state, query_start
FROM pg_stat_activity
WHERE state = 'idle'
ORDER BY query_start;

Check PgBouncer stats (if used)

psql -h pgbouncer -p 6432 pgbouncer -c "SHOW POOLS;"

Common Fixes:

Kill idle connections: SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND pid <> pg_backend_pid();
Increase pool size (with caution)
Check for connection leaks in code
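
Connection leaks typically come from code paths that raise before the connection is handed back. A common guard is a context manager that releases in a `finally` block. In the sketch below, `FakePool` is a stand-in written only for illustration; in real code its role would be played by whatever pool the application's database driver provides.

```python
from contextlib import contextmanager


class FakePool:
    """Stand-in for a real connection pool; tracks checked-out connections."""

    def __init__(self, size):
        self.free = list(range(size))
        self.in_use = set()

    def acquire(self):
        if not self.free:
            raise RuntimeError("connection pool exhausted")
        conn = self.free.pop()
        self.in_use.add(conn)
        return conn

    def release(self, conn):
        self.in_use.discard(conn)
        self.free.append(conn)


@contextmanager
def pooled_connection(pool):
    """Hand out a connection and return it to the pool even if the caller raises."""
    conn = pool.acquire()
    try:
        yield conn
    finally:
        pool.release(conn)
```

Code that calls `acquire()` directly and releases only on the success path is the pattern to look for when auditing for leaks.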
Restart application pods

Slow Queries

Symptoms:

High latency on specific endpoints
Database CPU spikes
Timeout errors

Diagnostic Steps:

Find slow queries

SELECT pid, now() - query_start as duration, query
FROM pg_stat_activity
WHERE state = 'active'
AND query NOT LIKE '%pg_stat_activity%'
ORDER BY duration DESC;

Check query plan
```
EXPLAIN ANALYZE {slow_query};
```

Check for missing indexes

SELECT relname, seq_scan, idx_scan
FROM pg_stat_user_tables
WHERE seq_scan > idx_scan
ORDER BY seq_scan DESC;

Common Fixes:

Add missing index: CREATE INDEX CONCURRENTLY idx_name ON table(column);
Update statistics: ANALYZE table_name;
Rewrite inefficient query
Add query timeout

Deadlocks

Symptoms:

"deadlock detected" errors
Transactions rolling back
Intermittent failures on writes

Diagnostic Steps:

Check for locks

SELECT blocked_locks.pid AS blocked_pid,
       blocking_locks.pid AS blocking_pid,
       blocked_activity.query AS blocked_query,
       blocking_activity.query AS blocking_query
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_locks blocking_locks 
  ON blocking_locks.locktype = blocked_locks.locktype
  AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
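  AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
  AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
  AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted
  AND blocking_locks.granted;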

Common Fixes:

Ensure consistent ordering of table access
Reduce transaction duration
Use SELECT ... FOR UPDATE SKIP LOCKED where workers compete for the same rows
Implement retry logic in application
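
Retry logic for deadlocks should re-run the whole transaction, and only when the failure really was a deadlock. PostgreSQL reports deadlocks with SQLSTATE `40P01`. The sketch below assumes the driver exception exposes that code as a `pgcode` attribute, as psycopg2 does; `txn` is a hypothetical callable that opens, runs, and commits one transaction.

```python
import time

DEADLOCK_SQLSTATE = "40P01"  # PostgreSQL SQLSTATE for deadlock_detected


def is_deadlock(exc):
    """True if the exception carries the deadlock SQLSTATE in `pgcode`."""
    return getattr(exc, "pgcode", None) == DEADLOCK_SQLSTATE


def run_with_deadlock_retry(txn, max_attempts=3, sleep=time.sleep):
    """Run txn() and re-run it from the start if it was aborted by a deadlock."""
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except Exception as exc:
            # Anything other than a deadlock, or the final failure, propagates.
            if not is_deadlock(exc) or attempt == max_attempts:
                raise
            sleep(0.05 * attempt)  # short, growing pause before the next try
```

Other errors, such as unique violations, are re-raised immediately because retrying them would fail the same way.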

External Integration Issues

Meta API Errors

Error Code	Meaning	Solution
4	Application rate limit	Reduce request frequency, implement backoff
17	User rate limit	Wait for reset (1 hour)
100	Invalid parameter	Check API parameters
190	Access token expired	Refresh token
200	Permission denied	User must grant permission
278	Temporary issue	Retry after delay

Diagnostic Steps:

Check rate limit status

curl -i "https://graph.facebook.com/v19.0/me?access_token={token}&debug=all"
# -i prints the response headers; check x-business-use-case-usage
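
The `x-business-use-case-usage` header value is a JSON object keyed by business ID, with a list of usage entries per key. The sketch below flags entries that are close to their limit. The field names (`type`, `call_count`, `total_cputime`, `total_time`) follow Meta's documented format as best I understand it, so verify them against a live response; the 80 percent threshold and the function name are assumptions.

```python
import json


def throttled_entries(header_value, threshold=80):
    """Return (business_id, type, peak_percent) for entries at or above threshold."""
    usage = json.loads(header_value)
    hot = []
    for business_id, entries in usage.items():
        for entry in entries:
            # Each metric is a percentage of the allowed budget; take the worst one.
            peak = max(
                entry.get("call_count", 0),
                entry.get("total_cputime", 0),
                entry.get("total_time", 0),
            )
            if peak >= threshold:
                hot.append((business_id, entry.get("type"), peak))
    return hot
```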

Validate token

curl "https://graph.facebook.com/debug_token?input_token={token}&access_token={app_token}"
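
The `debug_token` response wraps its fields in a `data` object. The following sketch summarizes the fields that matter most for troubleshooting. It assumes the response carries `is_valid`, `expires_at` (a Unix timestamp, with 0 meaning no expiry), and an `error.message` on invalid tokens; confirm these against an actual response before relying on them. The function name is hypothetical.

```python
from datetime import datetime, timezone


def summarize_token(payload, now=None):
    """Turn a parsed debug_token response into a one-line status."""
    data = payload.get("data", {})
    now = now or datetime.now(timezone.utc)
    if not data.get("is_valid", False):
        message = data.get("error", {}).get("message", "unknown reason")
        return f"INVALID: {message}"
    expires_at = data.get("expires_at", 0)
    if expires_at == 0:
        return "VALID: no expiry reported"
    remaining = datetime.fromtimestamp(expires_at, tz=timezone.utc) - now
    return f"VALID: expires in {remaining.days} days"
```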

Stripe Errors

Error	Meaning	Solution
card_declined	Payment failed	Customer updates payment
expired_card	Card expired	Customer updates card
invalid_request_error	Bad API call	Check request parameters
authentication_error	Bad API key	Verify Stripe key
rate_limit_error	Too many requests	Implement backoff

Performance Issues

High Latency

Symptoms:

Slow page loads
API response times > 500ms
User complaints about speed

Diagnostic Steps:

Check which service is slow
- Review CloudWatch latency metrics by service
- Check individual service health endpoints

Check database performance

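-- Average time per call (requires the pg_stat_statements extension)
-- On PostgreSQL 12 and earlier the column is total_time instead of total_exec_time
SELECT query, calls, total_exec_time/calls as avg_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC LIMIT 20;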

Check external API latency
- Review third-party status pages
- Check timeout configurations

Common Causes & Fixes:

Cause	Indicator	Fix
Database	High DB latency	Add indexes, optimize queries
External API	High Meta/Stripe latency	Add caching, increase timeouts
Application	CPU-bound	Scale horizontally, optimize code
Network	High latency between services	Check VPC configuration

Memory Issues

Symptoms:

OOM (Out of Memory) kills
Service restarts
Increasing memory usage over time

Diagnostic Steps:

Check container memory usage

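# ECS (a time range, period, and statistic are required)
aws cloudwatch get-metric-statistics --namespace AWS/ECS \
  --metric-name MemoryUtilization \
  --dimensions Name=ClusterName,Value=production Name=ServiceName,Value={service} \
  --start-time {start_time} --end-time {end_time} \
  --period 300 --statistics Average Maximum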

Check for memory leaks
- Monitor memory over time
- Check for growing object counts

Common Fixes:

Increase container memory limits
Fix memory leaks in code
Implement proper connection cleanup
Add request timeouts

Celery Queue Backlog

Symptoms:

Tasks queuing up
Delayed processing
Flower shows large queue

Diagnostic Steps:

Check queue depth

celery -A complyai inspect active
celery -A complyai inspect reserved

Check worker status
```
celery -A complyai inspect ping
```

Check for failed tasks
- Review failed tasks in Flower or in the worker logs
- `celery inspect` has no `failed` subcommand, so it cannot list them

Common Fixes:

Scale up workers
Increase worker concurrency
Clear stuck tasks: celery -A complyai purge (this discards every waiting task in the queue, not only the stuck ones)
Fix failing tasks blocking queue

Deployment Issues

Deployment Failing

Symptoms:

ECS deployment stuck
Health checks failing
Rollback triggered

Diagnostic Steps:

Check deployment status

aws ecs describe-services --cluster production --services {service}

Check task failures

aws ecs describe-tasks --cluster production --tasks {task_arn}

Check container logs

aws logs get-log-events --log-group-name /ecs/{service} \
  --log-stream-name {stream}

Common Causes:

Issue	Symptom	Fix
Health check failing	Tasks start then stop	Fix health endpoint, check port
Missing env var	Application crash on start	Add to task definition
Resource constraints	Task won't start	Increase CPU/memory
Image pull failed	ECR auth error	Refresh ECR credentials

Rollback Procedure

Identify last working task definition

# Returns the latest registered revision; the last working one is usually an earlier revision
aws ecs describe-task-definition --task-definition {service} \
  --query 'taskDefinition.revision'

Deploy previous version

aws ecs update-service --cluster production --service {service} \
  --task-definition {service}:{previous_revision}

Monitor rollback

aws ecs wait services-stable --cluster production --services {service}

Quick Troubleshooting Checklist

General Investigation Order

Is it down for everyone or just one user?
- Check status page
- Try different user/account
What changed recently?
- Recent deployments
- Configuration changes
- External service issues
Check the obvious first
- Service health
- Database connectivity
- External dependencies
Follow the data
- Trace request through services
- Check each layer (UI → API → DB → External)
Check logs
- Application logs
- Error rates
- Recent errors

Escalation Path

Level	When	Who
L1	First responder, common issues	On-call engineer
L2	Complex issues, database	Senior engineer
L3	Architecture issues, major incidents	Engineering lead
Executive	Customer-impacting, data breach	CTO/CEO

Related Documents

Runbooks - Detailed procedures
Data Governance - Incident response policies
Service Architecture - System design
Quick Reference - Status codes and lookups

Last Updated: December 2024
