Troubleshooting Guide
Common issues and their solutions for the ComplyAI platform
📋 Table of Contents​
- User-Reported Issues
- Data Sync Issues
- Authentication Issues
- API Errors
- Database Issues
- External Integration Issues
- Performance Issues
- Deployment Issues
User-Reported Issues​
"I can't see my ads"​
Possible Causes & Solutions:
| Cause | How to Check | Solution |
|---|---|---|
| Ad account not synced | Check org_ad_accounts.last_sync | Trigger manual sync |
| Sync failed | Check activity_events for errors | Review error, fix token if needed |
| Wrong organization | Check user's organization membership | Verify user_organizations table |
| Ad account disconnected | Check org_ad_accounts.is_connected | User needs to reconnect via OAuth |
| Recent ads (< 15 min) | Check ad creation time in Meta | Wait for next sync cycle |
Diagnostic Query:
SELECT
oaa.id,
oaa.name,
oaa.is_connected,
oaa.last_sync,
COUNT(oa.id) as ad_count
FROM org_ad_accounts oaa
LEFT JOIN org_ads oa ON oaa.id = oa.org_ad_account_id
WHERE oaa.organization_id = {org_id}
GROUP BY oaa.id, oaa.name, oaa.is_connected, oaa.last_sync;
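A minimal sketch of how one row of the diagnostic query might be triaged against the causes in the table above. The function name, the return strings, and the 30-minute staleness threshold are illustrative assumptions, not part of the ComplyAI codebase.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: 30 minutes is used as the "stale" threshold for illustration.
STALE_AFTER = timedelta(minutes=30)


def diagnose_ad_account(is_connected: bool, last_sync: Optional[datetime],
                        ad_count: int, now: datetime) -> str:
    """Map one diagnostic-query row to the most likely cause."""
    if not is_connected:
        return "disconnected: user must reconnect via OAuth"
    if last_sync is None:
        return "never synced: trigger manual sync"
    if now - last_sync > STALE_AFTER:
        return "stale sync: check activity_events for errors, then re-sync"
    if ad_count == 0:
        return "synced but no ads: check organization membership or wait for next cycle"
    return "ok"
```

The checks run from the most to the least disruptive cause, so a disconnected account is reported before a merely stale one.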
"My score is wrong"​
Possible Causes & Solutions:
| Cause | How to Check | Solution |
|---|---|---|
| Score not yet calculated | Check org_ads_score exists | Wait for scoring cycle (30 min) |
| Stale score | Check org_ads_score.updated_at | Trigger re-scoring |
| Ad content changed | Compare ad hash with scored version | Re-sync and re-score |
| Model version mismatch | Check score metadata | Verify model version |
Diagnostic Query:
SELECT
oa.id,
oa.name,
oa.updated_at as ad_updated,
oas.overall_score,
oas.text_score,
oas.media_score,
oas.updated_at as score_updated
FROM org_ads oa
LEFT JOIN org_ads_score oas ON oa.id = oas.org_ad_id
WHERE oa.id = {ad_id};
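The two timestamps returned by this query are enough to tell a missing score from a stale one. The sketch below assumes they have been loaded as Python datetimes; `score_state` is a hypothetical helper name.

```python
from datetime import datetime
from typing import Optional


def score_state(ad_updated: datetime, score_updated: Optional[datetime]) -> str:
    """Classify a row using the ad_updated and score_updated columns."""
    if score_updated is None:
        # The LEFT JOIN found no org_ads_score row for this ad.
        return "missing"
    if score_updated < ad_updated:
        # The ad changed after it was last scored.
        return "stale"
    return "current"
```

A "missing" result points at the scoring cycle not having run yet, while "stale" suggests triggering a re-score.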
"I can't connect my Meta account"​
Possible Causes & Solutions:
| Cause | How to Check | Solution |
|---|---|---|
| OAuth popup blocked | Browser settings | Enable popups for complyai.io |
| Missing Meta permissions | Check OAuth scope | User must grant all requested permissions |
| Meta account restricted | Meta Business Settings | User must resolve in Meta |
| Previous connection exists | Check org_business_accounts | Disconnect old connection first |
| Auth0 session expired | Check user session | User should log out and back in |
OAuth Flow Diagram:
User clicks Connect → Auth0 redirects → Meta OAuth →
User grants permissions → Redirect back → Token stored
"Notifications not working"​
Possible Causes & Solutions:
| Cause | How to Check | Solution |
|---|---|---|
| Email notifications disabled | Check user preferences | Enable in Settings |
| In-app notifications disabled | Check notification settings | Enable in Settings |
| Email in spam | Check spam folder | Whitelist @complyai.io |
| Notification service down | Check Triangle service health | Restart service if needed |
Diagnostic Query:
SELECT
n.id,
n.type,
n.status,
n.created_at,
n.sent_at
FROM notifications n
WHERE n.user_id = {user_id}
ORDER BY n.created_at DESC
LIMIT 20;
Data Sync Issues​
Ads Not Syncing​
Symptoms:
- org_ad_accounts.last_sync is stale (> 30 min old)
- New ads in Meta not appearing in ComplyAI
Diagnostic Steps:
- Check Celery worker status
# Are workers running, and what are they executing?
celery -A complyai inspect active
- Check queue depth
# Tasks already reserved (prefetched) by workers but not yet started
celery -A complyai inspect reserved
- Check for sync errors
SELECT * FROM activity_events
WHERE action IN ('ad_sync_started', 'ad_sync_failed', 'ad_sync_completed')
AND created_at > NOW() - INTERVAL '1 hour'
ORDER BY created_at DESC;
- Check Meta API token validity
curl "https://graph.facebook.com/v19.0/me?access_token={token}"
Common Fixes:
- Restart Celery workers:
kubectl rollout restart deployment/celery-worker
- Refresh token: See RB-SYNC-002 in Runbooks
- Clear stuck tasks (this discards every queued task, not only the stuck ones):
celery -A complyai purge
Webhook Events Not Processing​
Symptoms:
- Meta shows webhook delivered
- Events not appearing in database
- Ad status changes not reflected
Diagnostic Steps:
- Verify webhook endpoint is reachable
curl -X POST https://api.complyai.io/webhooks/meta/test
- Check webhook logs
SELECT * FROM webhook_events
WHERE source = 'meta'
ORDER BY received_at DESC
LIMIT 20;
- Verify signature validation
  - Check HMAC signature matches
  - Verify app secret is correct
Common Fixes:
- Resubscribe webhooks: See RB-SYNC-003 in Runbooks
- Check firewall/ALB allows Meta IPs
- Verify webhook secret in environment
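To check the HMAC step by hand, the signature can be recomputed from the raw request body. Meta sends it in the X-Hub-Signature-256 header as `sha256=<hex digest>`, keyed with the app secret. The sketch below is one way to verify it; `verify_meta_signature` is a hypothetical name, and how ComplyAI's handler reads the header and body is not shown in this guide.

```python
import hashlib
import hmac


def verify_meta_signature(app_secret: str, payload: bytes, signature_header: str) -> bool:
    """Return True if the header matches the HMAC-SHA256 of the raw body."""
    prefix = "sha256="
    if not signature_header.startswith(prefix):
        return False
    expected = hmac.new(app_secret.encode(), payload, hashlib.sha256).hexdigest()
    received = signature_header[len(prefix):]
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(expected.encode(), received.encode())
```

The digest must be computed over the exact bytes received. Re-serializing parsed JSON before hashing is a common cause of mismatches.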
Authentication Issues​
User Can't Log In​
Symptoms:
- Login page shows error
- User redirected back to login
- "Invalid credentials" message
Diagnostic Steps:
- Check Auth0 logs
  - Log in to Auth0 Dashboard
  - View Logs for user email
- Check user status
SELECT id, email, is_active, auth0_user_id, created_at
FROM users
WHERE email = '{email}';
- Check if user exists in Auth0
  - Auth0 Dashboard → Users → Search
Common Fixes:
| Issue | Solution |
|---|---|
| User disabled | Re-enable in Auth0 |
| Password expired | User resets password |
| MFA issue | Reset MFA in Auth0 |
| User not in database | Sync from Auth0 or re-register |
| Auth0 rules blocking | Check Auth0 rules |
Token Expired​
Symptoms:
- API returns 401
- "Token expired" error
- User forced to re-login
For User JWT Tokens:
- Normal behavior - user re-authenticates
- Check Auth0 token lifetimes if too frequent
For Meta Access Tokens:
- User tokens: 60-day expiry, user must re-auth
- System user tokens: Should not expire, regenerate if needed
Diagnostic Query:
SELECT
oba.id,
oba.business_id,
oba.token_expires_at,
CASE
WHEN oba.token_expires_at < NOW() THEN 'EXPIRED'
WHEN oba.token_expires_at < NOW() + INTERVAL '7 days' THEN 'EXPIRING SOON'
ELSE 'OK'
END as status
FROM org_business_accounts oba
WHERE oba.organization_id = {org_id};
API Errors​
Error Code Reference​
| Code | Meaning | Common Cause | Solution |
|---|---|---|---|
| 400 | Bad Request | Invalid parameters | Check request body/params |
| 401 | Unauthorized | Invalid/expired token | Re-authenticate |
| 403 | Forbidden | Insufficient permissions | Check user role |
| 404 | Not Found | Resource doesn't exist | Verify ID/path |
| 409 | Conflict | Duplicate resource | Check for existing record |
| 422 | Unprocessable | Validation failed | Check field requirements |
| 429 | Rate Limited | Too many requests | Implement backoff |
| 500 | Server Error | Application error | Check logs |
| 502 | Bad Gateway | Service unavailable | Check service health |
| 503 | Service Unavailable | Overloaded/maintenance | Retry later |
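For the 429 and 503 rows, "implement backoff" and "retry later" usually mean exponential backoff with jitter on the client side. The sketch below is a generic illustration: `send` stands for any callable returning `(status, body)`, and the retryable set, base delay, and cap are assumptions rather than ComplyAI settings.

```python
import random
import time
from typing import Callable, Tuple

RETRYABLE = {429, 502, 503}  # assumption: statuses worth retrying


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


def call_with_backoff(send: Callable[[], Tuple[int, str]], max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> Tuple[int, str]:
    """Call send() until it returns a non-retryable status or attempts run out."""
    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE or attempt == max_attempts - 1:
            return status, body
        sleep(backoff_delay(attempt))
    raise ValueError("max_attempts must be >= 1")
```

Client errors such as 400, 401, 403, 404, 409, and 422 are deliberately not retried, since repeating the same request will not change the outcome.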
Debugging API Requests​
Request Tracing:
# Add request ID header for tracing (kept in a variable so the same ID can be searched for)
REQUEST_ID="debug-$(date +%s)"
curl -H "X-Request-ID: $REQUEST_ID" \
-H "Authorization: Bearer {token}" \
https://api.complyai.io/endpoint
# Find in logs
grep "$REQUEST_ID" /var/log/complyai/*.log
Common API Issues:
| Issue | Symptom | Solution |
|---|---|---|
| Missing auth header | 401 on all requests | Add Authorization: Bearer {token} |
| Wrong content type | 400 or 415 | Set Content-Type: application/json |
| Invalid JSON | 400 | Validate JSON syntax |
| Missing required field | 422 | Check API docs for required fields |
Database Issues​
Connection Pool Exhausted​
Symptoms:
- "Connection refused" errors
- Timeouts on database operations
- Application hangs
Diagnostic Steps:
- Check active connections
SELECT count(*) FROM pg_stat_activity;
-- By application
SELECT application_name, count(*)
FROM pg_stat_activity
GROUP BY application_name;
- Check for idle connections
SELECT pid, usename, application_name, state, query_start
FROM pg_stat_activity
WHERE state = 'idle'
ORDER BY query_start;
- Check PgBouncer stats (if used)
psql -h pgbouncer -p 6432 pgbouncer -c "SHOW POOLS;"
Common Fixes:
- Kill idle connections:
SELECT pg_terminate_backend(pid) FROM pg_stat_activity
WHERE state = 'idle' AND pid <> pg_backend_pid();
- Increase pool size (with caution)
- Check for connection leaks in code
- Restart application pods
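Connection leaks typically come from code paths that skip cleanup when an exception is raised. A context manager makes the release unconditional. The sketch below uses the standard-library sqlite3 module purely as a stand-in for the PostgreSQL driver, and `borrowed_connection` is a hypothetical name. With a real pool, the `finally` block would return the connection to the pool instead of closing it.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def borrowed_connection(connect=lambda: sqlite3.connect(":memory:")):
    """Yield a connection and always release it, even when the body raises."""
    conn = connect()
    try:
        yield conn
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        # Stand-in for "return to pool"; runs on success and on failure.
        conn.close()
```

Usage is `with borrowed_connection() as conn: ...`. Any code that obtains a connection without a matching release in a `finally` block or context manager is a candidate leak.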
Slow Queries​
Symptoms:
- High latency on specific endpoints
- Database CPU spikes
- Timeout errors
Diagnostic Steps:
- Find slow queries
SELECT pid, now() - query_start as duration, query
FROM pg_stat_activity
WHERE state = 'active'
AND query NOT LIKE '%pg_stat_activity%'
ORDER BY duration DESC;
- Check query plan (EXPLAIN ANALYZE executes the query, so wrap data-modifying statements in a transaction and roll back)
EXPLAIN ANALYZE {slow_query};
- Check for missing indexes
SELECT relname, seq_scan, idx_scan
FROM pg_stat_user_tables
WHERE seq_scan > idx_scan
ORDER BY seq_scan DESC;
Common Fixes:
- Add missing index:
CREATE INDEX CONCURRENTLY idx_name ON table_name (column_name);
- Update statistics:
ANALYZE table_name;
- Rewrite inefficient query
- Add query timeout
Deadlocks​
Symptoms:
- "deadlock detected" errors
- Transactions rolling back
- Intermittent failures on writes
Diagnostic Steps:
- Check for locks
SELECT blocked_locks.pid AS blocked_pid,
blocking_locks.pid AS blocking_pid,
blocked_activity.query AS blocked_query,
blocking_activity.query AS blocking_query
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_locks blocking_locks
ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
Common Fixes:
- Ensure consistent ordering of table access
- Reduce transaction duration
- Use SELECT ... FOR UPDATE SKIP LOCKED
- Implement retry logic in application
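Two of these fixes can be sketched in application code: retrying a transaction that was rolled back by a deadlock, and locking rows in a consistent order. `DeadlockDetected` below is a hypothetical stand-in for whatever exception the database driver raises for PostgreSQL SQLSTATE 40P01. The retry count and delays are illustrative assumptions.

```python
import random
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


class DeadlockDetected(Exception):
    """Hypothetical stand-in for the driver's deadlock error (SQLSTATE 40P01)."""


def run_with_deadlock_retry(txn: Callable[[], T], retries: int = 3,
                            sleep: Callable[[float], None] = time.sleep) -> T:
    """Re-run the whole transaction when it is chosen as the deadlock victim."""
    for attempt in range(retries + 1):
        try:
            return txn()
        except DeadlockDetected:
            if attempt == retries:
                raise
            # Small randomized pause so the competing transactions desynchronize.
            sleep(random.uniform(0, 0.1 * 2 ** attempt))


def lock_order(row_ids: Iterable[int]) -> List[int]:
    """Sort IDs so every transaction acquires row locks in the same order."""
    return sorted(set(row_ids))
```

The retry wraps the entire transaction, not a single statement, because PostgreSQL rolls the whole transaction back when it resolves a deadlock.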
External Integration Issues​
Meta API Errors​
| Error Code | Meaning | Solution |
|---|---|---|
| 4 | Application rate limit | Reduce request frequency, implement backoff |
| 17 | User rate limit | Wait for reset (1 hour) |
| 100 | Invalid parameter | Check API parameters |
| 190 | Access token expired | Refresh token |
| 200 | Permission denied | User must grant permission |
| 278 | Temporary issue | Retry after delay |
Diagnostic Steps:
- Check rate limit status
# -i prints the response headers; look for x-business-use-case-usage
curl -i "https://graph.facebook.com/v19.0/me?access_token={token}&debug=all"
- Validate token
curl "https://graph.facebook.com/debug_token?input_token={token}&access_token={app_token}"
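Graph API failures come back as a JSON body with an `error` object containing a numeric `code`. The sketch below maps that code to the actions in the error table above. The action labels and the function name are illustrative assumptions, and codes not listed in the table are reported as unknown rather than guessed.

```python
import json

# Actions mirror the Meta API error table in this guide.
META_ERROR_ACTIONS = {
    4: "backoff",
    17: "wait",
    100: "fix_request",
    190: "refresh_token",
    200: "request_permission",
    278: "retry",
}


def classify_meta_error(response_body: str) -> str:
    """Return an action for a Graph API response, or 'none' if it has no error."""
    try:
        payload = json.loads(response_body)
    except json.JSONDecodeError:
        return "unknown"
    error = payload.get("error") if isinstance(payload, dict) else None
    if not isinstance(error, dict):
        return "none"
    code = error.get("code")
    if not isinstance(code, int):
        return "unknown"
    return META_ERROR_ACTIONS.get(code, "unknown")
```

Keeping the mapping in one dictionary makes it easy to keep code and documentation in step when new error codes are added to the table.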
Stripe Errors​
| Error | Meaning | Solution |
|---|---|---|
| card_declined | Payment failed | Customer updates payment |
| expired_card | Card expired | Customer updates card |
| invalid_request_error | Bad API call | Check request parameters |
| authentication_error | Bad API key | Verify Stripe key |
| rate_limit_error | Too many requests | Implement backoff |
Performance Issues​
High Latency​
Symptoms:
- Slow page loads
- API response times > 500ms
- User complaints about speed
Diagnostic Steps:
- Check which service is slow
  - Review CloudWatch latency metrics by service
  - Check individual service health endpoints
- Check database performance
-- Average query time in ms (requires pg_stat_statements; on PostgreSQL 12 and earlier the column is total_time)
SELECT query, calls, total_exec_time / calls AS avg_time_ms
FROM pg_stat_statements
ORDER BY total_exec_time DESC LIMIT 20;
- Check external API latency
  - Review third-party status pages
  - Check timeout configurations
Common Causes & Fixes:
| Cause | Indicator | Fix |
|---|---|---|
| Database | High DB latency | Add indexes, optimize queries |
| External API | High Meta/Stripe latency | Add caching, increase timeouts |
| Application | CPU-bound | Scale horizontally, optimize code |
| Network | High latency between services | Check VPC configuration |
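When narrowing down which endpoint is slow, a tail percentile is usually more telling than an average. The sketch below flags endpoints whose 95th-percentile latency exceeds the 500 ms symptom threshold mentioned above. The input shape (endpoint name mapped to a list of millisecond samples) and the function name are assumptions for illustration.

```python
from statistics import quantiles
from typing import Dict, List


def slow_endpoints(samples_ms: Dict[str, List[float]],
                   threshold_ms: float = 500.0) -> Dict[str, float]:
    """Return {endpoint: p95} for endpoints whose p95 exceeds the threshold."""
    slow = {}
    for endpoint, samples in samples_ms.items():
        if len(samples) < 2:
            continue  # quantiles() needs at least two data points
        # n=20 yields 19 cut points; the last one is the 95th percentile.
        p95 = quantiles(samples, n=20)[-1]
        if p95 > threshold_ms:
            slow[endpoint] = p95
    return slow
```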
Memory Issues​
Symptoms:
- OOM (Out of Memory) kills
- Service restarts
- Increasing memory usage over time
Diagnostic Steps:
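- Check container memory usage
# ECS ({start} and {end} are ISO 8601 timestamps)
aws cloudwatch get-metric-statistics --namespace AWS/ECS \
--metric-name MemoryUtilization \
--dimensions Name=ClusterName,Value=production Name=ServiceName,Value={service} \
--start-time {start} --end-time {end} --period 300 --statistics Average Maximum
- Check for memory leaks
  - Monitor memory over time
  - Check for growing object counts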
Common Fixes:
- Increase container memory limits
- Fix memory leaks in code
- Implement proper connection cleanup
- Add request timeouts
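For Python services, the standard-library tracemalloc module can show which source lines account for memory growth between two points in time. The sketch below wraps a workload between two snapshots. `top_memory_growth` is a hypothetical helper, and in a long-running service the snapshots would more likely be taken minutes apart rather than around a single call.

```python
import tracemalloc


def top_memory_growth(workload, limit: int = 5):
    """Run workload() and return the source lines whose allocations grew most."""
    tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        workload()
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # Each entry is a StatisticDiff with size_diff (bytes) and count_diff (blocks).
    return after.compare_to(before, "lineno")[:limit]
```

Lines that keep appearing at the top with a positive `size_diff` across repeated runs are the first places to look for a leak.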
Celery Queue Backlog​
Symptoms:
- Tasks queuing up
- Delayed processing
- Flower shows large queue
Diagnostic Steps:
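- Check queue depth
# inspect reports tasks held by workers; the broker-side backlog is visible in Flower
celery -A complyai inspect active
celery -A complyai inspect reserved
- Check worker status
celery -A complyai inspect ping
- Check for failed tasks
# Celery has no "inspect failed" subcommand; review failures in Flower or the worker logs.
# Per-worker task totals:
celery -A complyai inspect stats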
Common Fixes:
- Scale up workers
- Increase worker concurrency
- Clear stuck tasks (this discards every queued task, not only the stuck ones):
celery -A complyai purge
- Fix failing tasks blocking queue
Deployment Issues​
Deployment Failing​
Symptoms:
- ECS deployment stuck
- Health checks failing
- Rollback triggered
Diagnostic Steps:
- Check deployment status
aws ecs describe-services --cluster production --services {service}
- Check task failures
aws ecs describe-tasks --cluster production --tasks {task_arn}
- Check container logs
aws logs get-log-events --log-group-name /ecs/{service} \
--log-stream-name {stream}
Common Causes:
| Issue | Symptom | Fix |
|---|---|---|
| Health check failing | Tasks start then stop | Fix health endpoint, check port |
| Missing env var | Application crash on start | Add to task definition |
| Resource constraints | Task won't start | Increase CPU/memory |
| Image pull failed | ECR auth error | Refresh ECR credentials |
Rollback Procedure​
- Identify last working task definition
# Returns the latest revision number; list earlier ones with: aws ecs list-task-definitions --family-prefix {service} --sort DESC
aws ecs describe-task-definition --task-definition {service} \
--query 'taskDefinition.revision'
- Deploy previous version
aws ecs update-service --cluster production --service {service} \
--task-definition {service}:{previous_revision}
- Monitor rollback
aws ecs wait services-stable --cluster production --services {service}
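The `{previous_revision}` placeholder can be derived from the current task definition name or ARN. The sketch below does only the string arithmetic. It assumes the immediately preceding revision is the one to roll back to, which should be confirmed first, since that revision may have been deregistered or may itself be faulty. `previous_task_definition` is a hypothetical helper name.

```python
def previous_task_definition(current: str) -> str:
    """Turn 'family:42' or a full task definition ARN into 'family:41'."""
    name = current.rsplit("/", 1)[-1]  # drop the ARN prefix if present
    family, _, revision = name.rpartition(":")
    if not family or not revision.isdigit():
        raise ValueError(f"expected 'family:revision', got {current!r}")
    if int(revision) <= 1:
        raise ValueError("revision 1 has no predecessor")
    return f"{family}:{int(revision) - 1}"
```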
Quick Troubleshooting Checklist​
General Investigation Order​
- Is it down for everyone or just one user?
  - Check status page
  - Try different user/account
- What changed recently?
  - Recent deployments
  - Configuration changes
  - External service issues
- Check the obvious first
  - Service health
  - Database connectivity
  - External dependencies
- Follow the data
  - Trace request through services
  - Check each layer (UI → API → DB → External)
- Check logs
  - Application logs
  - Error rates
  - Recent errors
Escalation Path​
| Level | When | Who |
|---|---|---|
| L1 | First responder, common issues | On-call engineer |
| L2 | Complex issues, database | Senior engineer |
| L3 | Architecture issues, major incidents | Engineering lead |
| Executive | Customer-impacting, data breach | CTO/CEO |
Related Documents​
- Runbooks - Detailed procedures
- Data Governance - Incident response policies
- Service Architecture - System design
- Quick Reference - Status codes and lookups
Last Updated: December 2024