ComplyAI Data Infrastructure

This document outlines the data architecture, databases, and key data models used across the ComplyAI infrastructure.

Database Overview

The infrastructure primarily relies on PostgreSQL for relational data and ChromaDB for vector embeddings (AI/ML).

| Service | Database | ORM / Driver | Notes |
|---|---|---|---|
| `complyai-api` | PostgreSQL | Flask-SQLAlchemy | Main business data |
| `api-async` | PostgreSQL | AsyncPG | High-performance async access |
| `complyai_cms` | PostgreSQL | Flask-SQLAlchemy | CMS content data |
| `complyai-ipu` | ChromaDB | N/A | Vector embeddings for AI |
| `complyai-violin` | ChromaDB | N/A | Vector embeddings for AI |

The core data models are defined in app/models/core_models.py:

- SpendClientAdAccounts: Tracks client ad accounts.
- SpendClientAdAccountsSpendData: Stores spend data associated with client ad accounts.
- LineOfCreditAdAccounts: Manages credit lines for ad accounts.
- LineOfCreditAdAccountSpendData: Stores spend data specific to credit lines.
- FacebookCurrentAdData: Holds real-time ad data from Facebook.
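The relationship between an ad account and its spend data can be illustrated with a minimal sketch. This is not the actual schema: it uses plain dataclasses rather than Flask-SQLAlchemy model classes, and every field name (`account_id`, `client_name`, `day`, `amount`) is a hypothetical placeholder, since the real columns are not listed here.

```python
from dataclasses import dataclass, field
from datetime import date
from decimal import Decimal


@dataclass
class SpendDataRow:
    """Hypothetical stand-in for one SpendClientAdAccountsSpendData row."""
    day: date        # assumed column: the day the spend was recorded
    amount: Decimal  # assumed column: spend for that day


@dataclass
class AdAccount:
    """Hypothetical stand-in for one SpendClientAdAccounts row."""
    account_id: str   # assumed column
    client_name: str  # assumed column
    spend_rows: list[SpendDataRow] = field(default_factory=list)

    def total_spend(self) -> Decimal:
        # Sum every spend row attached to this account.
        return sum((row.amount for row in self.spend_rows), Decimal("0"))


account = AdAccount(account_id="act_123", client_name="Example Client")
account.spend_rows.append(SpendDataRow(date(2024, 1, 1), Decimal("10.50")))
account.spend_rows.append(SpendDataRow(date(2024, 1, 2), Decimal("4.25")))
print(account.total_spend())  # 14.75
```

In a real ORM model the `spend_rows` list would typically be a one-to-many relationship backed by a foreign key; the sketch only shows the shape of the data.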

The www models are located in www/<app>/models.py, following a modular app structure.

- ChromaDB: Used by complyai-ipu and complyai-violin to store embeddings for semantic search and RAG (Retrieval-Augmented Generation) flows.
- Models: Embeddings are generated with HuggingFace transformers and torch.
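To make the idea of semantic search over stored embeddings concrete, here is a minimal sketch of nearest-neighbor lookup by cosine similarity. It is an in-memory toy, not the ChromaDB API, and the embeddings are hand-written three-dimensional vectors rather than output from a transformers model. The class `ToyVectorStore`, the document IDs, and the texts are all hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """In-memory stand-in for a vector collection (not the ChromaDB API)."""

    def __init__(self) -> None:
        self._items: dict[str, tuple[list[float], str]] = {}

    def add(self, doc_id: str, embedding: list[float], text: str) -> None:
        self._items[doc_id] = (embedding, text)

    def query(
        self, embedding: list[float], n_results: int = 1
    ) -> list[tuple[str, str, float]]:
        # Score every stored vector against the query, best match first.
        scored = [
            (doc_id, text, cosine_similarity(embedding, vec))
            for doc_id, (vec, text) in self._items.items()
        ]
        scored.sort(key=lambda item: item[2], reverse=True)
        return scored[:n_results]


store = ToyVectorStore()
store.add("doc-1", [1.0, 0.0, 0.0], "Hypothetical text about ad disclosures")
store.add("doc-2", [0.0, 1.0, 0.0], "Hypothetical text about billing")

top = store.query([0.9, 0.1, 0.0], n_results=1)
print(top[0][0])  # doc-1
```

In a RAG flow, the text of the top matches would then be passed to a language model as context. A real vector database replaces the linear scan here with an approximate nearest-neighbor index.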

Data moves through four stages:

- Ingestion: complyai-api ingests data via webhooks (FacebookWebhooks) and scrapers.
- Processing: api-async or Celery workers (in complyai-api) process the raw data.
- Storage:
  - Structured data -> PostgreSQL (SpendData, AdAccounts).
  - Unstructured/embedding data -> ChromaDB via complyai-ipu.
- Presentation: complyai-frontend and www fetch data via API endpoints.
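The storage split in this flow can be sketched as a small routing function. This is only an illustration under stated assumptions: the `RawEvent` type, the `process` and `route` functions, and the rule that a payload containing a `text` field counts as unstructured are all hypothetical, and no real database is touched.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class RawEvent:
    """Hypothetical raw input from the ingestion stage."""
    source: str  # e.g. "webhook" or "scraper"
    payload: dict[str, Any]


def process(event: RawEvent) -> dict[str, Any]:
    """Processing stage: normalize a raw event into a record."""
    record = dict(event.payload)
    record["source"] = event.source
    # Assumption: payloads carrying free text are treated as unstructured.
    record["kind"] = "unstructured" if "text" in event.payload else "structured"
    return record


def route(record: dict[str, Any]) -> str:
    """Storage stage: pick the destination store for a processed record."""
    return "chromadb" if record["kind"] == "unstructured" else "postgresql"


events = [
    RawEvent("webhook", {"account_id": "act_123", "spend": 12.5}),
    RawEvent("scraper", {"text": "Hypothetical scraped ad copy"}),
]
destinations = [route(process(event)) for event in events]
print(destinations)  # ['postgresql', 'chromadb']
```

Keeping classification in `process` and the destination choice in `route` mirrors the separation between the processing and storage stages, so either rule can change without touching the other.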