Data Storage Architecture
Overview
The CROW Data Storage Architecture is built entirely on Cloudflare's edge infrastructure, providing efficient storage, processing, and retrieval of interaction data from various sources including websites, CCTV, and social media. This architecture leverages D1 (relational database), R2 (object storage), and Vectorize (vector database) for a unified, globally distributed storage solution.
Architecture Components
1. D1 Database
Cloudflare D1 serves as the primary relational database for structured data:
- Structured Data: Users, organizations, products, API keys
- Interaction Metadata: Session info, timestamps, sources
- Processing State: Job status, workflow state
- Configuration: Settings, preferences, rules
Key Features:
- SQLite-based with familiar SQL syntax
- Automatic global replication
- Point-in-time recovery
- Sub-millisecond read latency at edge
2. R2 Object Storage
Cloudflare R2 provides S3-compatible object storage:
- Raw Interactions: Original event payloads
- CCTV Frames: Extracted video frames
- Export Files: Generated reports (PDF, CSV)
- Assets: Images, documents, attachments
Key Features:
- Zero egress fees
- S3 API compatibility
- Automatic global distribution
- Lifecycle policies for data retention
3. Vectorize
Cloudflare Vectorize provides AI-native vector storage and search:
- Interaction Embeddings: Vectorized interaction text from all sources
- Product Embeddings: Product description vectors
- Social Embeddings: Social media content vectors
- Organization Context: Organization-specific context embeddings
- Pattern Embeddings: Discovered pattern vectors
Key Features:
- High-dimensional vector storage (1536 dimensions for Gemini embeddings)
- Fast similarity search
- Metadata filtering
- Automatic index optimization
- Used by Chat Service for semantic search
4. Durable Objects
Cloudflare Durable Objects for stateful session management:
- Web Session State: Active user session tracking
- Inactivity Alarms: Automatic 1-hour inactivity triggers
- Builder State: Signup flow state management
Key Features:
- Strong consistency
- Low-latency state access
- Automatic alarm triggers
- Per-session isolation
Data Flow
Ingestion Flow
Storage Architecture
Retrieval Flow
Detailed Process Flow
1. New Interaction Processing
When a new interaction is received from any component (Web, Social, CCTV):
- Reception: Component Worker or Durable Object receives raw data
- Queue Dispatch: Session data sent to Interaction Queue
- Interaction Service: Consumes queue message
- Context Retrieval: Fetches Organization and Product context
- AI Processing: Gemini AI analyzes data via Vercel AI SDK
- Storage:
- D1: Metadata record with unique ID, source, timestamps
- R2: Raw payload (optional, for archival)
- Vectorize: Text embeddings for semantic search
- Analytics: Notifies Analytics Service for billing tracking
2. Pattern Discovery
Pattern Service processes interactions periodically:
- Cron Trigger: Daily execution
- Retrieve Interactions: Last 24 hours from D1/Vectorize
- Context Enrichment: Add Organization and Product context
- AI Analysis: Gemini AI identifies patterns via Vercel AI SDK
- Storage:
- D1: Pattern metadata with time period (day, week, month, year)
- Vectorize: Pattern embeddings
- Self-triggering: Queue messages for multi-period aggregation
3. Data Retrieval Process
When retrieving data (Dashboard, Chat, API):
- Authentication: Gateway validates session/API key, issues JWT
- Pre-filtering: Service queries D1 by criteria (date, product, source, org)
- Semantic Search: Vectorize performs vector similarity search
- Results Assembly: Matching IDs fetch complete metadata from D1
- Optional Raw Data: Retrieve original content from R2 if needed
- Permission Check: Validate user has access to requested data
Database Schema (D1)
Core Tables
-- Organizations
CREATE TABLE organizations (
id TEXT PRIMARY KEY,
name TEXT NOT NULL,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
settings JSON
);
-- Users (1:1 with Organizations)
CREATE TABLE users (
id TEXT PRIMARY KEY,
org_id TEXT NOT NULL REFERENCES organizations(id),
email TEXT UNIQUE NOT NULL,
username TEXT,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
-- User Permissions
CREATE TABLE user_permissions (
user_id TEXT PRIMARY KEY REFERENCES users(id),
-- Chat constraints
chat_enabled BOOLEAN DEFAULT true,
chat_min_duration TEXT DEFAULT '1yr',
chat_components JSON DEFAULT '["web"]',
-- Feature toggles
interactions_enabled BOOLEAN DEFAULT true,
patterns_enabled BOOLEAN DEFAULT true,
teams_enabled BOOLEAN DEFAULT false,
api_keys_enabled BOOLEAN DEFAULT false,
billing_enabled BOOLEAN DEFAULT false,
settings_enabled BOOLEAN DEFAULT false,
notifications_enabled BOOLEAN DEFAULT true,
integrations_enabled BOOLEAN DEFAULT false,
products_enabled BOOLEAN DEFAULT false,
updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
-- API Keys
CREATE TABLE api_keys (
id TEXT PRIMARY KEY,
org_id TEXT NOT NULL REFERENCES organizations(id),
key_hash TEXT NOT NULL,
name TEXT,
expires_at DATETIME,
-- Permissions
interactions_enabled BOOLEAN DEFAULT true,
patterns_enabled BOOLEAN DEFAULT true,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
last_used DATETIME
);
-- Products
CREATE TABLE products (
id TEXT PRIMARY KEY,
org_id TEXT REFERENCES organizations(id),
name TEXT NOT NULL,
description TEXT,
metadata JSON
);
-- Interactions
CREATE TABLE interactions (
id TEXT PRIMARY KEY,
org_id TEXT NOT NULL REFERENCES organizations(id),
session_id TEXT NOT NULL,
source TEXT NOT NULL, -- 'web', 'social', 'cctv'
description TEXT, -- AI-generated textual description
products JSON, -- Array of product IDs involved
created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
metadata JSON,
vector_id TEXT -- Reference to Vectorize embedding
);
-- Patterns
CREATE TABLE patterns (
id TEXT PRIMARY KEY,
org_id TEXT NOT NULL REFERENCES organizations(id),
time_period TEXT NOT NULL, -- 'day', 'week', 'month', 'year', 'all_time'
description TEXT, -- AI-generated pattern description
products JSON, -- Array of product IDs involved
confidence REAL, -- Pattern confidence score
created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
metadata JSON,
vector_id TEXT
);
-- Chat History
CREATE TABLE chat_sessions (
id TEXT PRIMARY KEY,
user_id TEXT NOT NULL REFERENCES users(id),
org_id TEXT NOT NULL REFERENCES organizations(id),
title TEXT,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
updated_at DATETIME
);
CREATE TABLE chat_messages (
id TEXT PRIMARY KEY,
session_id TEXT NOT NULL REFERENCES chat_sessions(id),
role TEXT NOT NULL, -- 'user' or 'assistant'
content TEXT NOT NULL,
artifacts JSON, -- Generated files, charts, etc.
created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
-- Notifications
CREATE TABLE notifications (
id TEXT PRIMARY KEY,
user_id TEXT NOT NULL REFERENCES users(id),
type TEXT NOT NULL,
title TEXT NOT NULL,
content TEXT,
read BOOLEAN DEFAULT false,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
-- Social Media Link Tracking
CREATE TABLE seen_links (
url TEXT PRIMARY KEY,
org_id TEXT NOT NULL,
first_seen DATETIME DEFAULT CURRENT_TIMESTAMP,
last_checked DATETIME,
content_hash TEXT,
source TEXT -- 'search' or 'direct'
);
R2 Bucket Structure
crow-storage/
├── raw/
│ ├── interactions/
│ │ └── {org_id}/{component}/{date}/{interaction_id}.json
│ └── sessions/
│ └── {org_id}/{component}/{session_id}.json
├── chat/
│ ├── artifacts/
│ │ └── {org_id}/{chat_id}/{artifact_id}
│ └── assets/
│ └── {org_id}/{chat_id}/{asset_id}
├── exports/
│ └── {org_id}/{export_id}/{filename}
└── products/
└── {org_id}/{product_id}/{image}
Directory Purposes:
raw/: Optional raw interaction and session data archivalchat/: Chat-generated artifacts (reports, charts) and assets (images)exports/: User-requested data exports (CSV, PDF, JSON)products/: Product images and media
Vectorize Indexes
interaction-embeddings
- Dimensions: 1536 (Gemini embeddings)
- Metric: Cosine similarity
- Metadata: org_id, source (web/social/cctv), session_id, created_at
- Purpose: Semantic search across all interactions
pattern-embeddings
- Dimensions: 1536 (Gemini embeddings)
- Metric: Cosine similarity
- Metadata: org_id, time_period, confidence, created_at
- Purpose: Pattern discovery and trend analysis
product-embeddings
- Dimensions: 1536 (Gemini embeddings)
- Metric: Cosine similarity
- Metadata: org_id, category, price_range
- Purpose: Product similarity and recommendations
organization-context-embeddings
- Dimensions: 1536 (Gemini embeddings)
- Metric: Cosine similarity
- Metadata: org_id, context_type, updated_at
- Purpose: Organization-specific context for AI processing
All embeddings generated via Gemini AI through Vercel AI SDK
Technology Stack
| Component | Cloudflare Service | Purpose |
|---|---|---|
| Relational DB | D1 | Structured data, metadata |
| Object Storage | R2 | Raw data, files, exports, chat artifacts |
| Vector DB | Vectorize | Semantic search, AI embeddings |
| Session State | Durable Objects | Web session management, builder state |
| Caching | Workers KV | Session cache, JWT mapping (1-5 min TTL) |
| Queuing | Queues | Async processing, service communication |
AI Technology:
- Models: Google Gemini (all AI processing)
- Framework: Vercel AI SDK (agent orchestration)
- Gateway: Cloudflare AI Gateway (LLM request routing)
Advantages
Cloudflare Platform Benefits
- Global Distribution: Data served from 300+ edge locations
- Zero Egress: No fees for data transfer out of R2
- Integrated: Seamless connectivity between D1, R2, Vectorize
- Simplified Operations: No server management required
- Cost Effective: Pay-per-use pricing model
Architecture Benefits
- Separation of Concerns: Relational, object, and vector data stored optimally
- Flexibility: Each layer optimized for its specific use case
- Scalability: Cloudflare manages scaling automatically
- Performance: Edge distribution ensures low latency globally
Data Retention & Lifecycle
Retention Policies
- Interactions Metadata: Retained indefinitely in D1
- Pattern Metadata: Retained indefinitely (all time periods)
- Raw Interaction Data: Optional archival in R2 (30 days hot, then archive)
- Chat History: Retained indefinitely
- Chat Artifacts: 30 days in R2
- Exports: 7 days in R2, auto-deleted after
- Session Cache: 1-5 minutes TTL in Workers KV
Privacy Compliance:
- GDPR: User can request data deletion from settings
- Organization Context: Clearable from dashboard
- CCTV Data: Not stored long-term (processed in real-time)
R2 Lifecycle Rules
{
"rules": [
{
"id": "archive-raw-interactions",
"filter": { "prefix": "raw/interactions/" },
"transitions": [
{ "days": 30, "storageClass": "ARCHIVE" }
]
},
{
"id": "delete-chat-artifacts",
"filter": { "prefix": "chat/" },
"expiration": { "days": 30 }
},
{
"id": "delete-exports",
"filter": { "prefix": "exports/" },
"expiration": { "days": 7 }
}
]
}
Future Considerations
Scalability Planning
- Monitor D1 database size and query patterns
- Implement data partitioning if needed
- Plan for multi-region data residency requirements
Performance Optimization
- Implement caching layer with Workers KV
- Optimize Vectorize index configuration
- Consider hybrid search (vector + keyword)
Compliance
- GDPR data deletion workflows
- Data residency controls
- Audit logging for access
Glossary
- D1: Cloudflare's serverless SQLite database
- R2: Cloudflare's S3-compatible object storage
- Vectorize: Cloudflare's vector database for AI embeddings
- Embeddings: Numerical vector representations of text