Data Storage Architecture
Overview
The CROW Data Storage Architecture is built entirely on Cloudflare's edge infrastructure, providing efficient storage, processing, and retrieval of interaction data from various sources including websites, CCTV, and social media. This architecture leverages D1 (relational database), R2 (object storage), and Vectorize (vector database) for a unified, globally distributed storage solution.
Architecture Components
1. D1 Database
Cloudflare D1 serves as the primary relational database for structured data:
- Structured Data: Users, organizations, products, API keys
- Interaction Metadata: Session info, timestamps, sources
- Processing State: Job status, workflow state
- Configuration: Settings, preferences, rules
Key Features:
- SQLite-based with familiar SQL syntax
- Global read replication (opt-in)
- Point-in-time recovery (Time Travel)
- Low-latency reads from Workers, especially when a read replica is nearby
2. R2 Object Storage
Cloudflare R2 provides S3-compatible object storage:
- Raw Interactions: Original event payloads
- CCTV Frames: Extracted video frames
- Export Files: Generated reports (PDF, CSV)
- Assets: Images, documents, attachments
Key Features:
- Zero egress fees
- S3 API compatibility
- Globally accessible, with optional location hints for bucket placement
- Lifecycle policies for data retention
3. Vectorize
Cloudflare Vectorize provides AI-native vector storage and search:
- Interaction Embeddings: Vectorized interaction text
- Product Embeddings: Product description vectors
- Social Embeddings: Social media content vectors
Key Features:
- High-dimensional vector storage
- Fast similarity search
- Metadata filtering
- Automatic index optimization
Data Flow
Ingestion Flow
Storage Architecture
Retrieval Flow
Detailed Process Flow
1. New Interaction Processing
When a new interaction is received:
- Reception: Ingestion Worker receives raw data from SDK
- Validation: Data schema and API key validation
- Storage:
  - R2: Raw payload stored for archival
  - D1: Metadata record created with unique ID
  - Vectorize: Text vectorized and stored for semantic search
- Queue: Session queued for AI processing via Cloudflare Queues
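The steps above can be sketched as a single function. This is a minimal sketch, not the actual CROW Worker: the binding names (`RAW`, `DB`, `INTERACTIONS`, `JOBS`), the `NewInteraction` shape, and the `embed` callback are all assumptions, and API key validation is left out. The small interfaces declare only the methods used here (`put`, `prepare`/`bind`/`run`, `upsert`, `send`), which the real R2, D1, Vectorize, and Queues bindings also provide.

```typescript
// Minimal structural stand-ins for the Cloudflare bindings (only what is used).
interface RawStore { put(key: string, value: string): Promise<unknown>; }
interface SqlStatement { bind(...values: unknown[]): SqlStatement; run(): Promise<unknown>; }
interface SqlDatabase { prepare(query: string): SqlStatement; }
interface EmbeddingIndex {
  upsert(vectors: { id: string; values: number[]; metadata?: Record<string, string> }[]): Promise<unknown>;
}
interface JobQueue { send(message: unknown): Promise<unknown>; }

// Hypothetical binding names; the source does not name them.
interface IngestEnv { RAW: RawStore; DB: SqlDatabase; INTERACTIONS: EmbeddingIndex; JOBS: JobQueue; }

// Hypothetical payload shape sent by the SDK.
interface NewInteraction {
  id: string; orgId: string; sessionId: string; source: string; type: string; text: string;
}

// Hypothetical embedding callback; the source does not say how vectors are produced.
type Embed = (text: string) => Promise<number[]>;

// Validation: every field must be a non-empty string.
function validateInteraction(input: Partial<NewInteraction>): NewInteraction {
  const required = ["id", "orgId", "sessionId", "source", "type", "text"] as const;
  for (const field of required) {
    const value = input[field];
    if (typeof value !== "string" || value === "") throw new Error(`missing field: ${field}`);
  }
  return input as NewInteraction;
}

async function ingest(
  env: IngestEnv, embed: Embed, input: Partial<NewInteraction>, date: string,
): Promise<string> {
  const it = validateInteraction(input);

  // R2: archive the raw payload under raw/interactions/{org_id}/{date}/{interaction_id}.json
  const key = `raw/interactions/${it.orgId}/${date}/${it.id}.json`;
  await env.RAW.put(key, JSON.stringify(it));

  // D1: create the metadata record; the interaction ID doubles as the vector ID here.
  await env.DB
    .prepare("INSERT INTO interactions (id, org_id, session_id, source, type, vector_id) VALUES (?, ?, ?, ?, ?, ?)")
    .bind(it.id, it.orgId, it.sessionId, it.source, it.type, it.id)
    .run();

  // Vectorize: store the embedding with filterable metadata.
  const values = await embed(it.text);
  await env.INTERACTIONS.upsert([
    { id: it.id, values, metadata: { org_id: it.orgId, source: it.source, type: it.type } },
  ]);

  // Queues: hand the session to asynchronous AI processing.
  await env.JOBS.send({ sessionId: it.sessionId, interactionId: it.id });
  return key;
}
```

The writes run sequentially so that a failure in an early step stops the later ones. The sketch makes no attempt at rollback, so a production version would need to decide how to handle partial writes.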
2. Data Retrieval Process
When retrieving interactions:
- Pre-filtering: D1 queried to filter by criteria (date, product, source)
- Vector Search: Vectorize runs a semantic search restricted to the pre-filtered set
- Results Assembly: Complete metadata for the matching IDs is fetched from D1
- Raw Data: Optional raw content retrieved from R2
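One way to realize this retrieval flow is sketched below. The binding names (`DB`, `INDEX`), the `SearchCriteria` shape, and the choice to intersect Vectorize matches with the D1 candidate IDs are assumptions made for illustration; the source does not specify how the search is restricted to the filtered set. The interfaces declare only the methods used (`prepare`/`bind`/`all` on D1, `query` on Vectorize).

```typescript
interface InteractionRow { id: string; [column: string]: unknown }
interface FilterStatement {
  bind(...values: unknown[]): FilterStatement;
  all<T>(): Promise<{ results: T[] }>;
}
interface FilterDatabase { prepare(query: string): FilterStatement; }
interface SimilarityIndex {
  query(
    vector: number[],
    options: { topK: number; filter?: Record<string, string> },
  ): Promise<{ matches: { id: string; score: number }[] }>;
}

// Hypothetical binding names and criteria shape.
interface SearchEnv { DB: FilterDatabase; INDEX: SimilarityIndex; }
interface SearchCriteria { orgId: string; source: string; since: string; }

async function searchInteractions(
  env: SearchEnv, queryVector: number[], criteria: SearchCriteria, topK = 10,
): Promise<{ score: number; row: InteractionRow }[]> {
  // 1. Pre-filtering: D1 narrows the candidates by organization, source, and date.
  const pre = await env.DB
    .prepare("SELECT id FROM interactions WHERE org_id = ? AND source = ? AND created_at >= ?")
    .bind(criteria.orgId, criteria.source, criteria.since)
    .all<{ id: string }>();
  const candidates = new Set(pre.results.map((r) => r.id));
  if (candidates.size === 0) return [];

  // 2. Vector search: a metadata filter scopes the query; matches outside the
  //    D1 candidate set are then dropped.
  const { matches } = await env.INDEX.query(queryVector, {
    topK,
    filter: { org_id: criteria.orgId, source: criteria.source },
  });
  const hits = matches.filter((m) => candidates.has(m.id));
  if (hits.length === 0) return [];

  // 3. Results assembly: fetch full metadata for the matching IDs from D1.
  const placeholders = hits.map(() => "?").join(", ");
  const rows = await env.DB
    .prepare(`SELECT * FROM interactions WHERE id IN (${placeholders})`)
    .bind(...hits.map((h) => h.id))
    .all<InteractionRow>();
  const byId = new Map(rows.results.map((r): [string, InteractionRow] => [r.id, r]));

  // Keep Vectorize's ranking; skip any ID that has no row.
  const out: { score: number; row: InteractionRow }[] = [];
  for (const hit of hits) {
    const row = byId.get(hit.id);
    if (row !== undefined) out.push({ score: hit.score, row });
  }
  return out;
}
```

Because the intersection happens after the vector query, the function can return fewer than `topK` results when many nearest neighbours fall outside the date range. The optional R2 fetch of raw content is omitted; it would be one more lookup per returned ID.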
Database Schema (D1)
Core Tables
-- Organizations
CREATE TABLE organizations (
  id TEXT PRIMARY KEY,
  name TEXT NOT NULL,
  created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
  settings JSON
);

-- Users
CREATE TABLE users (
  id TEXT PRIMARY KEY,
  org_id TEXT REFERENCES organizations(id),
  email TEXT UNIQUE NOT NULL,
  role TEXT DEFAULT 'member'
);

-- Products
CREATE TABLE products (
  id TEXT PRIMARY KEY,
  org_id TEXT REFERENCES organizations(id),
  name TEXT NOT NULL,
  description TEXT,
  metadata JSON
);

-- Interactions
CREATE TABLE interactions (
  id TEXT PRIMARY KEY,
  org_id TEXT REFERENCES organizations(id),
  session_id TEXT,
  source TEXT,
  type TEXT,
  created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
  metadata JSON,
  vector_id TEXT
);
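As an illustration of how the `interactions` table might be queried, the sketch below builds a parameterized filter query from optional criteria. The `InteractionCriteria` shape, the function name, and the default limit are hypothetical. The date bounds are compared as strings, which relies on `created_at` being stored in the `YYYY-MM-DD HH:MM:SS` form that `CURRENT_TIMESTAMP` produces.

```typescript
// Optional filters for the interactions table; field names are illustrative.
interface InteractionCriteria {
  orgId: string;
  source?: string;
  type?: string;
  since?: string; // inclusive lower bound on created_at, e.g. "2024-01-01 00:00:00"
  until?: string; // exclusive upper bound on created_at
}

interface BuiltQuery { sql: string; params: (string | number)[] }

function buildInteractionQuery(criteria: InteractionCriteria, limit = 100): BuiltQuery {
  // Always scope by organization; add further clauses only for supplied criteria.
  const where: string[] = ["org_id = ?"];
  const params: (string | number)[] = [criteria.orgId];

  const add = (clause: string, value: string | undefined): void => {
    if (value !== undefined) {
      where.push(clause);
      params.push(value);
    }
  };
  add("source = ?", criteria.source);
  add("type = ?", criteria.type);
  add("created_at >= ?", criteria.since);
  add("created_at < ?", criteria.until);

  // Values are passed as bound parameters, never interpolated into the SQL text.
  params.push(limit);
  const sql =
    "SELECT id, session_id, source, type, created_at, vector_id FROM interactions" +
    ` WHERE ${where.join(" AND ")} ORDER BY created_at DESC LIMIT ?`;
  return { sql, params };
}
```

The result can be passed to D1 as `db.prepare(sql).bind(...params)`. Queries shaped like this would likely benefit from an index on `(org_id, created_at)`, which the schema above does not define.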
R2 Bucket Structure
crow-storage/
├── raw/
│   ├── interactions/
│   │   └── {org_id}/{date}/{interaction_id}.json
│   └── sessions/
│       └── {org_id}/{session_id}.json
├── cctv/
│   └── {org_id}/{camera_id}/{timestamp}.jpg
├── exports/
│   └── {org_id}/{export_id}/{filename}
└── assets/
    └── {org_id}/{asset_id}
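Centralizing key construction keeps every writer and reader on the same layout. The helpers below are a sketch of that idea. The `r2Keys` name is hypothetical, and two formats are assumptions because the layout does not define them: `{date}` is taken to be the UTC day as `YYYY-MM-DD`, and `{timestamp}` to be Unix epoch milliseconds.

```typescript
// Reject empty segments and embedded slashes so a value cannot escape its prefix.
function segment(value: string): string {
  if (value === "" || value.includes("/")) {
    throw new Error(`invalid key segment: "${value}"`);
  }
  return value;
}

// Assumption: {date} is the UTC calendar day formatted as YYYY-MM-DD.
function utcDay(at: Date): string {
  return at.toISOString().slice(0, 10);
}

// One builder per branch of the bucket layout.
const r2Keys = {
  rawInteraction: (orgId: string, at: Date, interactionId: string): string =>
    `raw/interactions/${segment(orgId)}/${utcDay(at)}/${segment(interactionId)}.json`,

  rawSession: (orgId: string, sessionId: string): string =>
    `raw/sessions/${segment(orgId)}/${segment(sessionId)}.json`,

  // Assumption: {timestamp} is Unix epoch milliseconds.
  cctvFrame: (orgId: string, cameraId: string, at: Date): string =>
    `cctv/${segment(orgId)}/${segment(cameraId)}/${at.getTime()}.jpg`,

  exportFile: (orgId: string, exportId: string, filename: string): string =>
    `exports/${segment(orgId)}/${segment(exportId)}/${segment(filename)}`,

  asset: (orgId: string, assetId: string): string =>
    `assets/${segment(orgId)}/${segment(assetId)}`,
};
```

For example, `r2Keys.rawSession("org_1", "sess_7")` yields `raw/sessions/org_1/sess_7.json`. Keeping `{org_id}` directly under each top-level folder also makes per-organization listing and deletion a simple prefix operation.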
Vectorize Indexes
interaction-embeddings
- Dimensions: 1536 (OpenAI text-embedding-ada-002)
- Metric: Cosine similarity
- Metadata: org_id, source, type, created_at
product-embeddings
- Dimensions: 1536
- Metric: Cosine similarity
- Metadata: org_id, category, status
social-embeddings
- Dimensions: 1536
- Metric: Cosine similarity
- Metadata: org_id, platform, sentiment
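All three indexes rank matches by cosine similarity, which compares the direction of two vectors and ignores their length. The sketch below computes it directly and adds a dimension check against the 1536 listed above. Vectorize performs this ranking itself; the function names here are hypothetical and exist only to make the metric concrete.

```typescript
const EMBEDDING_DIMENSIONS = 1536;

// Guard against inserting vectors whose length does not match the index.
function assertDimensions(vector: number[], expected: number = EMBEDDING_DIMENSIONS): number[] {
  if (vector.length !== expected) {
    throw new Error(`expected ${expected} dimensions, got ${vector.length}`);
  }
  return vector;
}

// cos(a, b) = (a · b) / (|a| * |b|), ranging from -1 (opposite) to 1 (same direction).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error("vectors must have the same non-zero length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  if (normA === 0 || normB === 0) {
    throw new Error("cosine similarity is undefined for a zero vector");
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Scaling a vector does not change the result: `[1, 2, 3]` and `[2, 4, 6]` point in the same direction, so their similarity is 1 up to floating-point rounding.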
Technology Stack
| Component | Cloudflare Service | Purpose |
|---|---|---|
| Relational DB | D1 | Structured data, metadata |
| Object Storage | R2 | Raw data, files, exports |
| Vector DB | Vectorize | Semantic search, embeddings |
| Caching | Workers KV | Session cache, hot data |
| Queuing | Queues | Async processing |
Advantages
Cloudflare Platform Benefits
- Global Distribution: Requests are handled across Cloudflare's network of 300+ edge locations
- Zero Egress: No fees for data transfer out of R2
- Integrated: D1, R2, and Vectorize are all reachable from a Worker through bindings
- Simplified Operations: No server management required
- Cost Effective: Pay-per-use pricing model
Architecture Benefits
- Separation of Concerns: Relational, object, and vector data stored optimally
- Flexibility: Each layer optimized for its specific use case
- Scalability: Cloudflare manages scaling automatically
- Performance: Edge distribution ensures low latency globally
Data Retention & Lifecycle
Retention Policies
- Interactions: 90 days in hot storage, then archived
- Raw Data: 30 days, then moved to a compressed archive
- Exports: 7 days, then automatically deleted
- CCTV Frames: 24 hours, for privacy compliance
R2 Lifecycle Rules
{
  "rules": [
    {
      "id": "archive-raw-data",
      "filter": { "prefix": "raw/" },
      "transitions": [
        { "days": 30, "storageClass": "InfrequentAccess" }
      ]
    },
    {
      "id": "delete-exports",
      "filter": { "prefix": "exports/" },
      "expiration": { "days": 7 }
    }
  ]
}
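To make the retention rules easier to reason about and test, the sketch below evaluates them for a single object key. It is a local model of the policy, not something R2 runs. The rule table mirrors the prefixes and day counts listed in this section, and mapping the 24-hour CCTV policy to deletion is an assumption.

```typescript
type LifecycleAction = "keep" | "archive" | "delete";

interface RetentionRule {
  prefix: string;
  afterDays: number;
  action: "archive" | "delete";
}

// Mirrors the retention policies for objects stored in R2.
const RETENTION_RULES: RetentionRule[] = [
  { prefix: "raw/", afterDays: 30, action: "archive" },
  { prefix: "exports/", afterDays: 7, action: "delete" },
  // Assumption: CCTV frames are deleted (not archived) after 24 hours.
  { prefix: "cctv/", afterDays: 1, action: "delete" },
];

const DAY_MS = 24 * 60 * 60 * 1000;

// Returns the action due for an object, based on its key prefix and age.
function lifecycleAction(
  key: string,
  uploadedAt: Date,
  now: Date,
  rules: RetentionRule[] = RETENTION_RULES,
): LifecycleAction {
  const rule = rules.find((r) => key.startsWith(r.prefix));
  if (rule === undefined) return "keep"; // no rule covers this prefix
  const ageDays = (now.getTime() - uploadedAt.getTime()) / DAY_MS;
  return ageDays >= rule.afterDays ? rule.action : "keep";
}
```

Objects under `assets/` match no rule and are always kept. The first matching rule wins, so overlapping prefixes would need to be ordered from most to least specific.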
Future Considerations
Scalability Planning
- Monitor D1 database size and query patterns
- Implement data partitioning if needed
- Plan for multi-region data residency requirements
Performance Optimization
- Implement caching layer with Workers KV
- Optimize Vectorize index configuration
- Consider hybrid search (vector + keyword)
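The Workers KV caching layer mentioned in this list could follow a read-through pattern: check KV first, fall back to the slower source on a miss, then store the result with a TTL. The sketch below assumes a hypothetical `readThrough` helper and a minimal `HotCache` interface covering only the `get` and `put` calls that a KV namespace binding provides.

```typescript
// Minimal stand-in for a Workers KV namespace binding (only what is used).
interface HotCache {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, options?: { expirationTtl?: number }): Promise<void>;
}

// Returns the cached value when present; otherwise loads it, caches it, and returns it.
async function readThrough<T>(
  kv: HotCache,
  key: string,
  ttlSeconds: number,
  load: () => Promise<T>,
): Promise<T> {
  const hit = await kv.get(key);
  if (hit !== null) {
    return JSON.parse(hit) as T; // cache hit: skip the loader entirely
  }
  const value = await load();    // cache miss: e.g. a D1 query
  await kv.put(key, JSON.stringify(value), { expirationTtl: ttlSeconds });
  return value;
}
```

A call might look like `readThrough(env.CACHE, "org:" + orgId, 300, () => loadOrg(orgId))`, where `CACHE` and `loadOrg` are hypothetical names. KV is eventually consistent and enforces a minimum `expirationTtl` of 60 seconds, so this suits hot, slowly changing data rather than values that must be read back immediately after a write.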
Compliance
- GDPR data deletion workflows
- Data residency controls
- Audit logging for access
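A GDPR deletion workflow has to touch all three stores. The sketch below shows one possible order for erasing an organization's interaction data: vectors first (their IDs come from D1), then R2 objects by prefix, then the D1 rows. The binding names (`DB`, `BUCKET`, `INDEX`) and function names are hypothetical. It covers only the `interactions` table and does no batching of vector IDs, which a large tenant would need.

```typescript
interface ErasureStatement {
  bind(...values: unknown[]): ErasureStatement;
  all<T>(): Promise<{ results: T[] }>;
  run(): Promise<unknown>;
}
interface ErasureBucket {
  list(options: { prefix: string; cursor?: string }):
    Promise<{ objects: { key: string }[]; truncated: boolean; cursor?: string }>;
  delete(keys: string[]): Promise<unknown>;
}
// Hypothetical binding names.
interface ErasureEnv {
  DB: { prepare(query: string): ErasureStatement };
  BUCKET: ErasureBucket;
  INDEX: { deleteByIds(ids: string[]): Promise<unknown> };
}

// Deletes every object under a prefix, following list pagination.
async function deletePrefix(bucket: ErasureBucket, prefix: string): Promise<number> {
  let cursor: string | undefined;
  let deleted = 0;
  do {
    const page = await bucket.list(cursor === undefined ? { prefix } : { prefix, cursor });
    const keys = page.objects.map((o) => o.key);
    if (keys.length > 0) {
      await bucket.delete(keys);
      deleted += keys.length;
    }
    cursor = page.truncated ? page.cursor : undefined;
  } while (cursor !== undefined);
  return deleted;
}

async function eraseOrganization(
  env: ErasureEnv, orgId: string,
): Promise<{ vectors: number; objects: number }> {
  // 1. Vectorize: look up vector IDs in D1 before the rows disappear.
  const rows = await env.DB
    .prepare("SELECT vector_id FROM interactions WHERE org_id = ? AND vector_id IS NOT NULL")
    .bind(orgId)
    .all<{ vector_id: string }>();
  const vectorIds = rows.results.map((r) => r.vector_id);
  if (vectorIds.length > 0) await env.INDEX.deleteByIds(vectorIds);

  // 2. R2: every branch of the bucket layout keeps {org_id} right after its folder.
  let objects = 0;
  for (const root of ["raw/interactions", "raw/sessions", "cctv", "exports", "assets"]) {
    objects += await deletePrefix(env.BUCKET, `${root}/${orgId}/`);
  }

  // 3. D1: remove the metadata rows last, since steps 1 and 2 depend on them.
  await env.DB.prepare("DELETE FROM interactions WHERE org_id = ?").bind(orgId).run();
  return { vectors: vectorIds.length, objects };
}
```

Deleting the D1 rows last means a failed run can be retried, because the vector IDs are still discoverable. An audit log entry per erasure request would fit naturally at the end of `eraseOrganization`.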
Glossary
- D1: Cloudflare's serverless SQLite database
- R2: Cloudflare's S3-compatible object storage
- Vectorize: Cloudflare's vector database for AI embeddings
- Embeddings: Numerical vector representations of text