Data Storage Architecture

Overview

The CROW Data Storage Architecture is built entirely on Cloudflare's edge infrastructure. It stores, processes, and retrieves interaction data from sources including websites, CCTV, and social media, using D1 (relational database), R2 (object storage), and Vectorize (vector database) as a unified, globally distributed storage layer.

Architecture Components

1. D1 Database

Cloudflare D1 serves as the primary relational database for structured data:

  • Structured Data: Users, organizations, products, API keys
  • Interaction Metadata: Session info, timestamps, sources
  • Processing State: Job status, workflow state
  • Configuration: Settings, preferences, rules

Key Features:

  • SQLite-based with familiar SQL syntax
  • Optional read replication across regions
  • Point-in-time recovery (Time Travel)
  • Low-latency reads from Workers

2. R2 Object Storage

Cloudflare R2 provides S3-compatible object storage:

  • Raw Interactions: Original event payloads
  • CCTV Frames: Extracted video frames
  • Export Files: Generated reports (PDF, CSV)
  • Assets: Images, documents, attachments

Key Features:

  • Zero egress fees
  • S3 API compatibility
  • Globally accessible through Cloudflare's network
  • Lifecycle policies for data retention

3. Vectorize

Cloudflare Vectorize provides AI-native vector storage and search:

  • Interaction Embeddings: Vectorized interaction text
  • Product Embeddings: Product description vectors
  • Social Embeddings: Social media content vectors

Key Features:

  • High-dimensional vector storage
  • Fast similarity search
  • Metadata filtering
  • Automatic index optimization

Data Flow

Ingestion Flow

[Diagram: ingestion flow]

Storage Architecture

[Diagram: storage architecture]

Retrieval Flow

[Diagram: retrieval flow]

Detailed Process Flow

1. New Interaction Processing

When a new interaction is received:

  1. Reception: Ingestion Worker receives raw data from SDK
  2. Validation: Data schema and API key validation
  3. Storage:
    • R2: Raw payload stored for archival
    • D1: Metadata record created with unique ID
    • Vectorize: Text vectorized and stored for semantic search
  4. Queue: Session queued for AI processing via Cloudflare Queues
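
The four steps above can be sketched as a single function. This is a minimal sketch under stated assumptions, not CROW's actual Ingestion Worker: the `IngestEnv` shape, the binding names (`BUCKET`, `DB`, `INDEX`, `QUEUE`), the `embed` helper, and the payload fields are all hypothetical. The store interfaces are narrow stand-ins shaped like the Workers bindings (R2 `put`, D1 `prepare().bind().run()`, Vectorize `upsert`, Queues `send`), so the example runs without the Cloudflare runtime. API key validation is omitted.

```typescript
// Narrow stand-ins for the Workers bindings: only the methods this sketch calls.
interface ObjectStore { put(key: string, value: string): Promise<unknown>; }
interface WriteStatement { bind(...values: unknown[]): WriteStatement; run(): Promise<unknown>; }
interface WriteDatabase { prepare(sql: string): WriteStatement; }
interface VectorRecord { id: string; values: number[]; metadata?: Record<string, string>; }
interface VectorStore { upsert(vectors: VectorRecord[]): Promise<unknown>; }
interface JobQueue { send(body: unknown): Promise<unknown>; }

// Hypothetical environment; the binding names are assumptions.
interface IngestEnv {
  BUCKET: ObjectStore;
  DB: WriteDatabase;
  INDEX: VectorStore;
  QUEUE: JobQueue;
  embed(text: string): Promise<number[]>; // assumed embedding helper
}

interface RawInteraction {
  id: string; orgId: string; sessionId: string;
  source: string; type: string; text: string;
}

function requireString(input: Record<string, unknown>, field: string): string {
  const value = input[field];
  if (typeof value !== "string" || value === "") {
    throw new Error(`missing or invalid field: ${field}`);
  }
  return value;
}

// Step 2: schema validation (a simple shape check in this sketch).
function parseInteraction(input: Record<string, unknown>): RawInteraction {
  return {
    id: requireString(input, "id"),
    orgId: requireString(input, "orgId"),
    sessionId: requireString(input, "sessionId"),
    source: requireString(input, "source"),
    type: requireString(input, "type"),
    text: requireString(input, "text"),
  };
}

async function ingest(
  env: IngestEnv,
  input: Record<string, unknown>,
  now: Date = new Date(),
): Promise<string> {
  const item = parseInteraction(input);
  const date = now.toISOString().slice(0, 10); // YYYY-MM-DD in UTC (assumed format)

  // Step 3a: raw payload goes to R2 for archival.
  await env.BUCKET.put(
    `raw/interactions/${item.orgId}/${date}/${item.id}.json`,
    JSON.stringify(item),
  );

  // Step 3b: metadata row in D1. Reusing the interaction ID as vector_id is an assumption.
  await env.DB
    .prepare(
      "INSERT INTO interactions (id, org_id, session_id, source, type, vector_id) " +
        "VALUES (?, ?, ?, ?, ?, ?)",
    )
    .bind(item.id, item.orgId, item.sessionId, item.source, item.type, item.id)
    .run();

  // Step 3c: embed the text and store the vector for semantic search.
  const values = await env.embed(item.text);
  await env.INDEX.upsert([
    { id: item.id, values, metadata: { org_id: item.orgId, source: item.source, type: item.type } },
  ]);

  // Step 4: queue the session for AI processing.
  await env.QUEUE.send({ sessionId: item.sessionId, orgId: item.orgId });
  return item.id;
}
```

The writes run sequentially here for readability. A failure partway through leaves earlier writes in place, so a real pipeline would need idempotent keys or a cleanup step.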

2. Data Retrieval Process

When retrieving interactions:

  1. Pre-filtering: D1 is queried to filter by criteria (date, product, source)
  2. Vector Search: Vectorize performs semantic search restricted to the filtered set
  3. Results Assembly: Complete metadata for the matching IDs is fetched from D1
  4. Raw Data: Raw content is optionally retrieved from R2
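
A possible shape for the first three retrieval steps is sketched below. It assumes that `org_id` and `source` are filterable metadata on the index and that vector IDs equal interaction IDs; neither is confirmed by this document. `searchInteractions`, `SearchCriteria`, and the interface names are hypothetical, and the interfaces only mirror the parts of D1 (`prepare().bind().all()`) and Vectorize (`query`) that the sketch calls.

```typescript
type Row = Record<string, unknown>;

// Narrow stand-ins for the D1 and Vectorize bindings.
interface ReadStatement { bind(...values: unknown[]): ReadStatement; all(): Promise<{ results: Row[] }>; }
interface ReadDatabase { prepare(sql: string): ReadStatement; }
interface VectorMatch { id: string; score: number; }
interface SearchIndex {
  query(
    vector: number[],
    options: { topK: number; filter?: Record<string, string> },
  ): Promise<{ matches: VectorMatch[] }>;
}

interface SearchCriteria { orgId: string; source: string; since: string; }
interface ScoredInteraction { row: Row; score: number; }

async function searchInteractions(
  db: ReadDatabase,
  index: SearchIndex,
  criteria: SearchCriteria,
  queryVector: number[],
  topK = 10,
): Promise<ScoredInteraction[]> {
  // Step 1: pre-filter in D1 to get the candidate interaction IDs.
  const candidates = await db
    .prepare("SELECT id FROM interactions WHERE org_id = ? AND source = ? AND created_at >= ?")
    .bind(criteria.orgId, criteria.source, criteria.since)
    .all();
  const allowed = new Set(candidates.results.map((r) => String(r["id"])));
  if (allowed.size === 0) return [];

  // Step 2: semantic search, narrowed by metadata and then by the candidate set.
  const { matches } = await index.query(queryVector, {
    topK,
    filter: { org_id: criteria.orgId, source: criteria.source },
  });
  const hits = matches.filter((m) => allowed.has(m.id));
  if (hits.length === 0) return [];

  // Step 3: fetch complete metadata for the matching IDs.
  const placeholders = hits.map(() => "?").join(", ");
  const rows = await db
    .prepare(`SELECT * FROM interactions WHERE id IN (${placeholders})`)
    .bind(...hits.map((m) => m.id))
    .all();
  const byId = new Map<string, Row>();
  for (const row of rows.results) byId.set(String(row["id"]), row);

  // Keep the similarity order returned by the vector search.
  const out: ScoredInteraction[] = [];
  for (const hit of hits) {
    const row = byId.get(hit.id);
    if (row !== undefined) out.push({ row, score: hit.score });
  }
  return out;
}
```

Intersecting after the vector query means fewer than `topK` results can come back when the date filter excludes many matches; requesting a larger `topK` is one way to compensate.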

Database Schema (D1)

Core Tables

-- Organizations
CREATE TABLE organizations (
  id TEXT PRIMARY KEY,
  name TEXT NOT NULL,
  created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
  settings JSON
);

-- Users
CREATE TABLE users (
  id TEXT PRIMARY KEY,
  org_id TEXT REFERENCES organizations(id),
  email TEXT UNIQUE NOT NULL,
  role TEXT DEFAULT 'member'
);

-- Products
CREATE TABLE products (
  id TEXT PRIMARY KEY,
  org_id TEXT REFERENCES organizations(id),
  name TEXT NOT NULL,
  description TEXT,
  metadata JSON
);

-- Interactions
CREATE TABLE interactions (
  id TEXT PRIMARY KEY,
  org_id TEXT REFERENCES organizations(id),
  session_id TEXT,
  source TEXT,
  type TEXT,
  created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
  metadata JSON,
  vector_id TEXT
);

R2 Bucket Structure

crow-storage/
├── raw/
│   ├── interactions/
│   │   └── {org_id}/{date}/{interaction_id}.json
│   └── sessions/
│       └── {org_id}/{session_id}.json
├── cctv/
│   └── {org_id}/{camera_id}/{timestamp}.jpg
├── exports/
│   └── {org_id}/{export_id}/{filename}
└── assets/
    └── {org_id}/{asset_id}
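
Object keys for this layout are easiest to keep consistent when they are built in one place. The helpers below are a sketch: the `r2Keys` name, the UTC `YYYY-MM-DD` form of `{date}`, and the epoch-millisecond form of `{timestamp}` are assumptions, since the layout above does not specify those formats.

```typescript
// Reject empty segments and path separators so a value cannot escape its prefix.
function segment(value: string): string {
  if (value === "" || value.includes("/")) {
    throw new Error(`invalid key segment: "${value}"`);
  }
  return value;
}

// {date} rendered as YYYY-MM-DD in UTC (assumed format).
function utcDate(at: Date): string {
  return at.toISOString().slice(0, 10);
}

const r2Keys = {
  rawInteraction: (orgId: string, at: Date, interactionId: string): string =>
    `raw/interactions/${segment(orgId)}/${utcDate(at)}/${segment(interactionId)}.json`,
  rawSession: (orgId: string, sessionId: string): string =>
    `raw/sessions/${segment(orgId)}/${segment(sessionId)}.json`,
  // {timestamp} rendered as epoch milliseconds (assumed format).
  cctvFrame: (orgId: string, cameraId: string, at: Date): string =>
    `cctv/${segment(orgId)}/${segment(cameraId)}/${at.getTime()}.jpg`,
  exportFile: (orgId: string, exportId: string, filename: string): string =>
    `exports/${segment(orgId)}/${segment(exportId)}/${segment(filename)}`,
  asset: (orgId: string, assetId: string): string =>
    `assets/${segment(orgId)}/${segment(assetId)}`,
};
```

Leading every key with a fixed top-level prefix (`raw/`, `cctv/`, `exports/`, `assets/`) also lets prefix-based lifecycle rules target each data class separately.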

Vectorize Indexes

interaction-embeddings

  • Dimensions: 1536 (OpenAI text-embedding-ada-002)
  • Metric: Cosine similarity
  • Metadata: org_id, source, type, created_at

product-embeddings

  • Dimensions: 1536
  • Metric: Cosine similarity
  • Metadata: org_id, category, status

social-embeddings

  • Dimensions: 1536
  • Metric: Cosine similarity
  • Metadata: org_id, platform, sentiment
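
All three indexes use cosine similarity, which compares the direction of two vectors and ignores their magnitude. The function below is a plain reference implementation for illustration, not the code Vectorize runs; `cosineSimilarity` and `EMBEDDING_DIMENSIONS` are names chosen here.

```typescript
// Dimension count taken from the index definitions above.
const EMBEDDING_DIMENSIONS = 1536;

function dot(a: number[], b: number[]): number {
  let sum = 0;
  a.forEach((x, i) => {
    sum += x * (b[i] ?? 0); // lengths are validated by the caller
  });
  return sum;
}

// cos(a, b) = (a . b) / (|a| * |b|), in the range [-1, 1].
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error("vectors must be non-empty and of equal length");
  }
  const normProduct = Math.sqrt(dot(a, a) * dot(b, b));
  if (normProduct === 0) {
    throw new Error("cosine similarity is undefined for a zero vector");
  }
  return dot(a, b) / normProduct;
}

// Guard to apply before upserting into one of the 1536-dimension indexes.
function hasIndexDimensions(vector: number[]): boolean {
  return vector.length === EMBEDDING_DIMENSIONS;
}
```

Vectors pointing the same way score 1 regardless of length, orthogonal vectors score 0, and opposite vectors score -1.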

Technology Stack

Component | Cloudflare Service | Purpose
Relational DB | D1 | Structured data, metadata
Object Storage | R2 | Raw data, files, exports
Vector DB | Vectorize | Semantic search, embeddings
Caching | Workers KV | Session cache, hot data
Queuing | Queues | Async processing

Advantages

Cloudflare Platform Benefits

  • Global Distribution: Data served from 300+ edge locations
  • Zero Egress: No fees for data transfer out of R2
  • Integrated: Seamless connectivity between D1, R2, Vectorize
  • Simplified Operations: No server management required
  • Cost Effective: Pay-per-use pricing model

Architecture Benefits

  • Separation of Concerns: Relational, object, and vector data stored optimally
  • Flexibility: Each layer optimized for its specific use case
  • Scalability: Cloudflare manages scaling automatically
  • Performance: Edge distribution ensures low latency globally

Data Retention & Lifecycle

Retention Policies

  • Interactions: 90 days in hot storage, then archived
  • Raw Data: 30 days, then moved to a compressed archive
  • Exports: 7 days, then deleted automatically
  • CCTV Frames: 24 hours, for privacy compliance

R2 Lifecycle Rules

{
  "rules": [
    {
      "id": "archive-raw-data",
      "filter": { "prefix": "raw/" },
      "transitions": [
        { "days": 30, "storageClass": "ARCHIVE" }
      ]
    },
    {
      "id": "delete-exports",
      "filter": { "prefix": "exports/" },
      "expiration": { "days": 7 }
    }
  ]
}
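
To make the intent of these rules concrete, the sketch below evaluates them against an object key and its age. It models the JSON above as written and is not how R2 applies lifecycle rules internally. The `LifecycleRule` and `LifecycleAction` types, the first-matching-rule order, and treating a threshold as reached at `ageDays >= days` are assumptions.

```typescript
interface LifecycleRule {
  id: string;
  filter: { prefix: string };
  transitions?: { days: number; storageClass: string }[];
  expiration?: { days: number };
}

type LifecycleAction =
  | { kind: "keep" }
  | { kind: "transition"; ruleId: string; storageClass: string }
  | { kind: "expire"; ruleId: string };

// The two rules from the configuration above.
const lifecycleRules: LifecycleRule[] = [
  {
    id: "archive-raw-data",
    filter: { prefix: "raw/" },
    transitions: [{ days: 30, storageClass: "ARCHIVE" }],
  },
  { id: "delete-exports", filter: { prefix: "exports/" }, expiration: { days: 7 } },
];

function evaluateLifecycle(
  rules: LifecycleRule[],
  key: string,
  ageDays: number,
): LifecycleAction {
  for (const rule of rules) {
    if (!key.startsWith(rule.filter.prefix)) continue;

    // Expiration takes precedence over any transition in the same rule.
    if (rule.expiration !== undefined && ageDays >= rule.expiration.days) {
      return { kind: "expire", ruleId: rule.id };
    }

    // Otherwise apply the latest transition whose threshold has passed.
    const due = (rule.transitions ?? [])
      .filter((t) => ageDays >= t.days)
      .sort((a, b) => b.days - a.days)[0];
    if (due !== undefined) {
      return { kind: "transition", ruleId: rule.id, storageClass: due.storageClass };
    }
  }
  return { kind: "keep" };
}
```

With these rules, an object under `raw/` that is 30 days old is due for transition, an object under `exports/` is due for deletion at 7 days, and objects under `cctv/` or `assets/` are never touched, so the 24-hour CCTV policy needs a rule or job of its own.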

Future Considerations

Scalability Planning

  • Monitor D1 database size and query patterns
  • Implement data partitioning if needed
  • Plan for multi-region data residency requirements

Performance Optimization

  • Implement caching layer with Workers KV
  • Optimize Vectorize index configuration
  • Consider hybrid search (vector + keyword)
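
One common way to add a Workers KV caching layer is the cache-aside pattern: read from the cache, fall back to the source on a miss, then store the result with a TTL. The sketch below assumes JSON-serializable values; `cached` and `KeyValueStore` are hypothetical names, and the interface mirrors only the KV `get` and `put` (with `expirationTtl`) calls it needs.

```typescript
// Narrow stand-in for a Workers KV namespace binding.
interface KeyValueStore {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, options?: { expirationTtl?: number }): Promise<void>;
}

// Cache-aside: return the cached value if present, otherwise load and cache it.
async function cached<T>(
  kv: KeyValueStore,
  key: string,
  ttlSeconds: number,
  load: () => Promise<T>,
): Promise<T> {
  const hit = await kv.get(key);
  if (hit !== null) {
    return JSON.parse(hit) as T; // assumes the entry was written by this function
  }
  const value = await load();
  await kv.put(key, JSON.stringify(value), { expirationTtl: ttlSeconds });
  return value;
}
```

A caller might wrap a D1 lookup as `cached(kv, "org:" + orgId, 300, () => loadOrganization(orgId))`, where `loadOrganization` is a hypothetical loader. KV is eventually consistent, so this suits hot, read-mostly data more than values that must reflect writes immediately.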

Compliance

  • GDPR data deletion workflows
  • Data residency controls
  • Audit logging for access

Glossary

  • D1: Cloudflare's serverless SQLite database
  • R2: Cloudflare's S3-compatible object storage
  • Vectorize: Cloudflare's vector database for AI embeddings
  • Embeddings: Numerical vector representations of text