Skip to main content

Data Storage Architecture

Overview

The CROW Data Storage Architecture is built entirely on Cloudflare's edge infrastructure, providing efficient storage, processing, and retrieval of interaction data from various sources including websites, CCTV, and social media. This architecture leverages D1 (relational database), R2 (object storage), and Vectorize (vector database) for a unified, globally distributed storage solution.

Architecture Components

1. D1 Database

Cloudflare D1 serves as the primary relational database for structured data:

  • Structured Data: Users, organizations, products, API keys
  • Interaction Metadata: Session info, timestamps, sources
  • Processing State: Job status, workflow state
  • Configuration: Settings, preferences, rules

Key Features:

  • SQLite-based with familiar SQL syntax
  • Automatic global replication
  • Point-in-time recovery
  • Sub-millisecond read latency at edge

2. R2 Object Storage

Cloudflare R2 provides S3-compatible object storage:

  • Raw Interactions: Original event payloads
  • CCTV Frames: Extracted video frames
  • Export Files: Generated reports (PDF, CSV)
  • Assets: Images, documents, attachments

Key Features:

  • Zero egress fees
  • S3 API compatibility
  • Automatic global distribution
  • Lifecycle policies for data retention

3. Vectorize

Cloudflare Vectorize provides AI-native vector storage and search:

  • Interaction Embeddings: Vectorized interaction text from all sources
  • Product Embeddings: Product description vectors
  • Social Embeddings: Social media content vectors
  • Organization Context: Organization-specific context embeddings
  • Pattern Embeddings: Discovered pattern vectors

Key Features:

  • High-dimensional vector storage (1536 dimensions for Gemini embeddings)
  • Fast similarity search
  • Metadata filtering
  • Automatic index optimization
  • Used by Chat Service for semantic search

4. Durable Objects

Cloudflare Durable Objects for stateful session management:

  • Web Session State: Active user session tracking
  • Inactivity Alarms: Automatic 1-hour inactivity triggers
  • Builder State: Signup flow state management

Key Features:

  • Strong consistency
  • Low-latency state access
  • Automatic alarm triggers
  • Per-session isolation

Data Flow

Ingestion Flow

Storage Architecture

Retrieval Flow

Detailed Process Flow

1. New Interaction Processing

When a new interaction is received from any component (Web, Social, CCTV):

  1. Reception: Component Worker or Durable Object receives raw data
  2. Queue Dispatch: Session data sent to Interaction Queue
  3. Interaction Service: Consumes queue message
  4. Context Retrieval: Fetches Organization and Product context
  5. AI Processing: Gemini AI analyzes data via Vercel AI SDK
  6. Storage:
    • D1: Metadata record with unique ID, source, timestamps
    • R2: Raw payload (optional, for archival)
    • Vectorize: Text embeddings for semantic search
  7. Analytics: Notifies Analytics Service for billing tracking

2. Pattern Discovery

Pattern Service processes interactions periodically:

  1. Cron Trigger: Daily execution
  2. Retrieve Interactions: Last 24 hours from D1/Vectorize
  3. Context Enrichment: Add Organization and Product context
  4. AI Analysis: Gemini AI identifies patterns via Vercel AI SDK
  5. Storage:
    • D1: Pattern metadata with time period (day, week, month, year)
    • Vectorize: Pattern embeddings
  6. Self-triggering: Queue messages for multi-period aggregation

3. Data Retrieval Process

When retrieving data (Dashboard, Chat, API):

  1. Authentication: Gateway validates session/API key, issues JWT
  2. Pre-filtering: Service queries D1 by criteria (date, product, source, org)
  3. Semantic Search: Vectorize performs vector similarity search
  4. Results Assembly: Matching IDs fetch complete metadata from D1
  5. Optional Raw Data: Retrieve original content from R2 if needed
  6. Permission Check: Validate user has access to requested data

Database Schema (D1)

Core Tables

-- Organizations
CREATE TABLE organizations (
id TEXT PRIMARY KEY,
name TEXT NOT NULL,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
settings JSON
);

-- Users (1:1 with Organizations)
CREATE TABLE users (
id TEXT PRIMARY KEY,
org_id TEXT NOT NULL REFERENCES organizations(id),
email TEXT UNIQUE NOT NULL,
username TEXT,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

-- User Permissions
CREATE TABLE user_permissions (
user_id TEXT PRIMARY KEY REFERENCES users(id),
-- Chat constraints
chat_enabled BOOLEAN DEFAULT true,
chat_min_duration TEXT DEFAULT '1yr',
chat_components JSON DEFAULT '["web"]',
-- Feature toggles
interactions_enabled BOOLEAN DEFAULT true,
patterns_enabled BOOLEAN DEFAULT true,
teams_enabled BOOLEAN DEFAULT false,
api_keys_enabled BOOLEAN DEFAULT false,
billing_enabled BOOLEAN DEFAULT false,
settings_enabled BOOLEAN DEFAULT false,
notifications_enabled BOOLEAN DEFAULT true,
integrations_enabled BOOLEAN DEFAULT false,
products_enabled BOOLEAN DEFAULT false,
updated_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

-- API Keys
CREATE TABLE api_keys (
id TEXT PRIMARY KEY,
org_id TEXT NOT NULL REFERENCES organizations(id),
key_hash TEXT NOT NULL,
name TEXT,
expires_at DATETIME,
-- Permissions
interactions_enabled BOOLEAN DEFAULT true,
patterns_enabled BOOLEAN DEFAULT true,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
last_used DATETIME
);

-- Products
CREATE TABLE products (
id TEXT PRIMARY KEY,
org_id TEXT REFERENCES organizations(id),
name TEXT NOT NULL,
description TEXT,
metadata JSON
);

-- Interactions
CREATE TABLE interactions (
id TEXT PRIMARY KEY,
org_id TEXT NOT NULL REFERENCES organizations(id),
session_id TEXT NOT NULL,
source TEXT NOT NULL, -- 'web', 'social', 'cctv'
description TEXT, -- AI-generated textual description
products JSON, -- Array of product IDs involved
created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
metadata JSON,
vector_id TEXT -- Reference to Vectorize embedding
);

-- Patterns
CREATE TABLE patterns (
id TEXT PRIMARY KEY,
org_id TEXT NOT NULL REFERENCES organizations(id),
time_period TEXT NOT NULL, -- 'day', 'week', 'month', 'year', 'all_time'
description TEXT, -- AI-generated pattern description
products JSON, -- Array of product IDs involved
confidence REAL, -- Pattern confidence score
created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
metadata JSON,
vector_id TEXT
);

-- Chat History
CREATE TABLE chat_sessions (
id TEXT PRIMARY KEY,
user_id TEXT NOT NULL REFERENCES users(id),
org_id TEXT NOT NULL REFERENCES organizations(id),
title TEXT,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
updated_at DATETIME
);

CREATE TABLE chat_messages (
id TEXT PRIMARY KEY,
session_id TEXT NOT NULL REFERENCES chat_sessions(id),
role TEXT NOT NULL, -- 'user' or 'assistant'
content TEXT NOT NULL,
artifacts JSON, -- Generated files, charts, etc.
created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

-- Notifications
CREATE TABLE notifications (
id TEXT PRIMARY KEY,
user_id TEXT NOT NULL REFERENCES users(id),
type TEXT NOT NULL,
title TEXT NOT NULL,
content TEXT,
read BOOLEAN DEFAULT false,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

-- Social Media Link Tracking
CREATE TABLE seen_links (
url TEXT PRIMARY KEY,
org_id TEXT NOT NULL,
first_seen DATETIME DEFAULT CURRENT_TIMESTAMP,
last_checked DATETIME,
content_hash TEXT,
source TEXT -- 'search' or 'direct'
);

R2 Bucket Structure

crow-storage/
├── raw/
│ ├── interactions/
│ │ └── {org_id}/{component}/{date}/{interaction_id}.json
│ └── sessions/
│ └── {org_id}/{component}/{session_id}.json
├── chat/
│ ├── artifacts/
│ │ └── {org_id}/{chat_id}/{artifact_id}
│ └── assets/
│ └── {org_id}/{chat_id}/{asset_id}
├── exports/
│ └── {org_id}/{export_id}/{filename}
└── products/
└── {org_id}/{product_id}/{image}

Directory Purposes:

  • raw/: Optional raw interaction and session data archival
  • chat/: Chat-generated artifacts (reports, charts) and assets (images)
  • exports/: User-requested data exports (CSV, PDF, JSON)
  • products/: Product images and media

Vectorize Indexes

interaction-embeddings

  • Dimensions: 1536 (Gemini embeddings)
  • Metric: Cosine similarity
  • Metadata: org_id, source (web/social/cctv), session_id, created_at
  • Purpose: Semantic search across all interactions

pattern-embeddings

  • Dimensions: 1536 (Gemini embeddings)
  • Metric: Cosine similarity
  • Metadata: org_id, time_period, confidence, created_at
  • Purpose: Pattern discovery and trend analysis

product-embeddings

  • Dimensions: 1536 (Gemini embeddings)
  • Metric: Cosine similarity
  • Metadata: org_id, category, price_range
  • Purpose: Product similarity and recommendations

organization-context-embeddings

  • Dimensions: 1536 (Gemini embeddings)
  • Metric: Cosine similarity
  • Metadata: org_id, context_type, updated_at
  • Purpose: Organization-specific context for AI processing

All embeddings generated via Gemini AI through Vercel AI SDK

Technology Stack

ComponentCloudflare ServicePurpose
Relational DBD1Structured data, metadata
Object StorageR2Raw data, files, exports, chat artifacts
Vector DBVectorizeSemantic search, AI embeddings
Session StateDurable ObjectsWeb session management, builder state
CachingWorkers KVSession cache, JWT mapping (1-5 min TTL)
QueuingQueuesAsync processing, service communication

AI Technology:

  • Models: Google Gemini (all AI processing)
  • Framework: Vercel AI SDK (agent orchestration)
  • Gateway: Cloudflare AI Gateway (LLM request routing)

Advantages

Cloudflare Platform Benefits

  • Global Distribution: Data served from 300+ edge locations
  • Zero Egress: No fees for data transfer out of R2
  • Integrated: Seamless connectivity between D1, R2, Vectorize
  • Simplified Operations: No server management required
  • Cost Effective: Pay-per-use pricing model

Architecture Benefits

  • Separation of Concerns: Relational, object, and vector data stored optimally
  • Flexibility: Each layer optimized for its specific use case
  • Scalability: Cloudflare manages scaling automatically
  • Performance: Edge distribution ensures low latency globally

Data Retention & Lifecycle

Retention Policies

  • Interactions Metadata: Retained indefinitely in D1
  • Pattern Metadata: Retained indefinitely (all time periods)
  • Raw Interaction Data: Optional archival in R2 (30 days hot, then archive)
  • Chat History: Retained indefinitely
  • Chat Artifacts: 30 days in R2
  • Exports: 7 days in R2, auto-deleted after
  • Session Cache: 1-5 minutes TTL in Workers KV

Privacy Compliance:

  • GDPR: User can request data deletion from settings
  • Organization Context: Clearable from dashboard
  • CCTV Data: Not stored long-term (processed in real-time)

R2 Lifecycle Rules

{
"rules": [
{
"id": "archive-raw-interactions",
"filter": { "prefix": "raw/interactions/" },
"transitions": [
{ "days": 30, "storageClass": "ARCHIVE" }
]
},
{
"id": "delete-chat-artifacts",
"filter": { "prefix": "chat/" },
"expiration": { "days": 30 }
},
{
"id": "delete-exports",
"filter": { "prefix": "exports/" },
"expiration": { "days": 7 }
}
]
}

Future Considerations

Scalability Planning

  • Monitor D1 database size and query patterns
  • Implement data partitioning if needed
  • Plan for multi-region data residency requirements

Performance Optimization

  • Implement caching layer with Workers KV
  • Optimize Vectorize index configuration
  • Consider hybrid search (vector + keyword)

Compliance

  • GDPR data deletion workflows
  • Data residency controls
  • Audit logging for access

Glossary

  • D1: Cloudflare's serverless SQLite database
  • R2: Cloudflare's S3-compatible object storage
  • Vectorize: Cloudflare's vector database for AI embeddings
  • Embeddings: Numerical vector representations of text