
Product Scraping

Overview

CROW's product onboarding system lets organizations quickly populate their product catalog through two methods: CSV import and AI-powered web scraping. The ai-scraping-service provides intelligent web scraping powered by AI to automatically extract product information from company websites.

Product Onboarding Flow

When users create their organization, they can add products through two methods:

1. CSV Import

Direct bulk upload of product data

  • Upload a CSV file with product information
  • Supports standard product fields (name, description, SKU, price, etc.)
  • Immediate import with validation
  • Files stored in R2, metadata in D1
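A minimal sketch of how CSV rows might be validated before import. The field names (name, sku, price, description) and the naive comma splitting are illustrative assumptions, not the service's actual parser:

```typescript
// Illustrative product-row shape; the real CSV schema may differ.
interface ProductRow {
  name: string;
  sku: string;
  price: number | null;
  description: string;
}

interface ParseResult {
  rows: ProductRow[];
  errors: string[]; // one message per rejected line
}

function parseProductCsv(csv: string): ParseResult {
  const lines = csv.trim().split(/\r?\n/);
  const header = lines[0].split(",").map((h) => h.trim().toLowerCase());
  const idx = (col: string) => header.indexOf(col);
  const rows: ProductRow[] = [];
  const errors: string[] = [];

  for (let i = 1; i < lines.length; i++) {
    const cells = lines[i].split(","); // naive: no quoted-comma support
    const name = cells[idx("name")]?.trim() ?? "";
    if (!name) {
      errors.push(`line ${i + 1}: missing required field "name"`);
      continue;
    }
    const rawPrice = cells[idx("price")]?.trim();
    const price = rawPrice ? Number(rawPrice) : null;
    if (price !== null && Number.isNaN(price)) {
      errors.push(`line ${i + 1}: invalid price "${rawPrice}"`);
      continue;
    }
    rows.push({
      name,
      sku: cells[idx("sku")]?.trim() ?? "",
      price,
      description: cells[idx("description")]?.trim() ?? "",
    });
  }
  return { rows, errors };
}
```

Rejected lines are reported rather than silently dropped, matching the "immediate import with validation" behavior.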

Use Cases:

  • Migrating from existing systems
  • Bulk product catalog updates
  • Standardized product data

2. Web Scraping

Automated extraction via Product Service

  • Provide company website URL
  • AI automatically discovers and extracts product information
  • Background processing via Cloudflare Queues
  • Uses Cloudflare Browser Rendering for dynamic content
  • Analyzes sitemap.xml for context
  • Refinement interface for quality assurance

Use Cases:

  • New organization setup
  • Automated product discovery
  • Initial product catalog population

Technology

  • Cloudflare Browser Rendering for JavaScript-rendered pages
  • Gemini AI via Vercel AI SDK for product extraction
  • Tavily for AI-optimized web search and lightweight content extraction
  • Sitemap.xml analysis for site structure understanding
  • Queue-based processing for scalability

Product Service

The Product Service handles all product-related operations including catalog management and web scraping.

Capabilities

  • Product CRUD: Create, read, update, delete products
  • Multi-page Crawling: Navigate complex website structures
  • Dynamic Content: Handle JavaScript-rendered pages via Browser Rendering
  • Product Extraction: Identify and extract product information using Gemini AI
  • Sitemap Analysis: Use sitemap.xml for context and site structure
  • Data Normalization: Standardize extracted data
  • Vectorize Integration: Maintain product embeddings for search

Technology Stack

  • TypeScript on Cloudflare Workers
  • Cloudflare Browser Rendering for web scraping
  • Gemini AI via Vercel AI SDK for extraction
  • Cloudflare AI Gateway for LLM request routing
  • Vectorize for product embeddings

Architecture

Workflow Steps

  1. URL Validation: Verify target website accessibility
  2. Sitemap Analysis: Parse sitemap.xml for site structure and context
  3. Page Crawling: Navigate and extract content using Browser Rendering
  4. AI Extraction: Use Gemini models to identify and extract products
  5. Data Normalization: Standardize product format
  6. Storage: Save to D1 and generate embeddings
  7. Vectorize: Store embeddings for semantic search
  8. User Refinement: Allow user to review and edit extracted data
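The steps above can be sketched as a sequential pipeline. The context fields and step names are illustrative, not the service's actual implementation:

```typescript
// Each step takes the accumulating job context and returns an updated one.
interface JobContext {
  url: string;
  sitemapUrls?: string[];
  pages?: string[];              // raw page content from crawling
  products?: { name: string }[]; // extracted products
}

type Step = (ctx: JobContext) => Promise<JobContext>;

// Runs steps in order, so a failure surfaces with the step's name attached.
async function runPipeline(ctx: JobContext, steps: [string, Step][]): Promise<JobContext> {
  for (const [name, step] of steps) {
    try {
      ctx = await step(ctx);
    } catch (err) {
      throw new Error(`step "${name}" failed: ${(err as Error).message}`);
    }
  }
  return ctx;
}
```

Threading one context object through the steps keeps intermediate results (sitemap URLs, page content) available to later stages like extraction and storage.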

Multi-Agent Extraction Pipeline

The Product Service uses a multi-agent orchestration system for intelligent product extraction. Each agent has a specialized role in the extraction workflow.

Agent Architecture

Agent Roles

| Agent | Role | Capability |
| --- | --- | --- |
| Planner | Strategy definition | Analyzes the target source and plans extraction steps |
| Discoverer | Content discovery | Finds relevant pages via web search, starting from known links |
| Extractor | Data extraction | Uses Browser Rendering to extract structured fields |
| Validator | Quality assurance | Normalizes fields into the expected schema and validates required attributes |
| Refiner | Enhancement | Improves records with extra context (cleaner descriptions, tags) |
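One way the agent roles above could be expressed in TypeScript. The interfaces and state fields are illustrative sketches, not the service's actual types:

```typescript
// State passed between agents; each role reads and extends it.
interface ExtractionState {
  sourceUrl: string;
  plan?: string[];       // steps or URLs chosen by the planner
  discovered?: string[]; // candidate product pages
  products?: { name: string; description: string }[];
}

interface Agent {
  role: "planner" | "discoverer" | "extractor" | "validator" | "refiner";
  run(state: ExtractionState): Promise<ExtractionState>;
}

// Orchestrator: each agent receives the previous agent's state.
async function orchestrate(agents: Agent[], sourceUrl: string): Promise<ExtractionState> {
  let state: ExtractionState = { sourceUrl };
  for (const agent of agents) {
    state = await agent.run(state);
  }
  return state;
}
```

Because every agent shares one state shape, individual agents can be swapped or re-run independently, which is what makes the reusable design below possible.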

Reusable Design

The agent structure supports reprocessing without full re-extraction:

  • Raw extraction artifacts stored in R2
  • Organizations can refine products using stored artifacts
  • Reduces redundant network requests
  • Enables quality improvements over time
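A sketch of the artifact-storage idea, assuming a key layout of `scrapes/<org>/<job>/<encoded-url>.html` (the layout is a guess for illustration):

```typescript
// Deterministic key scheme so refinement can locate a job's raw artifacts
// without re-crawling the site.
function artifactKey(orgId: string, jobId: string, pageUrl: string): string {
  const encoded = encodeURIComponent(pageUrl); // make the URL safe inside an object key
  return `scrapes/${orgId}/${jobId}/${encoded}.html`;
}

// Minimal slice of the R2 binding used during refinement.
interface R2Like {
  get(key: string): Promise<{ text(): Promise<string> } | null>;
}

// Re-run extraction against a stored artifact instead of the live site.
async function loadArtifact(bucket: R2Like, key: string): Promise<string | null> {
  const obj = await bucket.get(key);
  return obj ? obj.text() : null;
}
```

A deterministic key means the refinement flow only needs the org and job IDs to find every stored page.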

Scraping Process

Job Lifecycle

Rate Limiting

The ai-scraping-service implements responsible scraping:

  • Respects robots.txt
  • Configurable request delays
  • Domain-based rate limits
  • Automatic backoff on errors
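The delay and backoff rules might be sketched like this; the base delays and cap are illustrative defaults, not the service's configured values:

```typescript
// Exponential backoff with a cap; jitter omitted here for determinism.
function backoffMs(attempt: number, baseMs = 1000, maxMs = 60_000): number {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// Per-domain limiter: tracks the earliest time each domain may be hit again.
class DomainRateLimiter {
  private nextAllowed = new Map<string, number>();
  constructor(private minDelayMs: number) {}

  // Returns how long the caller should wait before requesting this URL.
  waitTimeMs(url: string, now: number): number {
    const domain = new URL(url).hostname;
    const next = this.nextAllowed.get(domain) ?? 0;
    const wait = Math.max(0, next - now);
    this.nextAllowed.set(domain, Math.max(next, now) + this.minDelayMs);
    return wait;
  }
}
```

Keying the limiter by hostname means two pages on the same site share one budget, while different sites can be crawled in parallel.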

Product Refinement

After scraping completes, users can review and refine results under Dashboard > Settings > Product Management:

Refinement Interface

  • View Extracted Products: See all products with extraction confidence scores
  • Edit Fields: Modify name, description, price, images
  • Select and Refine: Choose specific products or rows to edit
  • Bulk Operations: Edit multiple products at once
  • Apply Changes: Changes reflected in future analytics

Quality Metrics

  • Extraction confidence scores (AI certainty)
  • Field completeness percentage
  • Duplicate detection flags
  • Validation errors highlighted
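Field completeness could be computed along these lines (the expected-field list is an assumption):

```typescript
// Completeness: fraction of expected fields that are non-empty on a record.
function fieldCompleteness(
  product: Record<string, unknown>,
  expected: string[],
): number {
  const filled = expected.filter((f) => {
    const v = product[f];
    return v !== null && v !== undefined && v !== "";
  }).length;
  return expected.length === 0 ? 1 : filled / expected.length;
}
```

Surfacing this score per product lets the refinement UI sort the least-complete records to the top for review.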

Current Limitations

What You Can Do:

  • Upload CSV during initial setup
  • Provide URL for scraping during setup
  • Refine existing product data (edit fields, update information)
  • Manual additions and edits anytime

What You Cannot Do (Currently):

  • Re-scrape entire website after initial setup
  • Scrape a different URL after initial setup
  • Automated re-scraping on schedule

Rationale: Re-scraping requires product version management, which is complex. To preserve data integrity and audit trails, the system currently restricts re-scraping; full version tracking is planned for a future release.

Future Enhancements

Planned features for product management:

  • Re-scrape Functionality: Re-scrape entire website with version tracking
  • Separate URL Scraping: Scrape additional URLs after setup
  • Version History: View and compare product data across versions
  • Automated Updates: Schedule regular product updates
  • Change Detection: Alert on product changes
  • Product Relationships: Map related products and categories

Storage

D1 Schema

CREATE TABLE products (
  id TEXT PRIMARY KEY,
  org_id TEXT NOT NULL REFERENCES organizations(id),
  name TEXT NOT NULL,
  description TEXT,
  url TEXT,
  price REAL,
  currency TEXT,
  image_url TEXT,
  metadata JSON,
  source TEXT,
  confidence REAL,
  created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE scrape_jobs (
  id TEXT PRIMARY KEY,
  org_id TEXT NOT NULL REFERENCES organizations(id),
  url TEXT NOT NULL,
  status TEXT DEFAULT 'queued',
  products_found INTEGER DEFAULT 0,
  error TEXT,
  started_at DATETIME,
  completed_at DATETIME
);
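A hedged sketch of a parameterized insert against the products table above. The `D1Like` interface mirrors only the slice of the D1 binding used here:

```typescript
// Minimal slice of the D1 binding used in this sketch.
interface D1Like {
  prepare(sql: string): { bind(...values: unknown[]): { run(): Promise<unknown> } };
}

interface NewProduct {
  id: string;
  orgId: string;
  name: string;
  description?: string;
  price?: number;
  source?: string;
  confidence?: number;
}

// Parameterized insert; optional fields bind as NULL when absent.
function insertProduct(db: D1Like, p: NewProduct) {
  return db
    .prepare(
      `INSERT INTO products (id, org_id, name, description, price, source, confidence)
       VALUES (?, ?, ?, ?, ?, ?, ?)`,
    )
    .bind(
      p.id, p.orgId, p.name,
      p.description ?? null, p.price ?? null,
      p.source ?? null, p.confidence ?? null,
    )
    .run();
}
```

Binding values instead of interpolating them keeps scraped strings from breaking or injecting into the SQL.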

Vectorize Index

Product embeddings stored for semantic search:

  • Index: product-embeddings
  • Dimensions: 1536 (OpenAI ada-002)
  • Metadata: org_id, category, source
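An upsert against the index might look like this sketch; `VectorizeLike` is a minimal stand-in for the real binding, and the dimension guard reflects the 1536-dimension note above:

```typescript
// Minimal slice of the Vectorize binding used in this sketch.
interface VectorizeLike {
  upsert(
    vectors: { id: string; values: number[]; metadata?: Record<string, string> }[],
  ): Promise<unknown>;
}

// Guard the dimension before upserting so a mismatched embedding
// model fails loudly instead of corrupting search results.
async function upsertProductEmbedding(
  index: VectorizeLike,
  productId: string,
  embedding: number[],
  metadata: Record<string, string>,
  dimensions = 1536,
): Promise<void> {
  if (embedding.length !== dimensions) {
    throw new Error(`expected ${dimensions}-dim embedding, got ${embedding.length}`);
  }
  await index.upsert([{ id: productId, values: embedding, metadata }]);
}
```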

Error Handling

Common Errors

| Error | Cause | Resolution |
| --- | --- | --- |
| Site Unreachable | Network/DNS issues | Retry with backoff |
| Blocked | Anti-bot measures | Manual review |
| Parse Error | Unusual HTML structure | AI fallback |
| Rate Limited | Too many requests | Increase delays |
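The error table above could map to a concrete retry policy like this sketch; the retry counts and delays are illustrative, not the service's configured values:

```typescript
type ScrapeErrorKind = "unreachable" | "blocked" | "parse_error" | "rate_limited";

interface ErrorAction {
  retry: boolean;
  delayMs: number;   // 0 when no automated retry applies
  escalate: boolean; // route to manual review or AI fallback
}

// One action per error kind; exponential delays for transient failures,
// escalation for failures that automation cannot fix.
function resolveScrapeError(kind: ScrapeErrorKind, attempt: number): ErrorAction {
  switch (kind) {
    case "unreachable":
      return { retry: attempt < 3, delayMs: 1000 * 2 ** attempt, escalate: attempt >= 3 };
    case "rate_limited":
      return { retry: true, delayMs: 5000 * 2 ** attempt, escalate: false };
    case "parse_error":
      return { retry: false, delayMs: 0, escalate: true }; // hand off to AI fallback
    case "blocked":
      return { retry: false, delayMs: 0, escalate: true }; // manual review
  }
}
```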

Recovery

  • Partial results saved on failure
  • Jobs can be resumed
  • Manual scraping option for problematic sites