
Product Scraping

Overview

CROW's product onboarding system lets organizations quickly populate their product catalog through two methods: CSV import and AI-powered web scraping. The ai-scraping-service provides intelligent web scraping powered by AI to automatically extract product information from company websites.

Product Onboarding Flow

When users create their organization, they can add products through two methods:

1. CSV Import

Direct bulk upload of product data

  • Upload a CSV file with product information
  • Supports standard product fields (name, description, SKU, price, etc.)
  • Immediate import with validation
  • Files stored in R2, metadata in D1
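A minimal sketch of how CSV rows might be validated before import. The field names (name, sku, price, description) and the naive comma splitting are illustrative assumptions, not the service's actual parser:

```typescript
// Illustrative product-row shape; the real CSV schema may differ.
interface ProductRow {
  name: string;
  sku: string;
  price: number | null;
  description: string;
}

interface ParseResult {
  rows: ProductRow[];
  errors: string[]; // one message per rejected line
}

function parseProductCsv(csv: string): ParseResult {
  const lines = csv.trim().split(/\r?\n/);
  const header = lines[0].split(",").map((h) => h.trim().toLowerCase());
  const idx = (col: string) => header.indexOf(col);
  const rows: ProductRow[] = [];
  const errors: string[] = [];

  for (let i = 1; i < lines.length; i++) {
    const cells = lines[i].split(","); // naive: no quoted-comma support
    const name = cells[idx("name")]?.trim() ?? "";
    if (!name) {
      errors.push(`line ${i + 1}: missing required field "name"`);
      continue;
    }
    const rawPrice = cells[idx("price")]?.trim();
    const price = rawPrice ? Number(rawPrice) : null;
    if (price !== null && Number.isNaN(price)) {
      errors.push(`line ${i + 1}: invalid price "${rawPrice}"`);
      continue;
    }
    rows.push({
      name,
      sku: cells[idx("sku")]?.trim() ?? "",
      price,
      description: cells[idx("description")]?.trim() ?? "",
    });
  }
  return { rows, errors };
}
```

Rejected lines are reported rather than silently dropped, matching the "immediate import with validation" behavior.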

Use Cases:

  • Migrating from existing systems
  • Bulk product catalog updates
  • Standardized product data

2. Web Scraping

Automated extraction via Product Service

  • Provide company website URL
  • AI automatically discovers and extracts product information
  • Background processing via Cloudflare Queues
  • Uses Cloudflare Browser Rendering for dynamic content
  • Analyzes sitemap.xml for context
  • Refinement interface for quality assurance

Use Cases:

  • New organization setup
  • Automated product discovery
  • Initial product catalog population

Technology

  • Cloudflare Browser Rendering for JavaScript-rendered pages
  • Gemini AI via Vercel AI SDK for product extraction
  • Tavily for AI-optimized web search and lightweight content extraction
  • Sitemap.xml analysis for site structure understanding
  • Queue-based processing for scalability

Product Service

The Product Service handles all product-related operations including catalog management and web scraping.

Capabilities

  • Product CRUD: Create, read, update, delete products
  • Multi-page Crawling: Navigate complex website structures
  • Dynamic Content: Handle JavaScript-rendered pages via Browser Rendering
  • Product Extraction: Identify and extract product information using Gemini AI
  • Sitemap Analysis: Use sitemap.xml for context and site structure
  • Data Normalization: Standardize extracted data
  • Vectorize Integration: Maintain product embeddings for search

Technology Stack

  • TypeScript on Cloudflare Workers
  • Cloudflare Browser Rendering for web scraping
  • Gemini AI via Vercel AI SDK for extraction
  • Cloudflare AI Gateway for LLM request routing
  • Vectorize for product embeddings

Architecture

Workflow Steps

  1. URL Validation: Verify target website accessibility
  2. Sitemap Analysis: Parse sitemap.xml for site structure and context
  3. Page Crawling: Navigate and extract content using Browser Rendering
  4. AI Extraction: Use Gemini models to identify and extract products
  5. Data Normalization: Standardize product format
  6. Storage: Save to D1 and generate embeddings
  7. Vectorize: Store embeddings for semantic search
  8. User Refinement: Allow user to review and edit extracted data
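The steps above can be sketched as a sequential pipeline. The context fields and step names are illustrative, not the service's actual implementation:

```typescript
// Each step takes the accumulating job context and returns an updated one.
interface JobContext {
  url: string;
  sitemapUrls?: string[];
  pages?: string[];              // raw page content from crawling
  products?: { name: string }[]; // extracted products
}

type Step = (ctx: JobContext) => Promise<JobContext>;

// Runs steps in order, so a failure surfaces with the step's name attached.
async function runPipeline(ctx: JobContext, steps: [string, Step][]): Promise<JobContext> {
  for (const [name, step] of steps) {
    try {
      ctx = await step(ctx);
    } catch (err) {
      throw new Error(`step "${name}" failed: ${(err as Error).message}`);
    }
  }
  return ctx;
}
```

Threading one context object through the steps keeps intermediate results (sitemap URLs, page content) available to later stages like extraction and storage.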

Multi-Agent Extraction Pipeline

The Product Service uses a multi-agent orchestration system for intelligent product extraction. Each agent has a specialized role in the extraction workflow.

Agent Architecture

Agent Roles

| Agent | Role | Capability |
| --- | --- | --- |
| Planner | Strategy definition | Analyzes the target source and plans extraction steps |
| Discoverer | Content discovery | Finds relevant pages via web search, starting from known links |
| Extractor | Data extraction | Uses Browser Rendering to extract structured fields |
| Validator | Quality assurance | Normalizes fields into the expected schema and validates required attributes |
| Refiner | Enhancement | Improves records with extra context (cleaner descriptions, tags) |
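One way the agent roles above could be expressed in TypeScript. The interfaces and state fields are illustrative sketches, not the service's actual types:

```typescript
// State passed between agents; each role reads and extends it.
interface ExtractionState {
  sourceUrl: string;
  plan?: string[];       // steps or URLs chosen by the planner
  discovered?: string[]; // candidate product pages
  products?: { name: string; description: string }[];
}

interface Agent {
  role: "planner" | "discoverer" | "extractor" | "validator" | "refiner";
  run(state: ExtractionState): Promise<ExtractionState>;
}

// Orchestrator: each agent receives the previous agent's state.
async function orchestrate(agents: Agent[], sourceUrl: string): Promise<ExtractionState> {
  let state: ExtractionState = { sourceUrl };
  for (const agent of agents) {
    state = await agent.run(state);
  }
  return state;
}
```

Because every agent shares one state shape, individual agents can be swapped or re-run independently, which is what makes the reusable design below possible.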

Reusable Design

The agent structure supports reprocessing without full re-extraction:

  • Raw extraction artifacts stored in R2
  • Organizations can refine products using stored artifacts
  • Reduces redundant network requests
  • Enables quality improvements over time
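A sketch of the artifact-storage idea, assuming a key layout of `scrapes/<org>/<job>/<encoded-url>.html` (the layout is a guess for illustration):

```typescript
// Deterministic key scheme so refinement can locate a job's raw artifacts
// without re-crawling the site.
function artifactKey(orgId: string, jobId: string, pageUrl: string): string {
  const encoded = encodeURIComponent(pageUrl); // make the URL safe inside an object key
  return `scrapes/${orgId}/${jobId}/${encoded}.html`;
}

// Minimal slice of the R2 binding used during refinement.
interface R2Like {
  get(key: string): Promise<{ text(): Promise<string> } | null>;
}

// Re-run extraction against a stored artifact instead of the live site.
async function loadArtifact(bucket: R2Like, key: string): Promise<string | null> {
  const obj = await bucket.get(key);
  return obj ? obj.text() : null;
}
```

A deterministic key means the refinement flow only needs the org and job IDs to find every stored page.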

Scraping Process

Job Lifecycle

Rate Limiting

The ai-scraping-service implements responsible scraping:

  • Respects robots.txt
  • Configurable request delays
  • Domain-based rate limits
  • Automatic backoff on errors
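The delay and backoff rules might be sketched like this; the base delays and cap are illustrative defaults, not the service's configured values:

```typescript
// Exponential backoff with a cap; jitter omitted here for determinism.
function backoffMs(attempt: number, baseMs = 1000, maxMs = 60_000): number {
  return Math.min(baseMs * 2 ** attempt, maxMs);
}

// Per-domain limiter: tracks the earliest time each domain may be hit again.
class DomainRateLimiter {
  private nextAllowed = new Map<string, number>();
  constructor(private minDelayMs: number) {}

  // Returns how long the caller should wait before requesting this URL.
  waitTimeMs(url: string, now: number): number {
    const domain = new URL(url).hostname;
    const next = this.nextAllowed.get(domain) ?? 0;
    const wait = Math.max(0, next - now);
    this.nextAllowed.set(domain, Math.max(next, now) + this.minDelayMs);
    return wait;
  }
}
```

Keying the limiter by hostname means two pages on the same site share one budget, while different sites can be crawled in parallel.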

Product Refinement

After scraping completes, users can review and refine results under Dashboard > Settings > Product Management:

Refinement Interface

  • View Extracted Products: See all products with extraction confidence scores
  • Edit Fields: Modify name, description, price, images
  • Select and Refine: Choose specific products or rows to edit
  • Bulk Operations: Edit multiple products at once
  • Apply Changes: Changes reflected in future analytics

Quality Metrics

  • Extraction confidence scores (AI certainty)
  • Field completeness percentage
  • Duplicate detection flags
  • Validation errors highlighted
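Field completeness could be computed along these lines (the expected-field list is an assumption):

```typescript
// Completeness: fraction of expected fields that are non-empty on a record.
function fieldCompleteness(
  product: Record<string, unknown>,
  expected: string[],
): number {
  const filled = expected.filter((f) => {
    const v = product[f];
    return v !== null && v !== undefined && v !== "";
  }).length;
  return expected.length === 0 ? 1 : filled / expected.length;
}
```

Surfacing this score per product lets the refinement UI sort the least-complete records to the top for review.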

Current Limitations

What You Can Do:

  • Upload CSV during initial setup
  • Provide URL for scraping during setup
  • Refine existing product data (edit fields, update information)
  • Manual additions and edits anytime

What You Cannot Do (Currently):

  • Re-scrape entire website after initial setup
  • Scrape a different URL after initial setup
  • Automated re-scraping on schedule

Rationale: Re-scraping requires product version management, which is complex. To preserve data integrity and audit trails, the system currently restricts re-scraping; full version tracking is planned for a future release.

Future Enhancements

Planned features for product management:

  • Re-scrape Functionality: Re-scrape entire website with version tracking
  • Separate URL Scraping: Scrape additional URLs after setup
  • Version History: View and compare product data across versions
  • Automated Updates: Schedule regular product updates
  • Change Detection: Alert on product changes
  • Product Relationships: Map related products and categories

Storage

D1 Schema

CREATE TABLE products (
  id TEXT PRIMARY KEY,
  org_id TEXT NOT NULL REFERENCES organizations(id),
  name TEXT NOT NULL,
  description TEXT,
  url TEXT,
  price REAL,
  currency TEXT,
  image_url TEXT,
  metadata JSON,
  source TEXT,
  confidence REAL,
  created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE scrape_jobs (
  id TEXT PRIMARY KEY,
  org_id TEXT NOT NULL REFERENCES organizations(id),
  url TEXT NOT NULL,
  status TEXT DEFAULT 'queued',
  products_found INTEGER DEFAULT 0,
  error TEXT,
  started_at DATETIME,
  completed_at DATETIME
);
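A hedged sketch of a parameterized insert against the products table above. The `D1Like` interface mirrors only the slice of the D1 binding used here:

```typescript
// Minimal slice of the D1 binding used in this sketch.
interface D1Like {
  prepare(sql: string): { bind(...values: unknown[]): { run(): Promise<unknown> } };
}

interface NewProduct {
  id: string;
  orgId: string;
  name: string;
  description?: string;
  price?: number;
  source?: string;
  confidence?: number;
}

// Parameterized insert; optional fields bind as NULL when absent.
function insertProduct(db: D1Like, p: NewProduct) {
  return db
    .prepare(
      `INSERT INTO products (id, org_id, name, description, price, source, confidence)
       VALUES (?, ?, ?, ?, ?, ?, ?)`,
    )
    .bind(
      p.id, p.orgId, p.name,
      p.description ?? null, p.price ?? null,
      p.source ?? null, p.confidence ?? null,
    )
    .run();
}
```

Binding values instead of interpolating them keeps scraped strings from breaking or injecting into the SQL.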

Vectorize Index

Product embeddings stored for semantic search:

  • Index: product-embeddings
  • Dimensions: 1536 (OpenAI ada-002)
  • Metadata: org_id, category, source
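An upsert against the index might look like this sketch; `VectorizeLike` is a minimal stand-in for the real binding, and the dimension guard reflects the 1536-dimension note above:

```typescript
// Minimal slice of the Vectorize binding used in this sketch.
interface VectorizeLike {
  upsert(
    vectors: { id: string; values: number[]; metadata?: Record<string, string> }[],
  ): Promise<unknown>;
}

// Guard the dimension before upserting so a mismatched embedding
// model fails loudly instead of corrupting search results.
async function upsertProductEmbedding(
  index: VectorizeLike,
  productId: string,
  embedding: number[],
  metadata: Record<string, string>,
  dimensions = 1536,
): Promise<void> {
  if (embedding.length !== dimensions) {
    throw new Error(`expected ${dimensions}-dim embedding, got ${embedding.length}`);
  }
  await index.upsert([{ id: productId, values: embedding, metadata }]);
}
```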

Error Handling

Common Errors

| Error | Cause | Resolution |
| --- | --- | --- |
| Site Unreachable | Network/DNS issues | Retry with backoff |
| Blocked | Anti-bot measures | Manual review |
| Parse Error | Unusual HTML structure | AI fallback |
| Rate Limited | Too many requests | Increase delays |
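The error table above could map to a concrete retry policy like this sketch; the retry counts and delays are illustrative, not the service's configured values:

```typescript
type ScrapeErrorKind = "unreachable" | "blocked" | "parse_error" | "rate_limited";

interface ErrorAction {
  retry: boolean;
  delayMs: number;   // 0 when no automated retry applies
  escalate: boolean; // route to manual review or AI fallback
}

// One action per error kind; exponential delays for transient failures,
// escalation for failures that automation cannot fix.
function resolveScrapeError(kind: ScrapeErrorKind, attempt: number): ErrorAction {
  switch (kind) {
    case "unreachable":
      return { retry: attempt < 3, delayMs: 1000 * 2 ** attempt, escalate: attempt >= 3 };
    case "rate_limited":
      return { retry: true, delayMs: 5000 * 2 ** attempt, escalate: false };
    case "parse_error":
      return { retry: false, delayMs: 0, escalate: true }; // hand off to AI fallback
    case "blocked":
      return { retry: false, delayMs: 0, escalate: true }; // manual review
  }
}
```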

Recovery

  • Partial results saved on failure
  • Jobs can be resumed
  • Manual scraping option for problematic sites