Product Scraping

Overview

CROW's product onboarding system lets organizations quickly populate their product catalog through two methods: CSV import and AI-powered web scraping. The ai-scraping-service handles the scraping path, automatically extracting product information from company websites.

Product Onboarding Flow

When users create their organization, they can add products through two methods:

1. CSV Import

Direct bulk upload of product data

  • Upload a CSV file with product information
  • Supports standard product fields (name, description, SKU, price, etc.)
  • Immediate import with validation
  • Files stored in R2, metadata in D1

Use Cases:

  • Migrating from existing systems
  • Bulk product catalog updates
  • Standardized product data
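
A minimal sketch of the CSV path in a Worker, assuming an R2 binding named BUCKET and a D1 binding named DB (the binding names, route handling, and validation are illustrative, not taken from the actual implementation; types come from @cloudflare/workers-types):

// Illustrative only: binding names, org id, and parsing are placeholders.
interface Env {
  BUCKET: R2Bucket;  // raw CSV files
  DB: D1Database;    // product metadata
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    if (request.method !== "POST") {
      return new Response("Method not allowed", { status: 405 });
    }

    const csv = await request.text();
    const key = `imports/${crypto.randomUUID()}.csv`;

    // Keep the original file in R2 for auditing and re-import.
    await env.BUCKET.put(key, csv);

    // Naive parse: header row plus comma-separated fields. A real import
    // would use a proper CSV parser and validate every row before inserting.
    const [header, ...rows] = csv.trim().split("\n").map((line) => line.split(","));
    const nameIdx = header.indexOf("name");
    if (nameIdx === -1) {
      return new Response("Missing 'name' column", { status: 400 });
    }

    for (const row of rows) {
      await env.DB
        .prepare("INSERT INTO products (id, org_id, name, source) VALUES (?1, ?2, ?3, 'csv')")
        .bind(crypto.randomUUID(), "org_123" /* placeholder org id */, row[nameIdx])
        .run();
    }

    return Response.json({ imported: rows.length, file: key });
  },
};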

2. Web Scraping

Automated extraction via ai-scraping-service

  • Provide company website URL
  • AI automatically discovers and extracts product information
  • Background processing via Cloudflare Workflows
  • Refinement interface for quality assurance

Use Cases:

  • New organization setup
  • Automated product discovery
  • Regular product catalog updates
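
How a scrape is kicked off is not shown in this section; the following is only a hedged sketch of the kind of request involved (the endpoint URL, payload, and response fields are assumptions, not the service's documented contract):

// Hypothetical request to start a scrape job; path and response shape are
// assumptions for illustration only.
const res = await fetch("https://ai-scraping-service.example.com/jobs", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ orgId: "org_123", url: "https://acme-widgets.example.com" }),
});
const { jobId, status } = (await res.json()) as { jobId: string; status: string };
// e.g. { jobId: "job_abc", status: "queued" }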

ai-scraping-service

The ai-scraping-service is a dedicated service that uses AI to crawl websites and extract structured product information.

Capabilities

  • Multi-page Crawling: Navigate complex website structures
  • Dynamic Content: Handle JavaScript-rendered pages
  • Product Extraction: Identify and extract product information
  • Image Analysis: OCR and visual product recognition
  • Data Normalization: Standardize extracted data

Architecture

Workflow Steps

  1. URL Validation: Verify target website accessibility
  2. Site Analysis: Identify product page patterns
  3. Page Crawling: Navigate and extract content
  4. AI Extraction: Use LLMs to identify products
  5. Data Normalization: Standardize product format
  6. Storage: Save to D1 and Vectorize
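
A condensed sketch of how these steps could map onto a Cloudflare Workflow; the helper functions, parameter shape, and Env bindings are placeholders rather than the service's actual code:

import { WorkflowEntrypoint, WorkflowEvent, WorkflowStep } from "cloudflare:workers";

type Env = Record<string, unknown>;            // placeholder bindings (D1, Vectorize, ...)
type Params = { orgId: string; url: string };  // payload passed when the job is created

// Placeholder helpers standing in for the real implementation.
declare function validateUrl(url: string): Promise<boolean>;
declare function analyzeSite(url: string): Promise<string[]>;
declare function crawlPages(pages: string[]): Promise<string[]>;
declare function extractProducts(html: string[]): Promise<unknown[]>;
declare function normalize(raw: unknown[]): Promise<unknown[]>;
declare function saveProducts(env: Env, orgId: string, products: unknown[]): Promise<number>;

export class ScrapeWorkflow extends WorkflowEntrypoint<Env, Params> {
  async run(event: WorkflowEvent<Params>, step: WorkflowStep) {
    const { orgId, url } = event.payload;

    await step.do("validate-url", () => validateUrl(url));                    // 1. URL Validation
    const pages = await step.do("site-analysis", () => analyzeSite(url));     // 2. Site Analysis
    const html = await step.do("crawl-pages", () => crawlPages(pages));       // 3. Page Crawling
    const raw = await step.do("ai-extraction", () => extractProducts(html));  // 4. AI Extraction
    const products = await step.do("normalize", () => normalize(raw));        // 5. Data Normalization
    await step.do("store", () => saveProducts(this.env, orgId, products));    // 6. Storage: D1 + Vectorize
  }
}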

Scraping Process

Job Lifecycle
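
The lifecycle diagram is not reproduced here. Based on the scrape_jobs schema later on this page (status defaults to 'queued', with started_at, completed_at, and error columns), a plausible set of states might be:

// Assumed status values inferred from the scrape_jobs columns; the actual
// strings used by the service may differ.
type ScrapeJobStatus = "queued" | "running" | "completed" | "failed";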

Rate Limiting

The ai-scraping-service implements responsible scraping:

  • Respects robots.txt
  • Configurable request delays
  • Domain-based rate limits
  • Automatic backoff on errors
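
A simplified sketch of per-domain delays with exponential backoff (the concrete limits, robots.txt handling, and any persistent storage are not specified here, so this in-memory version is only illustrative):

// Illustrative in-memory limiter: waits between requests to the same domain
// and doubles the delay after a 429 or 5xx response, up to maxDelayMs.
// robots.txt checking is omitted for brevity.
const domainDelay = new Map<string, { nextAt: number; delayMs: number }>();

async function politeFetch(url: string, baseDelayMs = 1_000, maxDelayMs = 60_000): Promise<Response> {
  const domain = new URL(url).hostname;
  const state = domainDelay.get(domain) ?? { nextAt: 0, delayMs: baseDelayMs };

  // Wait out the per-domain delay before sending the next request.
  const wait = state.nextAt - Date.now();
  if (wait > 0) await new Promise((resolve) => setTimeout(resolve, wait));

  const res = await fetch(url);

  // Exponential backoff on throttling or server errors; reset on success.
  state.delayMs = res.status === 429 || res.status >= 500
    ? Math.min(state.delayMs * 2, maxDelayMs)
    : baseDelayMs;
  state.nextAt = Date.now() + state.delayMs;
  domainDelay.set(domain, state);

  return res;
}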

Product Refinement

After scraping completes, users can review and refine results:

Refinement Interface

  • View extracted products
  • Edit incorrect fields
  • Add missing information
  • Approve or reject items
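
A hedged sketch of applying a user's correction from the refinement interface, using only columns from the products schema below (the function shape and the DB binding name are assumptions):

// Sketch: persist a field-level correction made during refinement.
interface Corrections { name?: string; description?: string; price?: number; }

async function updateProduct(env: { DB: D1Database }, productId: string, fix: Corrections) {
  await env.DB
    .prepare(
      "UPDATE products SET " +
      "name = COALESCE(?1, name), description = COALESCE(?2, description), price = COALESCE(?3, price) " +
      "WHERE id = ?4"
    )
    .bind(fix.name ?? null, fix.description ?? null, fix.price ?? null, productId)
    .run();
}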

Quality Metrics

  • Extraction confidence scores
  • Field completeness percentage
  • Duplicate detection flags
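
Field completeness, for example, can be computed directly from the extracted record; a small sketch follows (which fields the service actually counts is an assumption):

// Share of key fields that are non-empty for a scraped product (illustrative;
// the set of fields counted toward completeness is an assumption).
function fieldCompleteness(product: Record<string, unknown>): number {
  const fields = ["name", "description", "url", "price", "currency", "image_url"];
  const filled = fields.filter(
    (f) => product[f] !== null && product[f] !== undefined && product[f] !== ""
  ).length;
  return Math.round((filled / fields.length) * 100);
}

fieldCompleteness({ name: "Widget", price: 9.99, currency: "USD" }); // => 50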

Storage

D1 Schema

CREATE TABLE products (
  id TEXT PRIMARY KEY,
  org_id TEXT NOT NULL REFERENCES organizations(id),
  name TEXT NOT NULL,
  description TEXT,
  url TEXT,
  price REAL,
  currency TEXT,
  image_url TEXT,
  metadata JSON,
  source TEXT,
  confidence REAL,
  created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE scrape_jobs (
  id TEXT PRIMARY KEY,
  org_id TEXT NOT NULL REFERENCES organizations(id),
  url TEXT NOT NULL,
  status TEXT DEFAULT 'queued',
  products_found INTEGER DEFAULT 0,
  error TEXT,
  started_at DATETIME,
  completed_at DATETIME
);
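
Reads and writes go through the D1 binding; for example, listing an organization's scraped products (the DB binding name and the source value 'scrape' are assumptions):

// Sketch: list an organization's scraped products via the D1 binding.
async function listScrapedProducts(env: { DB: D1Database }, orgId: string) {
  const { results } = await env.DB
    .prepare(
      "SELECT id, name, price, confidence FROM products " +
      "WHERE org_id = ?1 AND source = 'scrape' ORDER BY created_at DESC LIMIT 50"
    )
    .bind(orgId)
    .all();
  return results;
}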

Vectorize Index

Product embeddings stored for semantic search:

  • Index: product-embeddings
  • Dimensions: 1536 (OpenAI ada-002)
  • Metadata: org_id, category, source
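
Writing to and querying that index could look roughly like this, assuming a Vectorize binding named PRODUCT_INDEX, a metadata index on org_id, and an embed() helper that returns a 1536-dimension ada-002 vector (all of which are assumptions):

// Sketch only: the binding name and embed() helper are placeholders.
declare function embed(text: string): Promise<number[]>; // 1536-dim embedding

async function indexProduct(
  env: { PRODUCT_INDEX: VectorizeIndex },
  product: { id: string; org_id: string; name: string; description?: string }
) {
  const values = await embed(`${product.name}\n${product.description ?? ""}`);
  await env.PRODUCT_INDEX.upsert([
    { id: product.id, values, metadata: { org_id: product.org_id, source: "scrape" } },
  ]);
}

async function searchProducts(env: { PRODUCT_INDEX: VectorizeIndex }, orgId: string, query: string) {
  const vector = await embed(query);
  // Filter by org so results stay within the caller's catalog
  // (assumes a metadata index exists on org_id).
  return env.PRODUCT_INDEX.query(vector, { topK: 10, filter: { org_id: orgId } });
}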

Error Handling

Common Errors

Error              Cause                    Resolution
Site Unreachable   Network/DNS issues       Retry with backoff
Blocked            Anti-bot measures        Manual review
Parse Error        Unusual HTML structure   AI fallback
Rate Limited       Too many requests        Increase delays

Recovery

  • Partial results saved on failure
  • Jobs can be resumed
  • Manual scraping option for problematic sites