Product Scraping
Overview
CROW's product onboarding system lets organizations populate their product catalog quickly, either by importing a CSV file or by scraping their website. The ai-scraping-service handles the scraping path: it uses AI to extract product information automatically from company websites.
Product Onboarding Flow
When users create their organization, they can add products through two methods:
1. CSV Import
Direct bulk upload of product data
- Upload a CSV file with product information
- Supports standard product fields (name, description, SKU, price, etc.)
- Immediate import with validation
- Files stored in R2, metadata in D1
Use Cases:
- Migrating from existing systems
- Bulk product catalog updates
- Standardized product data
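As a rough illustration of the "import with validation" step, the sketch below checks parsed CSV rows before anything is written. The `CsvRow` shape, the `validateRow` and `partitionRows` helpers, and the specific rules (a required name, a non-negative numeric price) are assumptions made for illustration, not CROW's actual validation logic.

```typescript
// Hypothetical shape of one parsed CSV row: every cell arrives as a string.
type CsvRow = Record<string, string | undefined>;

interface ValidationResult {
  valid: boolean;
  errors: string[];
}

// Validate a single row. The rules below are illustrative assumptions.
function validateRow(row: CsvRow): ValidationResult {
  const errors: string[] = [];

  const name = (row["name"] ?? "").trim();
  if (name === "") {
    errors.push("name is required");
  }

  // Price is optional, but if present it must be a non-negative number.
  const rawPrice = (row["price"] ?? "").trim();
  if (rawPrice !== "") {
    const price = Number(rawPrice);
    if (!Number.isFinite(price) || price < 0) {
      errors.push(`price is not a non-negative number: "${rawPrice}"`);
    }
  }

  return { valid: errors.length === 0, errors };
}

// Split rows into importable and rejected sets, keeping 1-based row numbers
// so that rejected rows can be reported back to the user.
function partitionRows(rows: CsvRow[]) {
  const accepted: CsvRow[] = [];
  const rejected: { rowNumber: number; errors: string[] }[] = [];
  rows.forEach((row, index) => {
    const result = validateRow(row);
    if (result.valid) {
      accepted.push(row);
    } else {
      rejected.push({ rowNumber: index + 1, errors: result.errors });
    }
  });
  return { accepted, rejected };
}
```

Collecting every error per row, rather than stopping at the first, lets a user fix a file in one pass.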
2. Web Scraping
Automated extraction via ai-scraping-service
- Provide company website URL
- AI automatically discovers and extracts product information
- Background processing via Cloudflare Workflows
- Refinement interface for quality assurance
Use Cases:
- New organization setup
- Automated product discovery
- Regular product catalog updates
ai-scraping-service
The ai-scraping-service is a dedicated service that crawls a company's website and uses AI to extract product data from it.
Capabilities
- Multi-page Crawling: Navigate complex website structures
- Dynamic Content: Handle JavaScript-rendered pages
- Product Extraction: Identify and extract product information
- Image Analysis: OCR and visual product recognition
- Data Normalization: Standardize extracted data
Architecture
Workflow Steps
- URL Validation: Verify target website accessibility
- Site Analysis: Identify product page patterns
- Page Crawling: Navigate and extract content
- AI Extraction: Use LLMs to identify products
- Data Normalization: Standardize product format
- Storage: Save to D1 and Vectorize
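The steps above run in order, each building on the output of the previous one. The sketch below shows that sequencing with plain async functions; it does not use the Cloudflare Workflows API. The `JobContext` and `Step` types, `runPipeline`, and all six stub steps are hypothetical stand-ins.

```typescript
// Accumulated state passed from step to step (assumed shape).
interface JobContext {
  url: string;
  productPaths: string[];
  pages: string[];
  products: { name: string }[];
}

interface Step {
  name: string;
  run: (ctx: JobContext) => Promise<JobContext>;
}

interface PipelineResult {
  completed: string[];   // names of the steps that finished
  failedStep?: string;   // set only when a step threw
  error?: string;
  context: JobContext;   // last good context, i.e. the partial results
}

// Run the steps in order and stop at the first failure.
async function runPipeline(steps: Step[], initial: JobContext): Promise<PipelineResult> {
  let context = initial;
  const completed: string[] = [];
  for (const step of steps) {
    try {
      context = await step.run(context);
      completed.push(step.name);
    } catch (err) {
      const error = err instanceof Error ? err.message : String(err);
      return { completed, failedStep: step.name, error, context };
    }
  }
  return { completed, context };
}

// Stub steps mirroring the list above; real steps would fetch pages and call an LLM.
const steps: Step[] = [
  { name: "url-validation", run: async (ctx) => { new URL(ctx.url); return ctx; } },
  { name: "site-analysis", run: async (ctx) => ({ ...ctx, productPaths: ["/products"] }) },
  { name: "page-crawling", run: async (ctx) => ({
      ...ctx,
      pages: ctx.productPaths.map((path) => new URL(path, ctx.url).href),
    }) },
  { name: "ai-extraction", run: async (ctx) => ({
      ...ctx,
      products: ctx.pages.map((page) => ({ name: ` Product from ${page} ` })),
    }) },
  { name: "data-normalization", run: async (ctx) => ({
      ...ctx,
      products: ctx.products.map((p) => ({ name: p.name.trim() })),
    }) },
  { name: "storage", run: async (ctx) => ctx }, // a real step would write to D1 and Vectorize
];
```

Returning the last good context on failure keeps whatever was extracted before the failing step.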
Scraping Process
Job Lifecycle
Rate Limiting
The ai-scraping-service implements responsible scraping:
- Respects robots.txt
- Configurable request delays
- Domain-based rate limits
- Automatic backoff on errors
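Two of these behaviors, per-domain request delays and automatic backoff, can be sketched as follows. The names `backoffDelayMs` and `DomainThrottle`, the 1-second base delay, and the 60-second cap are assumptions chosen for illustration; the service's real defaults are not documented here.

```typescript
// Exponential backoff: the delay doubles with each consecutive failure,
// capped at maxMs. `attempt` is 0 for the first retry.
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 60000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Tracks, per domain, the earliest time the next request may be sent.
class DomainThrottle {
  private readonly delayMs: number;
  private readonly nextAllowed = new Map<string, number>();

  constructor(delayMs: number) {
    this.delayMs = delayMs;
  }

  // Returns how many milliseconds the caller should wait before requesting
  // `url` at time `now`, and reserves that slot for the domain.
  reserve(url: string, now: number): number {
    const domain = new URL(url).hostname;
    const earliest = this.nextAllowed.get(domain) ?? now;
    const startAt = Math.max(now, earliest);
    this.nextAllowed.set(domain, startAt + this.delayMs);
    return startAt - now;
  }
}
```

Passing `now` in explicitly, instead of reading the clock inside `reserve`, keeps the throttle deterministic and easy to test.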
Product Refinement
After scraping completes, users can review and refine the results.
Refinement Interface
- View extracted products
- Edit incorrect fields
- Add missing information
- Approve or reject items
Quality Metrics
- Extraction confidence scores
- Field completeness percentage
- Duplicate detection flags
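The completeness and duplicate metrics could be computed along these lines. This is a minimal sketch: the choice of tracked fields (taken from the products table columns), the rounding, and name-based duplicate matching are all assumptions, and `completenessPercent` and `flagDuplicates` are hypothetical helper names.

```typescript
// Subset of product fields considered when scoring completeness (assumed).
interface ExtractedProduct {
  name: string;
  description?: string | null;
  url?: string | null;
  price?: number | null;
  currency?: string | null;
  image_url?: string | null;
}

const TRACKED_FIELDS = ["name", "description", "url", "price", "currency", "image_url"] as const;

// Percentage of tracked fields that hold a non-empty value, rounded to an integer.
function completenessPercent(product: ExtractedProduct): number {
  const filled = TRACKED_FIELDS.filter((field) => {
    const value = product[field];
    return value !== undefined && value !== null && value !== "";
  }).length;
  return Math.round((filled / TRACKED_FIELDS.length) * 100);
}

// Flags every product whose normalized name was already seen earlier in the list.
// The first occurrence is never flagged.
function flagDuplicates(products: ExtractedProduct[]): boolean[] {
  const seen = new Set<string>();
  return products.map((product) => {
    const key = product.name.trim().toLowerCase().replace(/\s+/g, " ");
    const isDuplicate = seen.has(key);
    seen.add(key);
    return isDuplicate;
  });
}
```

Name matching is the simplest possible heuristic; comparing URLs or embeddings would catch duplicates that differ in wording.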
Storage
D1 Schema
CREATE TABLE products (
  id TEXT PRIMARY KEY,
  org_id TEXT NOT NULL REFERENCES organizations(id),
  name TEXT NOT NULL,
  description TEXT,
  url TEXT,
  price REAL,
  currency TEXT,
  image_url TEXT,
  metadata JSON,
  source TEXT,
  confidence REAL,
  created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE scrape_jobs (
  id TEXT PRIMARY KEY,
  org_id TEXT NOT NULL REFERENCES organizations(id),
  url TEXT NOT NULL,
  status TEXT DEFAULT 'queued',
  products_found INTEGER DEFAULT 0,
  error TEXT,
  started_at DATETIME,
  completed_at DATETIME
);
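To show how a `scrape_jobs` row might move through its `status` values, here is a small transition guard. Only `'queued'` appears in the schema; the other statuses (`running`, `completed`, `failed`), the allowed transitions, and the `transition` helper are assumptions made for illustration.

```typescript
// Only "queued" is confirmed by the schema default; the rest are assumed.
type JobStatus = "queued" | "running" | "completed" | "failed";

// Mirrors the scrape_jobs columns, with timestamps as ISO strings.
interface ScrapeJob {
  id: string;
  org_id: string;
  url: string;
  status: JobStatus;
  products_found: number;
  error: string | null;
  started_at: string | null;
  completed_at: string | null;
}

// Assumed lifecycle: queued -> running -> completed | failed, and failed -> queued on resume.
const ALLOWED_TRANSITIONS: Record<JobStatus, JobStatus[]> = {
  queued: ["running"],
  running: ["completed", "failed"],
  failed: ["queued"],
  completed: [],
};

// Returns an updated copy of the job, or throws if the transition is not allowed.
function transition(job: ScrapeJob, next: JobStatus, nowIso: string, error?: string): ScrapeJob {
  if (!ALLOWED_TRANSITIONS[job.status].includes(next)) {
    throw new Error(`illegal transition: ${job.status} -> ${next}`);
  }
  const finished = next === "completed" || next === "failed";
  return {
    ...job,
    status: next,
    error: next === "failed" ? (error ?? "unknown error") : null,
    started_at: next === "running" ? nowIso : job.started_at,
    completed_at: finished ? nowIso : null,
  };
}
```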
Vectorize Index
Product embeddings are stored for semantic search:
- Index: product-embeddings
- Dimensions: 1536 (OpenAI ada-002)
- Metadata: org_id, category, source
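A record for this index could be assembled as in the sketch below, which checks the embedding length against the 1536 dimensions listed above before anything is sent to the index. The `ProductVector` type and `toProductVector` helper are hypothetical; the `id` / `values` / `metadata` layout is assumed to match what the index expects and should be checked against the Vectorize documentation.

```typescript
// Dimension count of the product-embeddings index, as listed above.
const EMBEDDING_DIMENSIONS = 1536;

// Metadata fields stored alongside each embedding.
interface ProductVectorMetadata {
  org_id: string;
  category: string;
  source: string;
}

// Assumed record layout for one entry in the index.
interface ProductVector {
  id: string;
  values: number[];
  metadata: ProductVectorMetadata;
}

// Builds a record for one product, rejecting embeddings of the wrong length
// so that a model mismatch is caught before the write.
function toProductVector(
  productId: string,
  embedding: number[],
  metadata: ProductVectorMetadata,
): ProductVector {
  if (embedding.length !== EMBEDDING_DIMENSIONS) {
    throw new Error(
      `expected ${EMBEDDING_DIMENSIONS} dimensions, got ${embedding.length}`,
    );
  }
  return { id: productId, values: embedding, metadata };
}
```

Keeping `org_id` in the metadata makes it possible to restrict a semantic search to a single organization's products.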
Error Handling
Common Errors
| Error | Cause | Resolution |
|---|---|---|
| Site Unreachable | Network/DNS issues | Retry with backoff |
| Blocked | Anti-bot measures | Manual review |
| Parse Error | Unusual HTML structure | AI fallback |
| Rate Limited | Too many requests | Increase delays |
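The table above can be read as a lookup from error category to resolution. The sketch below encodes it that way and adds a simple classifier; the `FetchOutcome` shape and the mapping from HTTP status codes to categories (429 to Rate Limited, 403 to Blocked) are assumptions, since the service's actual detection rules are not described here.

```typescript
type ScrapeError = "Site Unreachable" | "Blocked" | "Parse Error" | "Rate Limited";

// Resolutions copied from the table above.
const RESOLUTION: Record<ScrapeError, string> = {
  "Site Unreachable": "Retry with backoff",
  "Blocked": "Manual review",
  "Parse Error": "AI fallback",
  "Rate Limited": "Increase delays",
};

// Hypothetical summary of what happened when a page was fetched and parsed.
interface FetchOutcome {
  networkError?: boolean; // DNS failure, timeout, connection refused
  status?: number;        // HTTP status code, if a response arrived
  parsedOk?: boolean;     // false if the HTML could not be parsed
}

// Maps an outcome to an error category, or null if nothing went wrong.
// The status-code rules are illustrative assumptions.
function classify(outcome: FetchOutcome): ScrapeError | null {
  if (outcome.networkError) return "Site Unreachable";
  if (outcome.status === 429) return "Rate Limited";
  if (outcome.status === 403) return "Blocked";
  if (outcome.parsedOk === false) return "Parse Error";
  return null;
}

// Convenience wrapper: the suggested resolution, or null for a clean outcome.
function resolutionFor(outcome: FetchOutcome): string | null {
  const category = classify(outcome);
  return category === null ? null : RESOLUTION[category];
}
```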
Recovery
- Partial results saved on failure
- Jobs can be resumed
- Manual scraping option for problematic sites
Related Documentation
- User Signup Flow - Product onboarding in signup
- System Architecture - Overall architecture
- Data Storage - Where products are stored