Product Scraping
Overview
CROW's product onboarding system lets organizations populate their product catalog quickly through multiple methods. The ai-scraping-service uses AI-powered web scraping to extract product information from company websites automatically.
Product Onboarding Flow
When users create their organization, they can add products through two methods:
1. CSV Import
Direct bulk upload of product data
- Upload a CSV file with product information
- Supports standard product fields (name, description, SKU, price, etc.)
- Immediate import with validation
- Files stored in R2, metadata in D1
Use Cases:
- Migrating from existing systems
- Bulk product catalog updates
- Standardized product data
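The per-row validation step can be sketched as below. This is a hypothetical helper, not the service's actual implementation; the column names follow the standard fields mentioned above (name, SKU, price).

```typescript
// Hypothetical sketch: validate one parsed CSV row before import.
type CsvRow = Record<string, string>;

interface RowResult {
  ok: boolean;
  errors: string[];
}

function validateProductRow(row: CsvRow): RowResult {
  const errors: string[] = [];
  // name is the only field assumed mandatory here
  if (!row.name?.trim()) errors.push("name is required");
  // price, when present, must parse as a number
  if (row.price !== undefined && row.price !== "" && Number.isNaN(Number(row.price))) {
    errors.push(`price "${row.price}" is not numeric`);
  }
  // SKU charset restriction is an illustrative assumption
  if (row.sku && !/^[A-Za-z0-9_-]+$/.test(row.sku)) {
    errors.push(`sku "${row.sku}" contains invalid characters`);
  }
  return { ok: errors.length === 0, errors };
}
```

Rows that fail validation would be reported back to the user rather than silently dropped, so the immediate import stays auditable.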
2. Web Scraping
Automated extraction via the Product Service
- Provide company website URL
- AI automatically discovers and extracts product information
- Background processing via Cloudflare Queues
- Uses Cloudflare Browser Rendering for dynamic content
- Analyzes sitemap.xml for context
- Refinement interface for quality assurance
Use Cases:
- New organization setup
- Automated product discovery
- Initial product catalog population
Technology
- Cloudflare Browser Rendering for JavaScript-rendered pages
- Gemini AI via Vercel AI SDK for product extraction
- Tavily for AI-optimized web search and lightweight content extraction
- Sitemap.xml analysis for site structure understanding
- Queue-based processing for scalability
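The sitemap.xml analysis step can be sketched as follows. The `<loc>` extraction is standard sitemap structure; the product-path heuristic is an illustrative assumption, not the service's documented logic.

```typescript
// Minimal sketch: pull <loc> entries out of a sitemap.xml body and
// flag likely product pages for the crawler to prioritize.
function extractSitemapUrls(xml: string): string[] {
  const urls: string[] = [];
  const re = /<loc>\s*([^<]+?)\s*<\/loc>/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(xml)) !== null) urls.push(m[1]);
  return urls;
}

// Assumed heuristic: common catalog path segments hint at product pages.
function looksLikeProductPage(url: string): boolean {
  return /\/(products?|shop|items?)\//i.test(url);
}
```

A real crawl would still fetch non-matching pages at lower priority, since not every site uses conventional URL structures.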
Product Service
The Product Service handles all product-related operations including catalog management and web scraping.
Capabilities
- Product CRUD: Create, read, update, delete products
- Multi-page Crawling: Navigate complex website structures
- Dynamic Content: Handle JavaScript-rendered pages via Browser Rendering
- Product Extraction: Identify and extract product information using Gemini AI
- Sitemap Analysis: Use sitemap.xml for context and site structure
- Data Normalization: Standardize extracted data
- Vectorize Integration: Maintain product embeddings for search
Technology Stack
- TypeScript on Cloudflare Workers
- Cloudflare Browser Rendering for web scraping
- Gemini AI via Vercel AI SDK for extraction
- Cloudflare AI Gateway for LLM request routing
- Vectorize for product embeddings
Architecture
Workflow Steps
1. URL Validation: Verify target website accessibility
2. Sitemap Analysis: Parse sitemap.xml for site structure and context
3. Page Crawling: Navigate and extract content using Browser Rendering
4. AI Extraction: Use Gemini models to identify and extract products
5. Data Normalization: Standardize product format
6. Storage: Save to D1 and generate embeddings
7. Vectorize: Store embeddings for semantic search
8. User Refinement: Allow user to review and edit extracted data
Multi-Agent Extraction Pipeline
The Product Service uses a multi-agent orchestration system for intelligent product extraction. Each agent has a specialized role in the extraction workflow.
Agent Architecture
Agent Roles
| Agent | Role | Capability |
|---|---|---|
| Planner | Strategy definition | Analyzes target source and plans extraction steps |
| Discoverer | Content discovery | Finds relevant pages using web search from known links |
| Extractor | Data extraction | Uses Browser Rendering to extract structured fields |
| Validator | Quality assurance | Normalizes fields into expected schema, validates required attributes |
| Refiner | Enhancement | Improves records with extra context (cleaner descriptions, tags) |
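The Planner → Discoverer → Extractor → Validator → Refiner hand-off can be sketched as a typed pipeline. Agent internals (LLM calls, Browser Rendering) are stubbed out here; only the orchestration shape is shown, and the context fields are assumptions.

```typescript
// Shared context that each agent reads and enriches in turn.
interface AgentContext {
  sourceUrl: string;
  pages: string[];                    // discovered page URLs
  records: Record<string, unknown>[]; // extracted product records
}

type Agent = (ctx: AgentContext) => Promise<AgentContext>;

// Run agents sequentially, threading the context through each one.
async function runPipeline(agents: Agent[], initial: AgentContext): Promise<AgentContext> {
  let ctx = initial;
  for (const agent of agents) {
    ctx = await agent(ctx);
  }
  return ctx;
}
```

Because each agent only depends on the shared context, individual stages (e.g. the Refiner) can be re-run later from stored artifacts without repeating the whole pipeline.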
Reusable Design
The agent structure supports reprocessing without full re-extraction:
- Raw extraction artifacts stored in R2
- Organizations can refine products using stored artifacts
- Reduces redundant network requests
- Enables quality improvements over time
Scraping Process
Job Lifecycle
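A job's lifecycle can be sketched as a small state machine. Only `queued` appears in the D1 schema's default; the other states and allowed transitions here are assumptions for illustration.

```typescript
// Assumed job states; the schema only guarantees the "queued" default.
type JobStatus = "queued" | "running" | "completed" | "failed";

const TRANSITIONS: Record<JobStatus, JobStatus[]> = {
  queued: ["running", "failed"],
  running: ["completed", "failed"],
  completed: [],          // terminal
  failed: ["queued"],     // a failed job may be re-queued to resume
};

function canTransition(from: JobStatus, to: JobStatus): boolean {
  return TRANSITIONS[from].includes(to);
}
```

Guarding status updates this way keeps `started_at`/`completed_at` timestamps consistent with the recorded status.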
Rate Limiting
The ai-scraping-service implements responsible scraping:
- Respects robots.txt
- Configurable request delays
- Domain-based rate limits
- Automatic backoff on errors
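The domain-based limits and automatic backoff can be sketched together. The base delay, doubling factor, and cap below are illustrative defaults, not the service's real configuration.

```typescript
// Exponential backoff: double the base delay per consecutive error, capped.
function nextDelayMs(baseMs: number, consecutiveErrors: number, capMs = 60_000): number {
  return Math.min(baseMs * 2 ** consecutiveErrors, capMs);
}

// Track error streaks per domain so one slow site doesn't throttle others.
class DomainLimiter {
  private errors = new Map<string, number>();
  constructor(private baseMs: number) {}

  recordError(domain: string): void {
    this.errors.set(domain, (this.errors.get(domain) ?? 0) + 1);
  }

  recordSuccess(domain: string): void {
    this.errors.delete(domain); // a success resets the backoff
  }

  delayFor(domain: string): number {
    return nextDelayMs(this.baseMs, this.errors.get(domain) ?? 0);
  }
}
```

The robots.txt check would run separately, before any request is scheduled at all.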
Product Refinement
After scraping completes, users can review and refine results under Dashboard > Settings > Product Management:
Refinement Interface
- View Extracted Products: See all products with extraction confidence scores
- Edit Fields: Modify name, description, price, images
- Select and Refine: Choose specific products or rows to edit
- Bulk Operations: Edit multiple products at once
- Apply Changes: Changes reflected in future analytics
Quality Metrics
- Extraction confidence scores (AI certainty)
- Field completeness percentage
- Duplicate detection flags
- Validation errors highlighted
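The field-completeness percentage can be computed as the share of expected fields that are non-empty on an extracted product. This is a minimal sketch; the field list mirrors the editable fields named above (name, description, price, images) but is an assumption about which fields count.

```typescript
// Percentage of expected fields that hold a non-empty value.
function completeness(product: Record<string, unknown>, fields: string[]): number {
  const filled = fields.filter((f) => {
    const v = product[f];
    return v !== undefined && v !== null && String(v).trim() !== "";
  }).length;
  return Math.round((filled / fields.length) * 100);
}
```

Low-completeness products would surface first in the refinement interface, since they need the most manual attention.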
Current Limitations
What You Can Do:
- Upload CSV during initial setup
- Provide URL for scraping during setup
- Refine existing product data (edit fields, update information)
- Manual additions and edits anytime
What You Cannot Do (Currently):
- Re-scrape entire website after initial setup
- Scrape a different URL after initial setup
- Automated re-scraping on schedule
Rationale: Product version management is complex: preserving data integrity and audit trails would require keeping every version of every product. The system therefore restricts re-scraping for now; version tracking is planned for future releases.
Future Enhancements
Planned features for product management:
- Re-scrape Functionality: Re-scrape entire website with version tracking
- Separate URL Scraping: Scrape additional URLs after setup
- Version History: View and compare product data across versions
- Automated Updates: Schedule regular product updates
- Change Detection: Alert on product changes
- Product Relationships: Map related products and categories
Storage
D1 Schema
```sql
CREATE TABLE products (
  id TEXT PRIMARY KEY,
  org_id TEXT NOT NULL REFERENCES organizations(id),
  name TEXT NOT NULL,
  description TEXT,
  url TEXT,
  price REAL,
  currency TEXT,
  image_url TEXT,
  metadata JSON,
  source TEXT,
  confidence REAL,
  created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE scrape_jobs (
  id TEXT PRIMARY KEY,
  org_id TEXT NOT NULL REFERENCES organizations(id),
  url TEXT NOT NULL,
  status TEXT DEFAULT 'queued',
  products_found INTEGER DEFAULT 0,
  error TEXT,
  started_at DATETIME,
  completed_at DATETIME
);
```
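Writing a scraped product into the `products` table might look like the sketch below. `buildInsert` is a hypothetical pure helper; the commented Worker call assumes a D1 binding named `DB`, which is an assumption about this deployment.

```typescript
// Row shape matching the products table columns used here.
interface ProductRow {
  id: string;
  org_id: string;
  name: string;
  description?: string;
  url?: string;
  price?: number;
  source?: string;
  confidence?: number;
}

// Build a parameterized INSERT so values are always bound, never interpolated.
function buildInsert(p: ProductRow): { sql: string; bindings: unknown[] } {
  return {
    sql: "INSERT INTO products (id, org_id, name, description, url, price, source, confidence) VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    bindings: [
      p.id, p.org_id, p.name,
      p.description ?? null, p.url ?? null, p.price ?? null,
      p.source ?? null, p.confidence ?? null,
    ],
  };
}

// In a Worker (assuming a D1 binding named DB):
//   const { sql, bindings } = buildInsert(row);
//   await env.DB.prepare(sql).bind(...bindings).run();
```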
Vectorize Index
Product embeddings stored for semantic search:
- Index: product-embeddings
- Dimensions: 1536 (OpenAI ada-002)
- Metadata: org_id, category, source
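The record shape for the index can be sketched as follows. The embedding values are placeholders produced by a prior model call; `toVectorRecord` is a hypothetical helper, and the `env.VECTORIZE.upsert` call in the comment assumes a standard Vectorize binding name.

```typescript
// Vector record for the product-embeddings index, with the metadata
// fields listed above attached for filtered queries.
interface ProductVector {
  id: string;
  values: number[]; // 1536-dim embedding from ada-002
  metadata: { org_id: string; category?: string; source?: string };
}

function toVectorRecord(
  productId: string,
  embedding: number[],
  meta: ProductVector["metadata"],
): ProductVector {
  // Guard the dimension so a model mismatch fails loudly, not at query time.
  if (embedding.length !== 1536) {
    throw new Error(`expected 1536 dimensions, got ${embedding.length}`);
  }
  return { id: productId, values: embedding, metadata: meta };
}

// In a Worker (assuming a Vectorize binding named VECTORIZE):
//   await env.VECTORIZE.upsert([toVectorRecord(id, embedding, { org_id })]);
```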
Error Handling
Common Errors
| Error | Cause | Resolution |
|---|---|---|
| Site Unreachable | Network/DNS issues | Retry with backoff |
| Blocked | Anti-bot measures | Manual review |
| Parse Error | Unusual HTML structure | AI fallback |
| Rate Limited | Too many requests | Increase delays |
Recovery
- Partial results saved on failure
- Jobs can be resumed
- Manual scraping option for problematic sites
Related Documentation
- User Signup Flow - Product onboarding in signup
- System Architecture - Overall architecture
- Data Storage - Where products are stored