# Infra Crawl Service

## Overview

The `infra-crawl-service` is a Cloudflare Workers Container that runs headless Puppeteer to crawl and scrape product data from e-commerce websites. It powers the product catalog by extracting product names, descriptions, prices, images, categories, and availability from target storefronts.
## Architecture
- Runtime: Cloudflare Workers Containers (Python/Node.js with Puppeteer)
- Trigger: Called via the API gateway when a user initiates a re-scrape or when scheduled crawler jobs run
- Output: Scraped product data is sent to the product service for storage and vectorization
## Key Features
- Headless browser-based crawling (handles JavaScript-rendered pages)
- Product data extraction: name, price, currency, description, images, category, availability
- Configurable crawl targets per organization
- Re-scrape capability triggered from the dashboard Products page
- Rate limiting and polite crawling
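The extraction fields listed above can be illustrated with a small normalization sketch. Everything named here is an assumption for illustration: the `ScrapedProduct` shape and the `parsePrice` helper are hypothetical, and the real per-storefront selectors and schema are not documented in this README.

```typescript
// Illustrative shape of a scraped product record; field names mirror
// the feature list above, but the exact schema is an assumption.
interface ScrapedProduct {
  name: string;
  description: string;
  price: number;
  currency: string;
  images: string[];
  category: string;
  available: boolean;
}

// Minimal symbol-to-ISO-code map; real coverage would be far broader.
const CURRENCY_SYMBOLS: Record<string, string> = { "$": "USD", "€": "EUR", "£": "GBP" };

// Normalize a raw price string (e.g. "$1,299.99" or "EUR 49,00") into a
// numeric price plus a currency code. A sketch, not the service's parser.
function parsePrice(raw: string): { price: number; currency: string } {
  const trimmed = raw.trim();
  const symbol = Object.keys(CURRENCY_SYMBOLS).find((s) => trimmed.includes(s));
  const currency = symbol
    ? CURRENCY_SYMBOLS[symbol]
    : (trimmed.match(/[A-Z]{3}/)?.[0] ?? "USD");
  // Keep only digits and separators, then normalize the separators.
  let digits = trimmed.replace(/[^0-9.,]/g, "");
  // Treat a trailing ",dd" as a decimal comma (common in EU formats).
  if (/,\d{2}$/.test(digits) && !digits.includes(".")) {
    digits = digits.replace(/\./g, "").replace(",", ".");
  } else {
    digits = digits.replace(/,/g, "");
  }
  return { price: Number(digits), currency };
}
```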
## Endpoints

| Method | Path | Description |
|---|---|---|
| POST | `/api/v1/crawler-jobs` | Create a new crawl job for a target URL |
| GET | `/api/v1/crawler-jobs/:id` | Get crawl job status |
| GET | `/health` | Health check |
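A job-creation call can be sketched as a plain `fetch`-style request. The body fields (`targetUrl`, `orgId`) and the `X-Internal-Gateway-Key` header name are assumptions for illustration, not the gateway's documented contract.

```typescript
// Hypothetical request body for POST /api/v1/crawler-jobs.
interface CreateJobBody {
  targetUrl: string;
  orgId: string;
}

// Build (but do not send) the job-creation request so the shape is
// visible; pass the result to fetch() to actually create the job.
function buildCreateJobRequest(baseUrl: string, key: string, body: CreateJobBody): Request {
  return new Request(`${baseUrl}/api/v1/crawler-jobs`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "X-Internal-Gateway-Key": key, // hypothetical header name
    },
    body: JSON.stringify(body),
  });
}
```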
## Environment

| Binding | Type | Purpose |
|---|---|---|
| `INTERNAL_GATEWAY_KEY` | Secret | Auth for internal API calls |
| `GATEWAY_API_URL` | Var | API gateway URL for sending results |
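The two bindings above might be consumed as follows. This is a sketch under assumptions: the `/internal/products` results path and the `X-Internal-Gateway-Key` header name are illustrative, and only the outbound request is built here, nothing is sent.

```typescript
// Typed view of the two bindings documented above.
interface Env {
  INTERNAL_GATEWAY_KEY: string;
  GATEWAY_API_URL: string;
}

// Fail fast on startup if a required binding is missing.
function requireEnv(raw: Record<string, string | undefined>): Env {
  const key = raw.INTERNAL_GATEWAY_KEY;
  const url = raw.GATEWAY_API_URL;
  if (!key || !url) throw new Error("missing INTERNAL_GATEWAY_KEY or GATEWAY_API_URL");
  return { INTERNAL_GATEWAY_KEY: key, GATEWAY_API_URL: url };
}

// Build the authenticated request that would carry scraped results
// back to the gateway (the path and header are hypothetical).
function buildResultsRequest(env: Env, products: unknown[]): Request {
  return new Request(`${env.GATEWAY_API_URL}/internal/products`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "X-Internal-Gateway-Key": env.INTERNAL_GATEWAY_KEY,
    },
    body: JSON.stringify(products),
  });
}
```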
## Deployment

```shell
npx wrangler deploy --env dev
```

The service deploys as a Workers Container. Container images are built and pushed during the deploy process.
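For reference, a Workers Containers deployment is typically described in the project's wrangler config. The sketch below is illustrative only: the class name, binding name, and instance count are hypothetical, and the exact fields should be checked against Cloudflare's current Containers documentation for your wrangler version.

```jsonc
{
  "name": "infra-crawl-service",
  "containers": [
    {
      "class_name": "CrawlContainer", // hypothetical container class
      "image": "./Dockerfile",        // image built during deploy
      "max_instances": 2              // illustrative value
    }
  ],
  "durable_objects": {
    "bindings": [
      { "name": "CRAWL_CONTAINER", "class_name": "CrawlContainer" }
    ]
  },
  "migrations": [
    { "tag": "v1", "new_sqlite_classes": ["CrawlContainer"] }
  ]
}
```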
## Domain

- Dev: `dev.internal.infra-crawl.crowai.dev`