Infra Crawl Service

Overview

The infra-crawl-service is a Workers Container that runs headless Puppeteer to crawl and scrape product data from e-commerce websites. It powers the product catalog by extracting product names, descriptions, prices, images, categories, and availability from target storefronts.

Architecture

  • Runtime: Cloudflare Workers Containers (Node.js with Puppeteer)
  • Trigger: Called via the API gateway when a user initiates a re-scrape or when scheduled crawler jobs run
  • Output: Scraped product data is sent to the product service for storage and vectorization
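The exact payload sent to the product service is not specified here. As a hedged sketch, assuming the fields listed under Key Features, the record might look like the following; the `ScrapedProduct` interface and `normalizeProduct` helper are illustrative, not the service's actual contract.

```typescript
// Hypothetical shape of a scraped product record handed to the product
// service. Field names mirror this doc's Key Features list; the real
// contract may differ.
interface ScrapedProduct {
  name: string;
  description: string;
  price: number;     // numeric amount, e.g. 19.99
  currency: string;  // ISO 4217 code, e.g. "USD"
  images: string[];  // absolute image URLs
  category: string;
  available: boolean;
}

// Normalize raw values pulled out of the DOM before handing them off.
function normalizeProduct(raw: Record<string, unknown>): ScrapedProduct {
  const priceText = String(raw.price ?? "0");
  return {
    name: String(raw.name ?? "").trim(),
    description: String(raw.description ?? "").trim(),
    // Strip currency symbols and commas so "$1,299.00" becomes 1299.
    price: parseFloat(priceText.replace(/[^0-9.]/g, "")) || 0,
    currency: String(raw.currency ?? "USD"),
    images: Array.isArray(raw.images) ? raw.images.map(String) : [],
    category: String(raw.category ?? ""),
    available: raw.available !== false,
  };
}
```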

Key Features

  • Headless browser-based crawling (handles JavaScript-rendered pages)
  • Product data extraction: name, price, currency, description, images, category, availability
  • Configurable crawl targets per organization
  • Re-scrape capability triggered from the dashboard Products page
  • Rate limiting and polite crawling
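The rate-limiting strategy is not documented here. One common approach is enforcing a minimum interval between requests to the same host; the class below is a minimal sketch of that idea, not the service's actual implementation (which might instead use token buckets or robots.txt crawl-delay).

```typescript
// Minimal per-host politeness limiter: enforces a minimum gap between
// requests to the same host. Illustrative only.
class PoliteLimiter {
  private lastRequestAt = new Map<string, number>();

  constructor(private minIntervalMs: number) {}

  // Returns how long the caller should wait (ms) before hitting `host`,
  // and records the planned request time.
  delayFor(host: string, nowMs: number): number {
    const last = this.lastRequestAt.get(host);
    const wait =
      last === undefined ? 0 : Math.max(0, last + this.minIntervalMs - nowMs);
    this.lastRequestAt.set(host, nowMs + wait);
    return wait;
  }
}
```

A caller would `await` a sleep of `delayFor(host, Date.now())` milliseconds before each page fetch.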

Endpoints

Method  Path                      Description
------  ------------------------  ---------------------------------------
POST    /api/v1/crawler-jobs      Create a new crawl job for a target URL
GET     /api/v1/crawler-jobs/:id  Get crawl job status
GET     /health                   Health check
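Assuming the routes above, a caller might build a job-creation request and then poll its status. The helpers below construct plain request descriptors; the JSON body shape (`{ targetUrl }`) is an assumption for illustration, not the documented contract.

```typescript
// Plain request descriptors for the crawler-jobs endpoints.
// The body shape is an assumption; only the method and path come
// from the endpoints table above.
interface RequestSpec {
  method: string;
  url: string;
  body?: string;
}

function createJobRequest(baseUrl: string, targetUrl: string): RequestSpec {
  return {
    method: "POST",
    url: `${baseUrl}/api/v1/crawler-jobs`,
    body: JSON.stringify({ targetUrl }),
  };
}

function jobStatusRequest(baseUrl: string, jobId: string): RequestSpec {
  return {
    method: "GET",
    url: `${baseUrl}/api/v1/crawler-jobs/${encodeURIComponent(jobId)}`,
  };
}
```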

Environment

Binding               Type    Purpose
--------------------  ------  -----------------------------------
INTERNAL_GATEWAY_KEY  Secret  Auth for internal API calls
GATEWAY_API_URL       Var     API gateway URL for sending results
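Inside the Worker, these bindings are read from the `env` object. The helper below is a minimal sketch of building authenticated headers for internal calls back through the gateway; the header name `X-Internal-Key` is an assumption, not a documented value.

```typescript
// Typed view of the bindings in the Environment table above.
interface Env {
  INTERNAL_GATEWAY_KEY: string; // secret binding
  GATEWAY_API_URL: string;      // plain var binding
}

// Headers for posting scraped results back through the API gateway.
// The "X-Internal-Key" header name is hypothetical.
function gatewayHeaders(env: Env): Record<string, string> {
  return {
    "Content-Type": "application/json",
    "X-Internal-Key": env.INTERNAL_GATEWAY_KEY,
  };
}
```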

Deployment

npx wrangler deploy --env dev

The service deploys as a Workers Container. Container images are built and pushed during the deploy process.

Domain

  • Dev: dev.internal.infra-crawl.crowai.dev