Infra Crawl Service

Overview

The infra-crawl-service is a Workers Container that runs headless Puppeteer to crawl and scrape product data from e-commerce websites. It powers the product catalog by extracting product names, descriptions, prices, images, categories, and availability from target storefronts.

Architecture

  • Runtime: Cloudflare Workers Containers (Node.js with Puppeteer)
  • Trigger: Called via the API gateway when a user initiates a re-scrape or when scheduled crawler jobs run
  • Output: Scraped product data is sent to the product service for storage and vectorization
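The exact payload sent to the product service is not specified here. As a hedged sketch, assuming the fields listed under Key Features, the record might look like the following; the `ScrapedProduct` interface and `normalizeProduct` helper are illustrative, not the service's actual contract.

```typescript
// Hypothetical shape of a scraped product record handed to the product
// service. Field names mirror this doc's Key Features list; the real
// contract may differ.
interface ScrapedProduct {
  name: string;
  description: string;
  price: number;     // numeric amount, e.g. 19.99
  currency: string;  // ISO 4217 code, e.g. "USD"
  images: string[];  // absolute image URLs
  category: string;
  available: boolean;
}

// Normalize raw values pulled out of the DOM before handing them off.
function normalizeProduct(raw: Record<string, unknown>): ScrapedProduct {
  const priceText = String(raw.price ?? "0");
  return {
    name: String(raw.name ?? "").trim(),
    description: String(raw.description ?? "").trim(),
    // Strip currency symbols and commas so "$1,299.00" becomes 1299.
    price: parseFloat(priceText.replace(/[^0-9.]/g, "")) || 0,
    currency: String(raw.currency ?? "USD"),
    images: Array.isArray(raw.images) ? raw.images.map(String) : [],
    category: String(raw.category ?? ""),
    available: raw.available !== false,
  };
}
```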

Key Features

  • Headless browser-based crawling (handles JavaScript-rendered pages)
  • Product data extraction: name, price, currency, description, images, category, availability
  • Configurable crawl targets per organization
  • Re-scrape capability triggered from the dashboard Products page
  • Rate limiting and polite crawling
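The rate-limiting strategy is not documented here. One common approach is enforcing a minimum interval between requests to the same host; the class below is a minimal sketch of that idea, not the service's actual implementation (which might instead use token buckets or robots.txt crawl-delay).

```typescript
// Minimal per-host politeness limiter: enforces a minimum gap between
// requests to the same host. Illustrative only.
class PoliteLimiter {
  private lastRequestAt = new Map<string, number>();

  constructor(private minIntervalMs: number) {}

  // Returns how long the caller should wait (ms) before hitting `host`,
  // and records the planned request time.
  delayFor(host: string, nowMs: number): number {
    const last = this.lastRequestAt.get(host);
    const wait =
      last === undefined ? 0 : Math.max(0, last + this.minIntervalMs - nowMs);
    this.lastRequestAt.set(host, nowMs + wait);
    return wait;
  }
}
```

A caller would `await` a sleep of `delayFor(host, Date.now())` milliseconds before each page fetch.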

Endpoints

Method  Path                      Description
------  ------------------------  ---------------------------------------
POST    /api/v1/crawler-jobs      Create a new crawl job for a target URL
GET     /api/v1/crawler-jobs/:id  Get crawl job status
GET     /health                   Health check
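Assuming the routes above, a caller might build a job-creation request and then poll its status. The helpers below construct plain request descriptors; the JSON body shape (`{ targetUrl }`) is an assumption for illustration, not the documented contract.

```typescript
// Plain request descriptors for the crawler-jobs endpoints.
// The body shape is an assumption; only the method and path come
// from the endpoints table above.
interface RequestSpec {
  method: string;
  url: string;
  body?: string;
}

function createJobRequest(baseUrl: string, targetUrl: string): RequestSpec {
  return {
    method: "POST",
    url: `${baseUrl}/api/v1/crawler-jobs`,
    body: JSON.stringify({ targetUrl }),
  };
}

function jobStatusRequest(baseUrl: string, jobId: string): RequestSpec {
  return {
    method: "GET",
    url: `${baseUrl}/api/v1/crawler-jobs/${encodeURIComponent(jobId)}`,
  };
}
```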

Environment

Binding               Type    Purpose
--------------------  ------  -----------------------------------
INTERNAL_GATEWAY_KEY  Secret  Auth for internal API calls
GATEWAY_API_URL       Var     API gateway URL for sending results
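Inside the Worker, these bindings are read from the `env` object. The helper below is a minimal sketch of building authenticated headers for internal calls back through the gateway; the header name `X-Internal-Key` is an assumption, not a documented value.

```typescript
// Typed view of the bindings in the Environment table above.
interface Env {
  INTERNAL_GATEWAY_KEY: string; // secret binding
  GATEWAY_API_URL: string;      // plain var binding
}

// Headers for posting scraped results back through the API gateway.
// The "X-Internal-Key" header name is hypothetical.
function gatewayHeaders(env: Env): Record<string, string> {
  return {
    "Content-Type": "application/json",
    "X-Internal-Key": env.INTERNAL_GATEWAY_KEY,
  };
}
```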

Deployment

npx wrangler deploy --env dev

The service deploys as a Workers Container. Container images are built and pushed during the deploy process.

Domain

  • Dev: dev.internal.infra-crawl.crowai.dev