core-product-service

Manages the product catalog for each organization. Handles product CRUD, web crawling via Cloudflare Browser Rendering, AI-powered product extraction from HTML, image analysis and storage in R2, and semantic search via Vectorize embeddings.

Worker name: crow-core-product-service
Domain (prod): internal.products.crowai.dev
Domain (dev): dev.internal.products.crowai.dev

Schema

crawler_job

Column	Type	Notes
id	text PK
organizationId	text
onboardingId	text	nullable
sourceType	text	`url`, `csv`, `json`
sourceValue	text	URL or file reference
status	text	default `pending`, also `in_progress`, `completed`, `failed`
crawlId	text	nullable
productsFound	integer	default `0`
productsProcessed	integer	default `0`
errorMessage	text	nullable
startedAt	timestamp	nullable
completedAt	timestamp	nullable
createdAt	timestamp
updatedAt	timestamp

product

Column	Type	Notes
id	text PK	Internal UUID
organizationId	text
externalId	text	ID from source website
title	text
description	text
images	text	JSON array of R2 URLs
price	integer	nullable, in cents
category	text	nullable
metadata	text	nullable, JSON
webPageReferences	text	nullable, source URLs
productDetailedDescription	text	nullable, AI-synthesized
crawlerJobId	text FK	nullable, references crawler_job.id
createdAt	timestamp
updatedAt	timestamp
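
Because `images` and `metadata` are stored as JSON text and `price` is an integer number of cents, callers usually need to decode a row before using it. The following is a minimal sketch of that decoding, not the service's actual code: `ProductRow`, `ProductView`, `parseJson`, and `toProductView` are hypothetical names, only a subset of columns is shown, and the fallback to empty values on malformed JSON is an assumption.

```typescript
// Hypothetical subset of the `product` row as stored (column names from the table above).
interface ProductRow {
  id: string;
  title: string;
  images: string; // JSON array of R2 URLs
  price: number | null; // integer cents
  metadata: string | null; // JSON
}

// Hypothetical decoded shape for API consumers.
interface ProductView {
  id: string;
  title: string;
  images: string[];
  priceDisplay: string | null; // e.g. "19.99"; currency is not stored in this table
  metadata: Record<string, unknown>;
}

// Parse JSON text, returning undefined for null input or malformed JSON.
function parseJson(text: string | null): unknown {
  if (text === null) return undefined;
  try {
    return JSON.parse(text);
  } catch {
    return undefined;
  }
}

function toProductView(row: ProductRow): ProductView {
  const images = parseJson(row.images);
  const metadata = parseJson(row.metadata);
  const isPlainObject =
    typeof metadata === "object" && metadata !== null && !Array.isArray(metadata);
  return {
    id: row.id,
    title: row.title,
    // Keep only string entries; anything else is treated as corrupt data.
    images: Array.isArray(images)
      ? images.filter((u): u is string => typeof u === "string")
      : [],
    // Cents to a two-decimal string; null stays null.
    priceDisplay: row.price === null ? null : (row.price / 100).toFixed(2),
    metadata: isPlainObject ? (metadata as Record<string, unknown>) : {},
  };
}
```

Keeping the cents-to-decimal conversion at the edge means the stored integer is never subject to floating-point rounding.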

product_ai_description

Column	Type	Notes
id	text PK
productId	text FK	references product.id (cascade delete)
imageUrl	text	R2 URL of the image
description	text	AI-generated caption
features	text	nullable
colors	text	nullable
materials	text	nullable
style	text	nullable
modelUsed	text	e.g. `@cf/unum/uform-gen2-qwen-500m`
createdAt	timestamp

Routes

Key routes (gateway paths `products` and `crawler-jobs`):

Method	Path	Description
POST	`/api/v1/products/crawler-jobs`	Create a crawl job
GET	`/api/v1/products/crawler-jobs/{id}`	Get crawl job status
GET	`/api/v1/products/crawler-jobs/organization/{orgId}`	List crawl jobs for org
GET	`/api/v1/products/organization/{orgId}`	List products (paginated)
GET	`/api/v1/products/{id}`	Get product by ID
GET	`/api/v1/products/search`	Semantic product search via Vectorize
GET	`/api/v1/products/images/*`	Serve product images from R2 (public)
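
As an illustration of calling the first route, the sketch below builds a request for creating a crawl job. Several details are assumptions rather than documented behavior: the body fields (`sourceType`, `sourceValue`) simply mirror the `crawler_job` columns, the `Authorization: Bearer` scheme is a guess based on the JWT verification mentioned under Secrets, and `baseUrl` and `buildCreateCrawlJobRequest` are hypothetical.

```typescript
// Assumed request body; the real CreateCrawlerJobSchema may require other fields.
interface CreateCrawlJobInput {
  sourceType: "url" | "csv" | "json";
  sourceValue: string;
}

// Build (but do not send) a POST request for the crawl-job creation route.
function buildCreateCrawlJobRequest(
  baseUrl: string, // hypothetical gateway origin
  organizationId: string,
  token: string,
  input: CreateCrawlJobInput,
): Request {
  const url = new URL("/api/v1/products/crawler-jobs", baseUrl);
  return new Request(url.toString(), {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${token}`, // assumed auth scheme
      "X-Organization-Id": organizationId, // checked by org-scoped routes
    },
    body: JSON.stringify(input),
  });
}
```

The returned `Request` can be passed to `fetch`, and the job can then be polled with `GET /api/v1/products/crawler-jobs/{id}` until its status leaves `pending` or `in_progress`.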

Environment Variables

Variable	Example
ENVIRONMENT	`dev`
AI_GATEWAY_ID	`crow-ai-gateway`
AI_MODEL	`@cf/meta/llama-3.1-8b-instruct`
CRAWLER_SERVICE_URL	`https://infra-crawl-service-dev.bitbybit-b3.workers.dev`
AUTH_SERVICE_URL	`https://dev.internal.auth-api.crowai.dev`

Secrets

Secret	Purpose
CRAWLER_SERVICE_SECRET	Auth for external crawl service
BETTER_AUTH_SECRET	JWT verification

Bindings

Binding	Type	Name
DB	D1	`crow-core-product-service-db`
R2_BUCKET	R2	`crow-core-product-service-store` (product images)
VECTORIZE	Vectorize	`crow-products` (768-dim, bge-base-en-v1.5)
AI	Workers AI	LLM inference
BROWSER	Browser Rendering	Page scraping
PRODUCT_CRAWL_QUEUE	Queue (producer + consumer)	`crow-product-crawl-queue`

Queue Configuration

Queue	Role	Batch Size	Retries
`crow-product-crawl-queue`	Producer + Consumer	1	3

Crawl jobs are enqueued by the auth service (during onboarding) or via the product service API. The same worker consumes them and runs the crawl pipeline.
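
The consumer side might look roughly like the sketch below. It is a simplified illustration, not the service's handler: the message body shape (`{ jobId }`), `handleCrawlBatch`, and the injected `runCrawl` function are hypothetical, and the local `QueueMessage`/`QueueBatch` interfaces stand in for the `Message` and `MessageBatch` types from `@cloudflare/workers-types` so that the example is self-contained.

```typescript
// Assumed message payload; the real queue message may carry more fields.
interface CrawlMessage {
  jobId: string;
}

// Minimal structural stand-ins for Cloudflare's Message / MessageBatch types.
interface QueueMessage<T> {
  body: T;
  ack(): void;
  retry(): void;
}
interface QueueBatch<T> {
  messages: readonly QueueMessage<T>[];
}

type RunCrawl = (jobId: string) => Promise<void>;

// Process each message independently: acknowledge on success, request a
// redelivery on failure. With a batch size of 1 the loop body runs once per
// invocation; the queue's retry limit (3 here) bounds how often a failing
// job is redelivered.
async function handleCrawlBatch(
  batch: QueueBatch<CrawlMessage>,
  runCrawl: RunCrawl,
): Promise<void> {
  for (const message of batch.messages) {
    try {
      await runCrawl(message.body.jobId);
      message.ack();
    } catch {
      message.retry();
    }
  }
}
```

Acknowledging per message rather than per batch keeps one failed crawl from forcing a redelivery of jobs that already succeeded, which matters if the batch size is ever raised above 1.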

AI Pipeline

1. Browser Rendering fetches page HTML
2. @cf/meta/llama-3.1-8b-instruct extracts structured product data from HTML
3. Product images are downloaded and stored in R2
4. @cf/unum/uform-gen2-qwen-500m generates image captions (stored in product_ai_description)
5. @cf/meta/llama-3.1-8b-instruct synthesizes a detailed description from text + image captions
6. @cf/baai/bge-base-en-v1.5 generates embeddings stored in crow-products Vectorize index
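
The order of the stages above can be sketched as a small orchestration function. This is an illustration under stated assumptions, not the actual implementation: `PipelineSteps`, `ExtractedProduct`, and `runCrawlPipeline` are hypothetical, each stage is injected as a function rather than calling Workers AI, R2, or Vectorize directly, and the choice to embed the synthesized description (rather than some other text) is an assumption.

```typescript
// Hypothetical shape of one product extracted from HTML.
interface ExtractedProduct {
  externalId: string;
  title: string;
  description: string;
  imageUrls: string[];
}

// One injected function per pipeline stage, in the documented order.
interface PipelineSteps {
  fetchHtml(url: string): Promise<string>; // Browser Rendering
  extractProducts(html: string): Promise<ExtractedProduct[]>; // LLM extraction
  storeImage(sourceUrl: string): Promise<string>; // returns the stored R2 URL
  captionImage(r2Url: string): Promise<string>; // image captioning model
  synthesize(product: ExtractedProduct, captions: string[]): Promise<string>;
  embedAndIndex(externalId: string, text: string): Promise<void>; // Vectorize
}

// Run the stages sequentially for one page and return the number of
// products processed.
async function runCrawlPipeline(url: string, steps: PipelineSteps): Promise<number> {
  const html = await steps.fetchHtml(url);
  const products = await steps.extractProducts(html);
  for (const product of products) {
    const captions: string[] = [];
    for (const imageUrl of product.imageUrls) {
      const storedUrl = await steps.storeImage(imageUrl);
      captions.push(await steps.captionImage(storedUrl));
    }
    const detailed = await steps.synthesize(product, captions);
    // Assumption: the synthesized description is the text that gets embedded.
    await steps.embedAndIndex(product.externalId, detailed);
  }
  return products.length;
}
```

Injecting the stages keeps the ordering testable without network access; a real worker would back them with the `BROWSER`, `AI`, `R2_BUCKET`, and `VECTORIZE` bindings.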

Dependencies

Inbound: gateway (CRUD), interaction service (product catalog lookup), pattern service (product context)
Outbound: auth service (JWT verification), external crawl service

Key Behaviors

SSRF protection: CreateCrawlerJobSchema validates URLs and blocks localhost, private IPs (10.x, 192.168.x, 172.16-31.x, 169.254.x), .crowai.dev, .internal, .local, and .localhost hostnames
BOLA (broken object level authorization) protection: all org-scoped routes check the X-Organization-Id header
Image serving: /api/v1/products/images/* is public (no auth) for embedding in dashboards
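
The SSRF rules listed above can be approximated with a hostname check like the one below. This is a sketch of the idea, not the contents of CreateCrawlerJobSchema: `isAllowedCrawlUrl` and `isBlockedIpv4` are hypothetical names, and the restriction to http/https plus the blocking of `127.x` and `[::1]` are assumptions added to cover "localhost" in IP form. The check is purely syntactic; it does not resolve DNS, so a public hostname that points at a private address would still pass.

```typescript
// Hostname suffixes named in the SSRF rule.
const BLOCKED_HOST_SUFFIXES = [".crowai.dev", ".internal", ".local", ".localhost"];

// True for dotted-quad IPv4 hosts in the blocked ranges.
function isBlockedIpv4(host: string): boolean {
  const match = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (!match) return false;
  const a = Number(match[1]);
  const b = Number(match[2]);
  return (
    a === 10 || // 10.x
    a === 127 || // loopback (assumption: treated like localhost)
    (a === 192 && b === 168) || // 192.168.x
    (a === 172 && b >= 16 && b <= 31) || // 172.16-31.x
    (a === 169 && b === 254) // 169.254.x (link-local, cloud metadata)
  );
}

function isAllowedCrawlUrl(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return false; // not a parseable absolute URL
  }
  // Assumption: only http(s) sources are crawlable.
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;
  // The URL parser lowercases the host and normalizes IPv4 notations.
  const host = url.hostname;
  if (host === "localhost" || host === "[::1]") return false;
  if (isBlockedIpv4(host)) return false;
  return !BLOCKED_HOST_SUFFIXES.some((suffix) => host.endsWith(suffix));
}
```

For example, `isAllowedCrawlUrl("http://169.254.169.254/latest")` and `isAllowedCrawlUrl("https://dev.internal.products.crowai.dev")` both return `false`, while an ordinary public storefront URL returns `true`.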
