core-product-service

Manages the product catalog for each organization. Handles product CRUD, web crawling via Cloudflare Browser Rendering, AI-powered product extraction from HTML, image analysis and storage in R2, and semantic search via Vectorize embeddings.

Worker name: crow-core-product-service
Domain (prod): internal.products.crowai.dev
Domain (dev): dev.internal.products.crowai.dev

Schema

crawler_job

| Column | Type | Notes |
| --- | --- | --- |
| id | text PK | |
| organizationId | text | |
| onboardingId | text | nullable |
| sourceType | text | url, csv, json |
| sourceValue | text | URL or file reference |
| status | text | default pending, also in_progress, completed, failed |
| crawlId | text | nullable |
| productsFound | integer | default 0 |
| productsProcessed | integer | default 0 |
| errorMessage | text | nullable |
| startedAt | timestamp | nullable |
| completedAt | timestamp | nullable |
| createdAt | timestamp | |
| updatedAt | timestamp | |

product

| Column | Type | Notes |
| --- | --- | --- |
| id | text PK | Internal UUID |
| organizationId | text | |
| externalId | text | ID from source website |
| title | text | |
| description | text | |
| images | text | JSON array of R2 URLs |
| price | integer | nullable, in cents |
| category | text | nullable |
| metadata | text | nullable, JSON |
| webPageReferences | text | nullable, source URLs |
| productDetailedDescription | text | nullable, AI-synthesized |
| crawlerJobId | text FK | nullable, references crawler_job.id |
| createdAt | timestamp | |
| updatedAt | timestamp | |
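
As a rough illustration of how a caller might handle a product row, the sketch below models a subset of the columns as a TypeScript type and decodes the two columns that need care: images (a JSON array stored as text) and price (an integer number of cents). The ProductRow type and both helper names are hypothetical. Only the column names and notes come from the table above, and treating columns without a "nullable" note as non-null is an assumption.

```typescript
// Hypothetical row type mirroring a subset of the product table columns.
interface ProductRow {
  id: string;
  organizationId: string;
  externalId: string;
  title: string;
  description: string;
  images: string;           // JSON array of R2 URLs, stored as text
  price: number | null;     // integer cents, nullable
  category: string | null;
  metadata: string | null;  // JSON, stored as text
}

// Decode the images column; fall back to an empty list on malformed JSON.
function parseImages(row: Pick<ProductRow, "images">): string[] {
  try {
    const parsed: unknown = JSON.parse(row.images);
    return Array.isArray(parsed)
      ? parsed.filter((u): u is string => typeof u === "string")
      : [];
  } catch {
    return [];
  }
}

// Convert integer cents to a decimal string using integer arithmetic only.
function formatPrice(cents: number | null): string | null {
  if (cents === null) return null;
  const sign = cents < 0 ? "-" : "";
  const abs = Math.abs(cents);
  const whole = Math.floor(abs / 100);
  const frac = String(abs % 100).padStart(2, "0");
  return `${sign}${whole}.${frac}`;
}
```

Keeping price as integer cents until display time avoids floating-point rounding when prices are compared or summed.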

product_ai_description

| Column | Type | Notes |
| --- | --- | --- |
| id | text PK | |
| productId | text FK | references product.id (cascade delete) |
| imageUrl | text | R2 URL of the image |
| description | text | AI-generated caption |
| features | text | nullable |
| colors | text | nullable |
| materials | text | nullable |
| style | text | nullable |
| modelUsed | text | e.g. @cf/unum/uform-gen2-qwen-500m |
| createdAt | timestamp | |

Routes

Key routes, exposed through the gateway under the products and crawler-jobs paths:

| Method | Path | Description |
| --- | --- | --- |
| POST | /api/v1/products/crawler-jobs | Create a crawl job |
| GET | /api/v1/products/crawler-jobs/{id} | Get crawl job status |
| GET | /api/v1/products/crawler-jobs/organization/{orgId} | List crawl jobs for org |
| GET | /api/v1/products/organization/{orgId} | List products (paginated) |
| GET | /api/v1/products/{id} | Get product by ID |
| GET | /api/v1/products/search | Semantic product search via Vectorize |
| GET | /api/v1/products/images/* | Serve product images from R2 (public) |
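
To show how a client might use the crawl job status route, here is a minimal polling sketch. The status values come from the crawler_job.status column. The CrawlJob response shape, the function names, and the default attempt count and delay are assumptions, since the response format is not documented here. The getJob callback is injected so that the sketch does not have to guess a base URL or auth headers.

```typescript
// Statuses come from the crawler_job.status column.
type CrawlStatus = "pending" | "in_progress" | "completed" | "failed";

// Assumed response shape: a subset of the crawler_job columns.
interface CrawlJob {
  id: string;
  status: CrawlStatus;
  productsFound: number;
  productsProcessed: number;
  errorMessage: string | null;
}

// Build the status path from the routes table, escaping the ID.
function crawlJobPath(id: string): string {
  return `/api/v1/products/crawler-jobs/${encodeURIComponent(id)}`;
}

function isTerminal(status: CrawlStatus): boolean {
  return status === "completed" || status === "failed";
}

// Poll until the job reaches a terminal status or the attempts run out.
async function waitForCrawlJob(
  getJob: (path: string) => Promise<CrawlJob>,
  id: string,
  maxAttempts = 30,
  delayMs = 2000,
): Promise<CrawlJob> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const job = await getJob(crawlJobPath(id));
    if (isTerminal(job.status)) return job;
    if (attempt < maxAttempts) {
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw new Error(`Crawl job ${id} did not finish after ${maxAttempts} attempts`);
}
```

In practice getJob would wrap fetch with whatever gateway origin and credentials the caller already has.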

Environment Variables

| Variable | Example |
| --- | --- |
| ENVIRONMENT | dev |
| AI_GATEWAY_ID | crow-ai-gateway |
| AI_MODEL | @cf/meta/llama-3.1-8b-instruct |
| CRAWLER_SERVICE_URL | https://infra-crawl-service-dev.bitbybit-b3.workers.dev |
| AUTH_SERVICE_URL | https://dev.internal.auth-api.crowai.dev |

Secrets

| Secret | Purpose |
| --- | --- |
| CRAWLER_SERVICE_SECRET | Auth for external crawl service |
| BETTER_AUTH_SECRET | JWT verification |

Bindings

| Binding | Type | Name / Purpose |
| --- | --- | --- |
| DB | D1 | crow-core-product-service-db |
| R2_BUCKET | R2 | crow-core-product-service-store (product images) |
| VECTORIZE | Vectorize | crow-products (768-dim, bge-base-en-v1.5) |
| AI | Workers AI | LLM inference |
| BROWSER | Browser Rendering | Page scraping |
| PRODUCT_CRAWL_QUEUE | Queue (producer + consumer) | crow-product-crawl-queue |
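
One way to catch configuration drift early is to check at startup that every variable, secret, and binding listed above is present. The sketch below is illustrative only: the key names come from the tables in this document, while the helper name missingEnvKeys and the assumption that all of them are required are mine.

```typescript
// Names taken from the Environment Variables, Secrets, and Bindings tables.
const REQUIRED_ENV_KEYS = [
  "ENVIRONMENT",
  "AI_GATEWAY_ID",
  "AI_MODEL",
  "CRAWLER_SERVICE_URL",
  "AUTH_SERVICE_URL",
  "CRAWLER_SERVICE_SECRET",
  "BETTER_AUTH_SECRET",
  "DB",
  "R2_BUCKET",
  "VECTORIZE",
  "AI",
  "BROWSER",
  "PRODUCT_CRAWL_QUEUE",
] as const;

type EnvKey = (typeof REQUIRED_ENV_KEYS)[number];

// Return the names that are absent or empty so the caller can fail fast.
function missingEnvKeys(env: Record<string, unknown>): EnvKey[] {
  return REQUIRED_ENV_KEYS.filter((key) => {
    const value = env[key];
    return value === undefined || value === null || value === "";
  });
}
```

A Worker could call this at the top of its handlers and return a 500 with the missing names in non-production environments.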

Queue Configuration

| Queue | Role | Batch Size | Retries |
| --- | --- | --- | --- |
| crow-product-crawl-queue | Producer + Consumer | 1 | 3 |

Crawl jobs are enqueued by the auth service (during onboarding) or via the product service API. The same worker consumes them and runs the crawl pipeline.
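
The consumer side can be sketched roughly as follows. The QueueMessage and QueueBatch interfaces are minimal local stand-ins for the parts of the Cloudflare Queues consumer API used here (body, ack(), retry(), and messages). The CrawlMessage payload and the runCrawl callback are hypothetical, since the actual message schema is not documented in this section.

```typescript
// Minimal structural types covering only the Queues consumer members used below.
interface QueueMessage<T> {
  readonly body: T;
  ack(): void;
  retry(): void;
}

interface QueueBatch<T> {
  readonly messages: readonly QueueMessage<T>[];
}

// Hypothetical payload; the real message schema may differ.
interface CrawlMessage {
  crawlerJobId: string;
}

// Process each message independently: ack on success, request a retry on failure.
async function handleCrawlBatch(
  batch: QueueBatch<CrawlMessage>,
  runCrawl: (crawlerJobId: string) => Promise<void>,
): Promise<void> {
  for (const message of batch.messages) {
    try {
      await runCrawl(message.body.crawlerJobId);
      message.ack();
    } catch (err) {
      console.error("crawl failed, requesting retry", err);
      message.retry();
    }
  }
}
```

In a real Worker this function would be called from the queue(batch, env, ctx) handler on the default export. With a batch size of 1, each invocation sees a single message, so one failing crawl does not hold up others in the same batch.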

AI Pipeline

  1. Browser Rendering fetches page HTML
  2. @cf/meta/llama-3.1-8b-instruct extracts structured product data from HTML
  3. Product images are downloaded and stored in R2
  4. @cf/unum/uform-gen2-qwen-500m generates image captions (stored in product_ai_description)
  5. @cf/meta/llama-3.1-8b-instruct synthesizes a detailed description from text + image captions
  6. @cf/baai/bge-base-en-v1.5 generates embeddings stored in crow-products Vectorize index
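
The six steps above can be sketched as a small orchestrator. This is an illustration of the control flow only, not the service's actual code: every type and function name below is hypothetical, each step is injected as a callback so no Workers AI, R2, or Vectorize call signature is assumed, and embedding the synthesized description (rather than some other text) is my assumption.

```typescript
interface ExtractedProduct {
  externalId: string;
  title: string;
  description: string;
  imageUrls: string[];
}

// One injected function per pipeline step; all names are hypothetical.
interface PipelineSteps {
  fetchHtml(url: string): Promise<string>;                       // step 1
  extractProducts(html: string): Promise<ExtractedProduct[]>;    // step 2
  storeImage(sourceUrl: string): Promise<string>;                // step 3, returns the R2 URL
  captionImage(r2Url: string): Promise<string>;                  // step 4
  synthesize(product: ExtractedProduct, captions: string[]): Promise<string>; // step 5
  embedAndIndex(externalId: string, text: string): Promise<void>;             // step 6
}

interface ProcessedProduct extends ExtractedProduct {
  images: string[];
  captions: string[];
  productDetailedDescription: string;
}

async function runPipeline(url: string, steps: PipelineSteps): Promise<ProcessedProduct[]> {
  const html = await steps.fetchHtml(url);
  const extracted = await steps.extractProducts(html);
  const results: ProcessedProduct[] = [];
  for (const product of extracted) {
    const images: string[] = [];
    const captions: string[] = [];
    for (const sourceUrl of product.imageUrls) {
      const r2Url = await steps.storeImage(sourceUrl);
      images.push(r2Url);
      captions.push(await steps.captionImage(r2Url));
    }
    const detailed = await steps.synthesize(product, captions);
    await steps.embedAndIndex(product.externalId, detailed);
    results.push({ ...product, images, captions, productDetailedDescription: detailed });
  }
  return results;
}
```

Injecting the steps keeps the ordering testable with fakes and makes it easy to swap a model without touching the control flow.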

Dependencies

  • Inbound: gateway (CRUD), interaction service (product catalog lookup), pattern service (product context)
  • Outbound: auth service (JWT verification), external crawl service

Key Behaviors

  • SSRF protection: CreateCrawlerJobSchema validates URLs and blocks localhost, private IPs (10.x, 192.168.x, 172.16-31.x, 169.254.x), .crowai.dev, .internal, .local, and .localhost hostnames
  • BOLA (broken object-level authorization) protection: all org-scoped routes check the X-Organization-Id header
  • Image serving: /api/v1/products/images/* is public (no auth) for embedding in dashboards
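
The SSRF rule above could look roughly like the following check. This is a minimal sketch, not the actual CreateCrawlerJobSchema: the function names are hypothetical, and three details are assumptions on my part, namely blocking 127.x as part of "localhost", allowing only http and https, and blocking the bare names (such as crowai.dev) along with their subdomains.

```typescript
// Hostname suffixes named in the SSRF rule.
const BLOCKED_SUFFIXES = [".crowai.dev", ".internal", ".local", ".localhost"];

// True for dotted-quad IPv4 addresses in the listed private and link-local ranges.
// 127.x is included as an assumption, since the rule only says "localhost".
function isBlockedIPv4(host: string): boolean {
  const match = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (!match) return false;
  const a = Number(match[1]);
  const b = Number(match[2]);
  return (
    a === 10 ||
    a === 127 ||
    (a === 192 && b === 168) ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 169 && b === 254)
  );
}

function isAllowedCrawlUrl(input: string): boolean {
  let url: URL;
  try {
    url = new URL(input);
  } catch {
    return false; // not a parseable absolute URL
  }
  // Assumption: only http(s) sources are crawlable.
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;
  // Normalize case and drop a trailing dot ("localhost." resolves like "localhost").
  const host = url.hostname.toLowerCase().replace(/\.$/, "");
  if (isBlockedIPv4(host)) return false;
  // Block both the bare name ("localhost") and any subdomain ("a.localhost").
  return !BLOCKED_SUFFIXES.some(
    (suffix) => host === suffix.slice(1) || host.endsWith(suffix),
  );
}
```

A string-level check like this does not cover IPv6 literals or hostnames that resolve to private addresses at request time, so it would normally be paired with controls at the network layer.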