Manages the product catalog for each organization. Handles product CRUD, web crawling via Cloudflare Browser Rendering, AI-powered product extraction from HTML, image analysis and storage in R2, and semantic search via Vectorize embeddings.
- Worker name: crow-core-product-service
- Domain (prod): internal.products.crowai.dev
- Domain (dev): dev.internal.products.crowai.dev
Schema
crawler_job
| Column | Type | Notes |
|---|---|---|
| id | text PK | |
| organizationId | text | |
| onboardingId | text | nullable |
| sourceType | text | url, csv, json |
| sourceValue | text | URL or file reference |
| status | text | default pending, also in_progress, completed, failed |
| crawlId | text | nullable |
| productsFound | integer | default 0 |
| productsProcessed | integer | default 0 |
| errorMessage | text | nullable |
| startedAt | timestamp | nullable |
| completedAt | timestamp | nullable |
| createdAt | timestamp | |
| updatedAt | timestamp | |
product
| Column | Type | Notes |
|---|---|---|
| id | text PK | Internal UUID |
| organizationId | text | |
| externalId | text | ID from source website |
| title | text | |
| description | text | |
| images | text | JSON array of R2 URLs |
| price | integer | nullable, in cents |
| category | text | nullable |
| metadata | text | nullable, JSON |
| webPageReferences | text | nullable, source URLs |
| productDetailedDescription | text | nullable, AI-synthesized |
| crawlerJobId | text FK | nullable, references crawler_job.id |
| createdAt | timestamp | |
| updatedAt | timestamp | |
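
Because images and metadata are JSON stored in text columns and price is an integer number of cents, callers need to decode a row before using it. The following is a minimal sketch of that decoding, not the service's actual code: the `ProductRow` and `Product` interfaces and the `parseProductRow` and `formatPrice` helpers are hypothetical names, and only a subset of the columns is shown.

```typescript
// Raw row shape as stored in D1 (subset of the product table columns).
interface ProductRow {
  id: string;
  title: string;
  images: string; // JSON array of R2 URLs, stored as text
  price: number | null; // integer cents, nullable
  metadata: string | null; // JSON, nullable
}

// Decoded shape (hypothetical; field names are illustrative).
interface Product {
  id: string;
  title: string;
  images: string[];
  priceCents: number | null;
  metadata: Record<string, unknown> | null;
}

// Decode the JSON text columns and validate that images is a string array.
function parseProductRow(row: ProductRow): Product {
  const images: unknown = JSON.parse(row.images);
  if (!Array.isArray(images) || !images.every((u) => typeof u === "string")) {
    throw new Error(`product ${row.id}: images is not a JSON array of strings`);
  }
  return {
    id: row.id,
    title: row.title,
    images: images as string[],
    priceCents: row.price,
    metadata:
      row.metadata === null
        ? null
        : (JSON.parse(row.metadata) as Record<string, unknown>),
  };
}

// Format integer cents as a decimal string using integer arithmetic only.
function formatPrice(cents: number | null): string {
  if (cents === null) return "n/a";
  const sign = cents < 0 ? "-" : "";
  const abs = Math.abs(cents);
  return `${sign}${Math.floor(abs / 100)}.${String(abs % 100).padStart(2, "0")}`;
}
```

Keeping the price as integer cents until display time avoids floating-point rounding on stored amounts.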
product_ai_description
| Column | Type | Notes |
|---|---|---|
| id | text PK | |
| productId | text FK | references product.id (cascade delete) |
| imageUrl | text | R2 URL of the image |
| description | text | AI-generated caption |
| features | text | nullable |
| colors | text | nullable |
| materials | text | nullable |
| style | text | nullable |
| modelUsed | text | e.g. @cf/unum/uform-gen2-qwen-500m |
| createdAt | timestamp | |
Routes
Key routes (gateway paths: products and crawler-jobs):
| Method | Path | Description |
|---|---|---|
| POST | /api/v1/products/crawler-jobs | Create a crawl job |
| GET | /api/v1/products/crawler-jobs/{id} | Get crawl job status |
| GET | /api/v1/products/crawler-jobs/organization/{orgId} | List crawl jobs for org |
| GET | /api/v1/products/organization/{orgId} | List products (paginated) |
| GET | /api/v1/products/{id} | Get product by ID |
| GET | /api/v1/products/search | Semantic product search via Vectorize |
| GET | /api/v1/products/images/* | Serve product images from R2 (public) |
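
The semantic search route can be pictured as two steps: embed the query text, then query the Vectorize index with that vector. The sketch below illustrates the idea under several assumptions: the local `EmbeddingModel` and `VectorIndex` interfaces are simplified stand-ins for the Workers AI and Vectorize bindings, `searchProducts` is a hypothetical helper, and filtering on an `organizationId` metadata field is a guess at how results are scoped per organization.

```typescript
// Simplified stand-ins for the AI and VECTORIZE bindings so the sketch
// type-checks outside a Worker. Real binding types come from
// @cloudflare/workers-types.
interface EmbeddingModel {
  run(model: string, input: { text: string[] }): Promise<{ data: number[][] }>;
}
interface VectorMatch {
  id: string;
  score: number;
}
interface VectorIndex {
  query(
    vector: number[],
    options: { topK: number; filter?: Record<string, string> },
  ): Promise<{ matches: VectorMatch[] }>;
}

const EMBEDDING_MODEL = "@cf/baai/bge-base-en-v1.5";
const EMBEDDING_DIMENSIONS = 768; // matches the crow-products index

// Hypothetical helper: embed the query, then look up nearest products.
async function searchProducts(
  ai: EmbeddingModel,
  index: VectorIndex,
  organizationId: string,
  query: string,
  topK = 10,
): Promise<VectorMatch[]> {
  const embedding = await ai.run(EMBEDDING_MODEL, { text: [query] });
  const vector = embedding.data[0];
  if (!vector || vector.length !== EMBEDDING_DIMENSIONS) {
    throw new Error("embedding has unexpected shape");
  }
  // Assumption: vectors carry an organizationId metadata field to filter on.
  const result = await index.query(vector, {
    topK,
    filter: { organizationId },
  });
  return result.matches;
}
```

The query must be embedded with the same model that produced the indexed vectors, otherwise the similarity scores are not meaningful.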
Environment Variables
| Variable | Example |
|---|---|
| ENVIRONMENT | dev |
| AI_GATEWAY_ID | crow-ai-gateway |
| AI_MODEL | @cf/meta/llama-3.1-8b-instruct |
| CRAWLER_SERVICE_URL | https://infra-crawl-service-dev.bitbybit-b3.workers.dev |
| AUTH_SERVICE_URL | https://dev.internal.auth-api.crowai.dev |
Secrets
| Secret | Purpose |
|---|---|
| CRAWLER_SERVICE_SECRET | Auth for external crawl service |
| BETTER_AUTH_SECRET | JWT verification |
Bindings
| Binding | Type | Name |
|---|---|---|
| DB | D1 | crow-core-product-service-db |
| R2_BUCKET | R2 | crow-core-product-service-store (product images) |
| VECTORIZE | Vectorize | crow-products (768-dim, bge-base-en-v1.5) |
| AI | Workers AI | LLM inference |
| BROWSER | Browser Rendering | Page scraping |
| PRODUCT_CRAWL_QUEUE | Queue (producer + consumer) | crow-product-crawl-queue |
Queue Configuration
| Queue | Role | Batch Size | Retries |
|---|---|---|---|
| crow-product-crawl-queue | Producer + Consumer | 1 | 3 |
Crawl jobs are enqueued by the auth service (during onboarding) or via the product service API. The same worker consumes them and runs the crawl pipeline.
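
The consumer side of this arrangement could look roughly like the sketch below. It is illustrative only: the `CrawlMessageBody` payload shape and the `handleCrawlBatch` and `runCrawl` names are assumptions, and the local `QueueMessage` and `QueueBatch` interfaces are trimmed-down stand-ins for the Cloudflare Queues message types, whose messages expose `ack()` and `retry()`.

```typescript
// Hypothetical message payload: the job to crawl.
interface CrawlMessageBody {
  crawlerJobId: string;
}

// Trimmed-down stand-ins for the Cloudflare Queues message and batch types.
interface QueueMessage<T> {
  body: T;
  ack(): void;
  retry(): void;
}
interface QueueBatch<T> {
  messages: readonly QueueMessage<T>[];
}

// Process each message independently: acknowledge on success, request
// redelivery on failure. With a batch size of 1 the loop runs once per call.
async function handleCrawlBatch(
  batch: QueueBatch<CrawlMessageBody>,
  runCrawl: (crawlerJobId: string) => Promise<void>,
): Promise<void> {
  for (const message of batch.messages) {
    try {
      await runCrawl(message.body.crawlerJobId);
      message.ack();
    } catch {
      // The queue redelivers until its retry limit (3 here) is exhausted.
      message.retry();
    }
  }
}
```

In a Worker, a function like this would be called from the exported `queue` handler, with `runCrawl` bound to the crawl pipeline.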
AI Pipeline
- Browser Rendering fetches page HTML
- @cf/meta/llama-3.1-8b-instruct extracts structured product data from HTML
- Product images are downloaded and stored in R2
- @cf/unum/uform-gen2-qwen-500m generates image captions (stored in product_ai_description)
- @cf/meta/llama-3.1-8b-instruct synthesizes a detailed description from text + image captions
- @cf/baai/bge-base-en-v1.5 generates embeddings stored in crow-products Vectorize index
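
The ordering of these steps can be sketched as a small orchestration function. This is an illustration of the data flow rather than the service's implementation: every name here (`PipelineSteps`, `runPipeline`, the step functions, and the `ExtractedProduct` shape) is hypothetical, each step is injected so the sketch runs without Cloudflare bindings, and embedding the synthesized description (rather than some other text) is an assumption.

```typescript
// Hypothetical shape of what the extraction model returns per product.
interface ExtractedProduct {
  title: string;
  description: string;
  imageUrls: string[];
}

// One injected function per pipeline stage.
interface PipelineSteps {
  fetchHtml(url: string): Promise<string>; // Browser Rendering
  extractProducts(html: string): Promise<ExtractedProduct[]>; // llama-3.1-8b-instruct
  storeImage(sourceUrl: string): Promise<string>; // download, store in R2, return R2 URL
  captionImage(r2Url: string): Promise<string>; // uform-gen2-qwen-500m
  synthesizeDescription(
    product: ExtractedProduct,
    captions: string[],
  ): Promise<string>; // llama-3.1-8b-instruct
  embedAndIndex(text: string): Promise<void>; // bge-base-en-v1.5 + Vectorize
}

interface ProcessedProduct {
  title: string;
  images: string[];
  captions: string[];
  detailedDescription: string;
}

// Run the stages in the documented order for every product on the page.
async function runPipeline(
  url: string,
  steps: PipelineSteps,
): Promise<ProcessedProduct[]> {
  const html = await steps.fetchHtml(url);
  const extracted = await steps.extractProducts(html);
  const results: ProcessedProduct[] = [];
  for (const product of extracted) {
    const images: string[] = [];
    const captions: string[] = [];
    for (const sourceUrl of product.imageUrls) {
      const r2Url = await steps.storeImage(sourceUrl);
      images.push(r2Url);
      captions.push(await steps.captionImage(r2Url));
    }
    const detailedDescription = await steps.synthesizeDescription(product, captions);
    // Assumption: the synthesized description is the text that gets embedded.
    await steps.embedAndIndex(detailedDescription);
    results.push({ title: product.title, images, captions, detailedDescription });
  }
  return results;
}
```

Injecting the steps keeps the ordering testable with stubs and makes it straightforward to swap a model without touching the orchestration.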
Dependencies
- Inbound: gateway (CRUD), interaction service (product catalog lookup), pattern service (product context)
- Outbound: auth service (JWT verification), external crawl service
Key Behaviors
- SSRF protection: CreateCrawlerJobSchema validates URLs and blocks localhost, private IPs (10.x, 192.168.x, 172.16-31.x, 169.254.x), .crowai.dev, .internal, .local, and .localhost hostnames
- BOLA: All org-scoped routes check X-Organization-Id
- Image serving: /api/v1/products/images/* is public (no auth) for embedding in dashboards
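
The SSRF rule listed above can be approximated with a hostname check like the following. This is a sketch of the idea, not the contents of CreateCrawlerJobSchema: the `isBlockedCrawlUrl` and `isPrivateIPv4` names are hypothetical, rejecting non-HTTP(S) schemes is an added assumption, and only the ranges and suffixes named in the list are covered.

```typescript
// Hostname suffixes named in the SSRF rule.
const BLOCKED_SUFFIXES = [".crowai.dev", ".internal", ".local", ".localhost"];

// True for dotted-quad hosts in 10.x, 192.168.x, 172.16-31.x, or 169.254.x.
function isPrivateIPv4(host: string): boolean {
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (!m) return false;
  const a = Number(m[1]);
  const b = Number(m[2]);
  return (
    a === 10 ||
    (a === 192 && b === 168) ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 169 && b === 254)
  );
}

// Returns true when the URL must not be crawled.
function isBlockedCrawlUrl(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return true; // unparseable input is rejected
  }
  // Assumption: only http(s) targets are ever valid crawl sources.
  if (url.protocol !== "http:" && url.protocol !== "https:") return true;
  const host = url.hostname.toLowerCase();
  if (host === "localhost") return true;
  if (isPrivateIPv4(host)) return true;
  return BLOCKED_SUFFIXES.some((suffix) => host.endsWith(suffix));
}
```

Parsing with `URL` first means the check runs on the normalized hostname instead of the raw string, which closes off tricks such as mixed-case hosts or credentials embedded before the host.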