core-social-collector
Social media collection service that gathers mentions, reviews, and brand-related content from social platforms. Runs on a cron schedule (every 2 hours) and supports manual triggering. Collected items are dispatched to a Queue for downstream processing by the social processor.
Worker name: crow-social-collector
Domain (prod): social-collector.crowai.dev
Domain (dev): dev.social-collector.crowai.dev
Architecture
The collector supports two collection strategies:
- Search collection -- Uses Tavily API to search for brand mentions and keywords across the web. Configured via
social_search_configswith keywords, brands, and region filters. - Direct collection -- Scrapes specific social platform accounts configured via
social_source_configswith platform account IDs and handles.
Cron (every 2h) / Manual trigger
|
v
Collector Orchestrator
/ \
Search Collection Direct Collection
\ /
v
Queue Dispatcher --> SOCIAL_PROCESSING_QUEUE --> Social Processor
Schema
social_source_configs
| Column | Type | Notes |
|---|---|---|
| id | text PK | |
| org_id | text | Organization ID |
| platform | text | twitter, reddit, instagram, tiktok, linkedin, facebook, youtube, news |
| platform_account_id | text | Platform-specific account identifier |
| account_handle | text | Display handle (e.g., @brand) |
| enabled | integer | 1 = active, 0 = disabled |
| last_cursor | text | Pagination cursor for incremental fetching |
| last_fetched_at | integer | Timestamp of last fetch |
| created_at | integer | |
| updated_at | integer |
social_search_configs
| Column | Type | Notes |
|---|---|---|
| id | text PK | |
| org_id | text | Organization ID |
| keywords | text | JSON array of search keywords |
| brands | text | JSON array of brand names |
| region | text | NA-EN, EU-Multi, AP-Multi, all |
| enabled | integer | 1 = active, 0 = disabled |
| created_at | integer | |
| updated_at | integer |
seen_links
| Column | Type | Notes |
|---|---|---|
| url | text PK | Deduplicated URL |
| org_id | text | Organization ID |
| first_seen | integer | When first discovered |
| last_checked | integer | When last checked for changes |
| content_hash | text | Hash for change detection |
| source | text | Collection source type |
Routes
| Method | Path | Description |
|---|---|---|
| GET | / | Health check with service info |
| GET | /api/v1/configs/{orgId} | List source configs for an organization |
| POST | /api/v1/configs | Create a source config |
| PUT | /api/v1/configs/{id} | Update a source config |
| DELETE | /api/v1/configs/{id} | Delete a source config |
| GET | /api/v1/search-configs/{orgId} | List search configs for an organization |
| POST | /api/v1/search-configs | Create a search config |
| PUT | /api/v1/search-configs/{id} | Update a search config |
| DELETE | /api/v1/search-configs/{id} | Delete a search config |
| POST | /api/v1/collect/{orgId} | Manually trigger collection for an organization |
| GET | /docs | OpenAPI documentation |
Cron Schedule
The service has a scheduled handler triggered every 2 hours (0 */2 * * *). On each invocation it:
- Queries all enabled source and search configs
- Groups configs by organization
- Runs search collection and direct collection for each org
- Dispatches collected items to the
SOCIAL_PROCESSING_QUEUE
Environment Variables
| Variable | Example |
|---|---|
| ENVIRONMENT | dev |
| AI_GATEWAY_ID | crow-ai-gateway |
Secrets
| Secret | Purpose |
|---|---|
| TAVILY_API_KEY | Tavily web search API for keyword-based collection |
| SYSTEM_SECRET | System-level authentication |
| INTERNAL_GATEWAY_KEY | Internal service authentication |
Bindings
| Binding | Type | Name |
|---|---|---|
| DB | D1 | crow-social-collector-db |
| AI | Workers AI | AI inference (search query generation) |
| SOCIAL_PROCESSING_QUEUE | Queue | crow-social-processing-queue |
Dependencies
- Inbound: cron trigger, manual API calls
- Outbound: Tavily API (search), social processor (via Queue)
Key Behaviors
- Deduplication: Tracks seen URLs in the
seen_linkstable to avoid re-processing - Per-org isolation: Each organization's collection runs independently; one failure does not block others
- Platform support: Supports Twitter, Reddit, Instagram, TikTok, LinkedIn, Facebook, YouTube, and news sources
- Region filtering: Search configs can target specific regions (NA-EN, EU-Multi, AP-Multi, or all)
Deployment
cd core-social-collector
npx wrangler deploy # prod
npx wrangler deploy --env dev # dev