core-social-collector

Social media collection service that gathers mentions, reviews, and brand-related content from social platforms. Runs on a cron schedule (every 2 hours) and supports manual triggering. Collected items are dispatched to a Queue for downstream processing by the social processor.

Worker name: crow-social-collector
Domain (prod): social-collector.crowai.dev
Domain (dev): dev.social-collector.crowai.dev

Architecture

The collector supports two collection strategies:

  1. Search collection -- Uses the Tavily API to search the web for brand mentions and keywords. Configured via social_search_configs with keywords, brands, and region filters.
  2. Direct collection -- Scrapes specific social platform accounts configured via social_source_configs with platform account IDs and handles.

      Cron (every 2h) / Manual trigger
                     |
                     v
           Collector Orchestrator
              /              \
    Search Collection   Direct Collection
              \              /
                     v
    Queue Dispatcher --> SOCIAL_PROCESSING_QUEUE --> Social Processor
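
The flow in the diagram can be sketched for a single organization as follows. This is a minimal illustration, not the service's actual code: the `CollectedItem` shape, the `Strategy` and `Dispatch` types, and the function names are all hypothetical, and merging the two result sets by URL before dispatch is an assumption.

```typescript
// Hypothetical item shape; the real queue payload is not documented here.
interface CollectedItem {
  url: string;
  orgId: string;
  source: "search" | "direct";
}

// A strategy returns the items it found for one organization.
type Strategy = (orgId: string) => Promise<CollectedItem[]>;

// Stand-in for the Queue Dispatcher stage.
type Dispatch = (items: CollectedItem[]) => Promise<void>;

// Merge both result sets, keeping the first item seen for each URL (assumed behavior).
function mergeByUrl(search: CollectedItem[], direct: CollectedItem[]): CollectedItem[] {
  const byUrl = new Map<string, CollectedItem>();
  for (const item of search.concat(direct)) {
    if (!byUrl.has(item.url)) byUrl.set(item.url, item);
  }
  return Array.from(byUrl.values());
}

// Run both strategies for one org, then hand the merged items to the dispatcher.
async function collectForOrg(
  orgId: string,
  searchCollection: Strategy,
  directCollection: Strategy,
  dispatch: Dispatch,
): Promise<number> {
  // The two branches are independent, so they can run concurrently.
  const [searchItems, directItems] = await Promise.all([
    searchCollection(orgId),
    directCollection(orgId),
  ]);
  const merged = mergeByUrl(searchItems, directItems);
  if (merged.length > 0) await dispatch(merged);
  return merged.length;
}
```

Passing the strategies and the dispatcher in as functions keeps the orchestration step testable without the Tavily API or a real Queue binding.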

Schema

social_source_configs

| Column | Type | Notes |
|---|---|---|
| id | text PK | |
| org_id | text | Organization ID |
| platform | text | twitter, reddit, instagram, tiktok, linkedin, facebook, youtube, news |
| platform_account_id | text | Platform-specific account identifier |
| account_handle | text | Display handle (e.g., @brand) |
| enabled | integer | 1 = active, 0 = disabled |
| last_cursor | text | Pagination cursor for incremental fetching |
| last_fetched_at | integer | Timestamp of last fetch |
| created_at | integer | |
| updated_at | integer | |

social_search_configs

| Column | Type | Notes |
|---|---|---|
| id | text PK | |
| org_id | text | Organization ID |
| keywords | text | JSON array of search keywords |
| brands | text | JSON array of brand names |
| region | text | NA-EN, EU-Multi, AP-Multi, all |
| enabled | integer | 1 = active, 0 = disabled |
| created_at | integer | |
| updated_at | integer | |
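
Because keywords and brands are stored as JSON arrays in text columns, a reader of this table has to decode them before use. The sketch below shows one way to do that. The `SearchConfig` type and the function names are hypothetical, and falling back to `all` for an unrecognized region and to an empty list for malformed JSON are assumptions, not documented behavior.

```typescript
// Row shape mirroring the social_search_configs columns.
interface SearchConfigRow {
  id: string;
  org_id: string;
  keywords: string; // JSON array stored as text
  brands: string; // JSON array stored as text
  region: string;
  enabled: number; // 1 = active, 0 = disabled
  created_at: number;
  updated_at: number;
}

const REGIONS = ["NA-EN", "EU-Multi", "AP-Multi", "all"] as const;
type Region = (typeof REGIONS)[number];

// Hypothetical decoded form used by application code.
interface SearchConfig {
  id: string;
  orgId: string;
  keywords: string[];
  brands: string[];
  region: Region;
  enabled: boolean;
}

// Decode a JSON text column into a string array; malformed input yields [].
function parseStringArray(raw: string): string[] {
  try {
    const value: unknown = JSON.parse(raw);
    if (!Array.isArray(value)) return [];
    return value.filter((v): v is string => typeof v === "string");
  } catch {
    return [];
  }
}

function toSearchConfig(row: SearchConfigRow): SearchConfig {
  const known = (REGIONS as readonly string[]).indexOf(row.region) !== -1;
  return {
    id: row.id,
    orgId: row.org_id,
    keywords: parseStringArray(row.keywords),
    brands: parseStringArray(row.brands),
    region: known ? (row.region as Region) : "all", // assumed fallback
    enabled: row.enabled === 1,
  };
}
```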

seen_links

| Column | Type | Notes |
|---|---|---|
| url | text PK | Deduplicated URL |
| org_id | text | Organization ID |
| first_seen | integer | When first discovered |
| last_checked | integer | When last checked for changes |
| content_hash | text | Hash for change detection |
| source | text | Collection source type |
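
The url, first_seen, last_checked, and content_hash columns suggest a simple lookup-and-compare step for each candidate link. The sketch below models that step against an in-memory Map rather than the database. The `Candidate` type, the `recordLink` function, and the three status values are hypothetical, and how the real service treats a changed link is not documented here.

```typescript
// Row shape mirroring the columns of the URL-tracking table.
interface SeenLink {
  url: string;
  org_id: string;
  first_seen: number;
  last_checked: number;
  content_hash: string;
  source: string;
}

type LinkStatus = "new" | "changed" | "unchanged";

// Hypothetical input: a freshly collected link with its content hash already computed.
interface Candidate {
  url: string;
  orgId: string;
  contentHash: string;
  source: string;
}

// Record a candidate in the store (keyed by URL, matching the primary key)
// and report whether it is new, changed since last check, or unchanged.
function recordLink(store: Map<string, SeenLink>, candidate: Candidate, now: number): LinkStatus {
  const existing = store.get(candidate.url);
  if (existing === undefined) {
    store.set(candidate.url, {
      url: candidate.url,
      org_id: candidate.orgId,
      first_seen: now,
      last_checked: now,
      content_hash: candidate.contentHash,
      source: candidate.source,
    });
    return "new";
  }
  const status: LinkStatus =
    existing.content_hash === candidate.contentHash ? "unchanged" : "changed";
  existing.last_checked = now; // first_seen is never overwritten
  existing.content_hash = candidate.contentHash;
  return status;
}
```

A caller would typically skip `unchanged` links and forward `new` ones; whether `changed` links are re-dispatched is a policy decision left open here.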

Routes

| Method | Path | Description |
|---|---|---|
| GET | / | Health check with service info |
| GET | /api/v1/configs/{orgId} | List source configs for an organization |
| POST | /api/v1/configs | Create a source config |
| PUT | /api/v1/configs/{id} | Update a source config |
| DELETE | /api/v1/configs/{id} | Delete a source config |
| GET | /api/v1/search-configs/{orgId} | List search configs for an organization |
| POST | /api/v1/search-configs | Create a search config |
| PUT | /api/v1/search-configs/{id} | Update a search config |
| DELETE | /api/v1/search-configs/{id} | Delete a search config |
| POST | /api/v1/collect/{orgId} | Manually trigger collection for an organization |
| GET | /docs | OpenAPI documentation |
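
As an illustration of how a client might address these routes, the helper below builds the method and URL for a few of them. It is a sketch under assumptions: the `https` scheme, the `RequestSpec` type, and the function names are hypothetical. Request bodies and authentication headers are not documented in this section, so they are left out.

```typescript
// Base URLs built from the documented domains; the https scheme is an assumption.
const BASE_URLS = {
  prod: "https://social-collector.crowai.dev",
  dev: "https://dev.social-collector.crowai.dev",
} as const;

type Environment = keyof typeof BASE_URLS;

// The two config resources share the same route layout.
type ConfigKind = "configs" | "search-configs";

// Hypothetical descriptor a caller can pass on to any HTTP client.
interface RequestSpec {
  method: "GET" | "POST" | "PUT" | "DELETE";
  url: string;
}

// GET /api/v1/{kind}/{orgId}
function listConfigs(env: Environment, kind: ConfigKind, orgId: string): RequestSpec {
  return { method: "GET", url: `${BASE_URLS[env]}/api/v1/${kind}/${encodeURIComponent(orgId)}` };
}

// DELETE /api/v1/{kind}/{id}
function deleteConfig(env: Environment, kind: ConfigKind, id: string): RequestSpec {
  return { method: "DELETE", url: `${BASE_URLS[env]}/api/v1/${kind}/${encodeURIComponent(id)}` };
}

// POST /api/v1/collect/{orgId}
function triggerCollection(env: Environment, orgId: string): RequestSpec {
  return { method: "POST", url: `${BASE_URLS[env]}/api/v1/collect/${encodeURIComponent(orgId)}` };
}
```

The returned descriptor can be handed to `fetch(spec.url, { method: spec.method, headers })`, with whatever headers the deployment requires.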

Cron Schedule

The service has a scheduled handler triggered every 2 hours (0 */2 * * *). On each invocation it:

  1. Queries all enabled source and search configs
  2. Groups configs by organization
  3. Runs search collection and direct collection for each org
  4. Dispatches collected items to the SOCIAL_PROCESSING_QUEUE
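
Steps 2 and 3 can be sketched as a grouping pass followed by a per-organization run that records failures without stopping. This is an illustration only: `planByOrg`, `runPlan`, and the `OrgWork` and `RunSummary` types are hypothetical names, the input rows are assumed to be already filtered to enabled configs, and running organizations concurrently is an assumption.

```typescript
// Any config row that carries an organization ID.
interface OrgScoped {
  org_id: string;
}

// Work collected for one organization: its source configs and its search configs.
interface OrgWork<S, Q> {
  sources: S[];
  searches: Q[];
}

// Group both config lists by org_id.
function planByOrg<S extends OrgScoped, Q extends OrgScoped>(
  sources: S[],
  searches: Q[],
): Map<string, OrgWork<S, Q>> {
  const plan = new Map<string, OrgWork<S, Q>>();
  const workFor = (orgId: string): OrgWork<S, Q> => {
    let work = plan.get(orgId);
    if (work === undefined) {
      work = { sources: [], searches: [] };
      plan.set(orgId, work);
    }
    return work;
  };
  for (const s of sources) workFor(s.org_id).sources.push(s);
  for (const q of searches) workFor(q.org_id).searches.push(q);
  return plan;
}

interface RunSummary {
  succeeded: string[];
  failed: string[];
}

// Run the collector for every org; an error in one org is caught and recorded
// so the remaining orgs still run.
async function runPlan<S, Q>(
  plan: Map<string, OrgWork<S, Q>>,
  collect: (orgId: string, work: OrgWork<S, Q>) => Promise<void>,
): Promise<RunSummary> {
  const entries = Array.from(plan.entries());
  const outcomes = await Promise.all(
    entries.map(async ([orgId, work]) => {
      try {
        await collect(orgId, work);
        return { orgId, ok: true };
      } catch {
        return { orgId, ok: false };
      }
    }),
  );
  return {
    succeeded: outcomes.filter((o) => o.ok).map((o) => o.orgId),
    failed: outcomes.filter((o) => !o.ok).map((o) => o.orgId),
  };
}
```

In a Worker, `runPlan` would be called from the `scheduled` handler and, for a single organization, from the manual trigger route.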

Environment Variables

| Variable | Example |
|---|---|
| ENVIRONMENT | dev |
| AI_GATEWAY_ID | crow-ai-gateway |

Secrets

| Secret | Purpose |
|---|---|
| TAVILY_API_KEY | Tavily web search API for keyword-based collection |
| SYSTEM_SECRET | System-level authentication |
| INTERNAL_GATEWAY_KEY | Internal service authentication |

Bindings

| Binding | Type | Name |
|---|---|---|
| DB | D1 | crow-social-collector-db |
| AI | Workers AI | AI inference (search query generation) |
| SOCIAL_PROCESSING_QUEUE | Queue | crow-social-processing-queue |
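
To show how the SOCIAL_PROCESSING_QUEUE binding might be used, the sketch below sends items in batches through `sendBatch`. The `ProcessingMessage` shape and the `dispatchItems` and `chunk` helpers are hypothetical, and `QueueProducer` is a structural stand-in for the Cloudflare Queue producer type that declares only the one method used. The batch size of 100 reflects the documented Cloudflare Queues per-call limit at the time of writing and should be checked against current limits.

```typescript
// Hypothetical message body; the real payload is not documented here.
interface ProcessingMessage {
  orgId: string;
  url: string;
}

// Structural stand-in for the Queue producer binding (only sendBatch is used).
interface QueueProducer<T> {
  sendBatch(messages: { body: T }[]): Promise<void>;
}

// Subset of the Worker environment relevant to dispatching.
interface Env {
  SOCIAL_PROCESSING_QUEUE: QueueProducer<ProcessingMessage>;
}

// Assumed per-call cap for sendBatch; verify against current Cloudflare limits.
const MAX_BATCH_SIZE = 100;

// Split a list into consecutive slices of at most `size` elements.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error("size must be at least 1");
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Send all items to the queue and return the number of batches sent.
async function dispatchItems(env: Env, items: ProcessingMessage[]): Promise<number> {
  const batches = chunk(items, MAX_BATCH_SIZE);
  for (const batch of batches) {
    await env.SOCIAL_PROCESSING_QUEUE.sendBatch(batch.map((body) => ({ body })));
  }
  return batches.length;
}
```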

Dependencies

  • Inbound: cron trigger, manual API calls
  • Outbound: Tavily API (search), social processor (via Queue)

Key Behaviors

  • Deduplication: Tracks seen URLs in the seen_links table to avoid re-processing
  • Per-org isolation: Each organization's collection runs independently; one failure does not block others
  • Platform support: Twitter, Reddit, Instagram, TikTok, LinkedIn, Facebook, YouTube, and news sources
  • Region filtering: Search configs can target specific regions (NA-EN, EU-Multi, AP-Multi, or all)

Deployment

cd core-social-collector
npx wrangler deploy # prod
npx wrangler deploy --env dev # dev