Harvesting in Ceres

Ceres harvests dataset metadata from open data portals and indexes it with vector embeddings for semantic search. Because portals can host tens of thousands of datasets, and embedding generation requires paid API calls, Ceres implements a two-tier optimization strategy that minimizes both portal API calls and embedding API costs.

Harvesting Flow Diagram

Incremental sync reduces portal calls, delta detection reduces embedding calls.

Since v0.3.0, the harvest pipeline streams datasets through the processing stages instead of loading all datasets into memory at once. This allows Ceres to handle very large portals (100k+ datasets) with a constant memory footprint. Batched embedding API calls further improve throughput by sending multiple texts per request.
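The batching half of this can be sketched as a small generator that chunks a stream of texts without ever materializing the whole portal in memory. The function name and default batch size here are illustrative, not Ceres's actual API:

```python
from itertools import islice
from typing import Iterable, Iterator

def batched(texts: Iterable[str], size: int = 100) -> Iterator[list[str]]:
    """Yield fixed-size batches from a stream of texts.

    Only one batch is held in memory at a time, so memory use stays
    constant regardless of how many datasets the portal hosts.
    """
    it = iter(texts)
    while batch := list(islice(it, size)):
        yield batch
```

Each yielded batch would then be sent as a single embedding API request, trading one round trip per text for one per batch.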

Incremental sync uses CKAN’s package_search API with a metadata_modified filter to fetch only datasets that have changed since the last successful sync. The last sync timestamp is stored in the portal_sync_status table.
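As a rough sketch of what such a request looks like: CKAN's package_search action passes fq through to Solr, so a date range filter on metadata_modified selects only recently changed datasets. The exact parameters Ceres sends are an assumption; fq, sort, rows, and start are standard package_search parameters:

```python
from urllib.parse import urlencode

def incremental_search_url(portal_url: str, last_sync_iso: str,
                           rows: int = 1000, start: int = 0) -> str:
    """Build a CKAN package_search URL filtered on metadata_modified.

    The Solr range query selects datasets modified at or after the last
    successful sync; sorting by metadata_modified makes paging stable.
    """
    params = {
        "fq": f"metadata_modified:[{last_sync_iso} TO *]",
        "sort": "metadata_modified asc",
        "rows": rows,
        "start": start,
    }
    return f"{portal_url}/api/3/action/package_search?{urlencode(params)}"
```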

On the first sync for a portal, or when --full-sync is passed, Ceres performs a full sync by listing all dataset IDs and fetching each one. If an incremental sync attempt fails (e.g., the portal doesn’t support the metadata_modified filter), Ceres automatically falls back to a full sync.

What it saves: portal API calls. Instead of fetching metadata for all 50,000 datasets on every run, only the handful that changed since the last sync are fetched.

Even when a dataset is fetched (because its metadata_modified timestamp changed), its embeddable content (title + description) may not have actually changed. For example, adding a tag or updating a resource URL changes metadata_modified but doesn’t affect what Ceres embeds.

Delta detection computes a SHA-256 hash of title + description (the content_hash) and compares it against the stored hash. If the hash matches, the embedding regeneration is skipped entirely.

What it saves: embedding API calls. In typical runs, 99%+ of datasets have unchanged content, saving significant API costs.

| Scenario | metadata_modified changed? | content_hash changed? | Action |
|---|---|---|---|
| Tag added to dataset | Yes | No | Fetch metadata, skip embedding |
| Resource URL updated | Yes | No | Fetch metadata, skip embedding |
| Title rewritten | Yes | Yes | Fetch metadata, regenerate embedding |
| New dataset published | N/A (new) | N/A (new) | Fetch metadata, generate embedding |
| Nothing changed | No | N/A (not fetched) | Not fetched at all |
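For datasets that are actually fetched, the matrix reduces to a small decision function (a sketch; the names and return values are illustrative):

```python
def sync_action(is_new: bool, hash_changed: bool) -> str:
    """Map delta-detection signals for a *fetched* dataset to an action.

    Datasets whose metadata_modified is unchanged are never fetched by
    an incremental sync, so they never reach this decision.
    """
    if is_new:
        return "generate_embedding"    # new dataset: Created
    if hash_changed:
        return "regenerate_embedding"  # embeddable content changed: Updated
    return "skip_embedding"            # only non-embeddable metadata changed
```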

Without incremental sync, every run would fetch all datasets from the portal. Without delta detection, every fetched dataset would trigger an embedding API call. Together, they minimize both sources of cost.

Each dataset processed during a sync receives one of these outcomes:

| Outcome | Meaning | Embedding generated? |
|---|---|---|
| Created | New dataset, not seen before | Yes |
| Updated | Content hash changed (title or description modified) | Yes |
| Unchanged | Content hash matches stored value | No |
| Failed | Error during processing | No |
| Skipped | Circuit breaker is open, dataset skipped | No |

These are tracked via SyncStats and reported at the end of each sync operation.

| Flag | Tier 1 (Incremental) | Tier 2 (Delta Detection) | Use case |
|---|---|---|---|
| (none) | Incremental if previous sync exists | Always active | Normal operation |
| --full-sync | Full sync forced | Still active | Re-scan portal after known issues |
| --dry-run | Dry run (no writes) | Still active | Preview what would happen |

Delta detection is always active regardless of flags. There is no flag to bypass it — if you need to force full re-embedding, delete the stored content hashes from the database.

The portal_sync_status table tracks sync history per portal:

| Column | Type | Purpose |
|---|---|---|
| portal_url | VARCHAR (PK) | Portal identifier |
| last_successful_sync | TIMESTAMPTZ | Timestamp used for next incremental sync |
| last_sync_mode | VARCHAR(20) | "full" or "incremental" |
| sync_status | VARCHAR(20) | "completed" or "cancelled" |
| datasets_synced | INTEGER | Number of datasets processed |
| updated_at | TIMESTAMPTZ | When this record was last updated |

The last_successful_sync value is set to the sync start time (not end time), ensuring no datasets are missed between syncs.
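A sketch of why recording the start time is safe, with hypothetical callables standing in for the portal client and database (none of these names are Ceres's actual API):

```python
from datetime import datetime, timezone

def run_incremental_sync(fetch_changed_since, upsert, last_sync, record_sync):
    """Capture the timestamp *before* fetching.

    A dataset modified while the sync is running has metadata_modified at
    or after started_at, so the next incremental run re-fetches it instead
    of silently missing it. At worst a dataset is fetched twice; it is
    never skipped.
    """
    started_at = datetime.now(timezone.utc)
    count = 0
    for dataset in fetch_changed_since(last_sync):
        upsert(dataset)
        count += 1
    record_sync(started_at, count)
    return started_at
```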

Content hashes are stored in the datasets table in the content_hash column (VARCHAR(64), nullable for backward compatibility with records indexed before delta detection was added).

The embedding API (Gemini) is protected by a circuit breaker to prevent cascading failures during harvesting:

Circuit Breaker Diagram

Closed, Open, Half-Open states with adaptive recovery timeout on rate limits.

  • Closed: requests flow normally
  • Open: all embedding requests are rejected immediately, datasets are recorded as Skipped
  • Half-Open: requests are allowed to probe recovery; 2 successes close the circuit, any failure reopens it

On HTTP 429 (rate limit), the recovery timeout is multiplied by a backoff factor (default 2x), up to a maximum of 5 minutes. Configuration is via environment variables: CB_FAILURE_THRESHOLD, CB_RECOVERY_TIMEOUT_SECS, CB_SUCCESS_THRESHOLD, CB_RATE_LIMIT_BACKOFF_MULTIPLIER, CB_MAX_RECOVERY_TIMEOUT_SECS.
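The state machine described above can be sketched roughly as follows. This is a simplified model, not Ceres's implementation; the constructor parameters mirror the environment variables listed, and time is injected to keep the sketch testable:

```python
import time

class CircuitBreaker:
    """Closed -> Open after N failures; Half-Open after a recovery timeout;
    Closed again after M probe successes. A rate-limited failure multiplies
    the recovery timeout by a backoff factor, up to a cap."""

    def __init__(self, failure_threshold=5, recovery_timeout=30.0,
                 success_threshold=2, backoff=2.0, max_timeout=300.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold
        self.backoff = backoff
        self.max_timeout = max_timeout
        self.state = "closed"
        self.failures = 0
        self.successes = 0
        self.opened_at = 0.0

    def allow_request(self, now=None):
        now = time.monotonic() if now is None else now
        if self.state == "open":
            if now - self.opened_at >= self.recovery_timeout:
                self.state = "half_open"   # let a probe through
                self.successes = 0
                return True
            return False  # reject immediately; caller records Skipped
        return True

    def record_success(self):
        if self.state == "half_open":
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = "closed"
                self.failures = 0

    def record_failure(self, rate_limited=False, now=None):
        now = time.monotonic() if now is None else now
        if rate_limited:
            self.recovery_timeout = min(self.recovery_timeout * self.backoff,
                                        self.max_timeout)
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = now
```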