
Harvesting in Ceres

Ceres is organized around harvesting first.

The primary job of the system is to pull dataset metadata from portal APIs, normalize it, and keep a local catalog synchronized over time. Embeddings are a separate stage that can run later through Ollama or a hosted provider.

Today the shipping portal clients cover:

  • CKAN portals
  • DCAT-AP portals that expose the udata REST JSON-LD catalog

Portal API -> PortalClient -> HarvestService -> DatasetStore
                                   |
                                   +-> sync history
                                   +-> stale detection
                                   +-> pending embeddings

DatasetStore (pending) -> EmbeddingService -> vectors for search

[Harvesting flow diagram] Incremental sync reduces portal calls; delta detection reduces optional embedding work.

The harvest pipeline streams datasets through processing stages instead of loading an entire catalog into memory. That keeps memory bounded even on very large portals.

Embedding is no longer part of the mandatory hot path. When enabled, it runs through a separate service and can batch texts according to provider capabilities.

When a portal supports modified-since querying, Ceres fetches only datasets changed since the last successful sync. The last sync timestamp is stored in portal_sync_status.

On the first sync for a portal, or when --full-sync is passed, Ceres performs a full sync. If incremental sync is unsupported or fails, the service falls back to a full sync automatically.

This is what keeps repeated harvests operationally cheap.
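The tiered decision above can be sketched as follows. The function and callback names are illustrative stand-ins for portal client calls, not Ceres's actual API:

```python
def run_sync(fetch_full, fetch_incremental, last_sync, force_full=False):
    """Decide between full and incremental sync, with automatic fallback.

    fetch_full / fetch_incremental stand in for portal client calls.
    Returns (datasets, mode) so the caller can record last_sync_mode.
    """
    if force_full or last_sync is None:
        # First sync for this portal, or --full-sync was passed.
        return fetch_full(), "full"
    try:
        # Fetch only datasets modified since the last successful sync.
        return fetch_incremental(last_sync), "incremental"
    except Exception:
        # Incremental unsupported or failed: fall back to a full sync.
        return fetch_full(), "full"
```

The key property is that the fallback is silent from the caller's perspective: a failed incremental attempt degrades to a full sync rather than a failed run.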

Even when a dataset is fetched, its embeddable content may not have changed. A portal can update tags, resources, or minor metadata without changing the text that would be embedded.

Delta detection computes a SHA-256 hash of the title plus description (the content_hash) and compares it against the stored hash. If the hashes match, embedding regeneration is skipped entirely.
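In sketch form (the exact concatenation and separator of the hashed fields are assumptions, not Ceres's verified format):

```python
import hashlib

def content_hash(title: str, description: str) -> str:
    """SHA-256 over the embeddable text (title + description)."""
    text = f"{title}\n{description}"
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def needs_embedding(stored_hash, title, description) -> bool:
    """True when the embeddable text changed, or was never hashed."""
    return stored_hash is None or stored_hash != content_hash(title, description)
```

Because tags and resource URLs never enter the hash, edits to them cannot trigger re-embedding.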

This matters most when you run the optional embedding stage, whether locally through Ollama or through a hosted provider.

| Scenario | metadata_modified changed? | content_hash changed? | Action |
|---|---|---|---|
| Tag added to dataset | Yes | No | Fetch metadata, skip embedding |
| Resource URL updated | Yes | No | Fetch metadata, skip embedding |
| Title rewritten | Yes | Yes | Fetch metadata, mark for embedding |
| New dataset published | N/A (new) | N/A (new) | Fetch metadata, mark for embedding |
| Nothing changed | No | N/A (not fetched) | Not fetched at all |

Without incremental sync, every run would fetch the full portal. Without delta detection, every changed record would be re-embedded even when the relevant text stayed the same.

Each dataset processed during a sync receives one of these outcomes:

| Outcome | Meaning | Embedding generated? |
|---|---|---|
| Created | New dataset, not seen before | Marked pending |
| Updated | Content hash changed (title or description modified) | Marked pending |
| Unchanged | Content hash matches stored value | No |
| Failed | Error during processing | No |
| Skipped | Embedding step was skipped or the circuit breaker is open | No |

These are tracked via SyncStats and reported at the end of each sync operation.
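A minimal tally of these outcomes might look like the following; the class shape is illustrative, and Ceres's actual SyncStats may differ:

```python
from collections import Counter

class SyncStats:
    """Illustrative per-sync outcome tally."""
    OUTCOMES = {"created", "updated", "unchanged", "failed", "skipped"}

    def __init__(self):
        self.counts = Counter()

    def record(self, outcome: str):
        if outcome not in self.OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome}")
        self.counts[outcome] += 1

    def pending_embeddings(self) -> int:
        # Only Created and Updated datasets are marked pending.
        return self.counts["created"] + self.counts["updated"]
```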

| Flag | Tier 1 (Incremental) | Tier 2 (Delta Detection) | Use case |
|---|---|---|---|
| (none) | Incremental if previous sync exists | Always active | Normal operation |
| --full-sync | Full sync forced | Still active | Re-scan portal after known issues |
| --dry-run | Dry run (no writes) | Still active | Preview what would happen |
| --metadata-only | Same as default | Skipped (no embedding) | Harvest without API key |

Delta detection is always active regardless of flags. There is no flag to bypass it — if you need to force full re-embedding, delete the stored content hashes from the database.

Metadata-only mode is the normal harvest path


--metadata-only is not a degraded mode. It is the cleanest way to operate Ceres when your immediate goal is harvesting and synchronization.

Use it when you want to:

  • build the catalog before choosing an embedding provider
  • run fully locally without any vector generation
  • separate crawl operations from search operations
  • backfill vectors later with ceres embed

The portal_sync_status table tracks sync history per portal:

| Column | Type | Purpose |
|---|---|---|
| portal_url | VARCHAR (PK) | Portal identifier |
| last_successful_sync | TIMESTAMPTZ | Timestamp used for next incremental sync |
| last_sync_mode | VARCHAR(20) | "full" or "incremental" |
| sync_status | VARCHAR(20) | "completed" or "cancelled" |
| datasets_synced | INTEGER | Number of datasets processed |
| updated_at | TIMESTAMPTZ | When this record was last updated |

The last_successful_sync value is set to the sync start time (not end time), ensuring no datasets are missed between syncs.
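The reason for recording the start time: a dataset modified while the sync is running would fall into a gap if the end time were stored. A sketch of that bookkeeping, using a plain dict in place of the portal_sync_status table:

```python
from datetime import datetime, timezone

def record_successful_sync(store: dict, portal_url: str, run) -> int:
    # Capture the timestamp BEFORE harvesting begins: anything modified
    # during the run is still newer than this value, so the next
    # incremental sync re-fetches it instead of missing it.
    started_at = datetime.now(timezone.utc)
    count = run()  # run() stands in for the actual harvest
    store[portal_url] = {
        "last_successful_sync": started_at,
        "sync_status": "completed",
        "datasets_synced": count,
    }
    return count
```

The trade-off is a small amount of re-fetching (datasets modified mid-run are seen twice), which is harmless because delta detection makes the second pass cheap.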

Content hashes are stored in the datasets table in the content_hash column (VARCHAR(64), nullable for backward compatibility with records indexed before delta detection was added).

Embeddings are processed later by EmbeddingService:

  • HarvestService writes datasets and tracks changes
  • EmbeddingService reads pending rows and generates vectors
  • HarvestPipeline composes both when you want the combined workflow

This split lets you harvest regardless of embedding availability and makes Ollama a practical local-first default.

When embeddings are enabled, the embedding provider is protected by a circuit breaker to avoid cascading failures:

[Circuit breaker diagram] Closed, Open, and Half-Open states with adaptive recovery timeout on rate limits.

  • Closed: requests flow normally
  • Open: all embedding requests are rejected immediately, datasets are recorded as Skipped
  • Half-Open: requests are allowed to probe recovery; 2 successes close the circuit, any failure reopens it

On HTTP 429, the recovery timeout is multiplied by a backoff factor, up to a configured maximum.
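The state machine can be sketched as below. Thresholds and defaults are illustrative, not Ceres's configured values; only the transitions described above (open on repeated failures, two half-open successes to close, timeout growth on 429) are taken from the text:

```python
import time

class CircuitBreaker:
    """Minimal Closed -> Open -> Half-Open sketch with adaptive backoff."""

    def __init__(self, failure_threshold=5, recovery_timeout=30.0,
                 backoff_factor=2.0, max_timeout=600.0):
        self.state = "closed"
        self.failures = 0
        self.successes = 0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.backoff_factor = backoff_factor
        self.max_timeout = max_timeout
        self.opened_at = None

    def allow_request(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        if self.state == "open":
            if now - self.opened_at >= self.recovery_timeout:
                self.state = "half_open"   # allow probes to test recovery
                self.successes = 0
                return True
            return False                   # rejected -> dataset recorded Skipped
        return True

    def record_success(self):
        if self.state == "half_open":
            self.successes += 1
            if self.successes >= 2:        # 2 successes close the circuit
                self.state = "closed"
                self.failures = 0
        else:
            self.failures = 0

    def record_failure(self, rate_limited=False, now=None):
        if rate_limited:
            # HTTP 429: grow the recovery timeout up to the configured max.
            self.recovery_timeout = min(
                self.recovery_timeout * self.backoff_factor, self.max_timeout)
        if self.state == "half_open":
            self._open(now)                # any half-open failure reopens
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self._open(now)

    def _open(self, now=None):
        self.state = "open"
        self.opened_at = time.monotonic() if now is None else now
```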

After a successful full sync with zero failures and zero skipped datasets, Ceres marks datasets that no longer exist on the portal as stale. This uses an efficient exclusion-based approach: all datasets whose original_id is NOT in the set of IDs seen during the sync are marked is_stale = TRUE.

Stale datasets are:

  • Excluded from semantic search (WHERE NOT is_stale)
  • Excluded from pending embeddings (via partial index)
  • Not deleted — soft-marked so they can be recovered if the portal re-publishes them

Stale detection only runs on full syncs because incremental syncs fetch only modified datasets and cannot definitively determine which datasets have been removed.
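The exclusion step amounts to a single UPDATE. This sketch uses SQLite for illustration (the real store may differ), and a portal_url column on datasets is an assumption:

```python
import sqlite3

def mark_stale(conn, portal_url, seen_ids):
    """Mark every dataset for this portal whose original_id was NOT seen
    during the completed full sync as stale (soft delete, recoverable)."""
    placeholders = ",".join("?" for _ in seen_ids)
    conn.execute(
        f"UPDATE datasets SET is_stale = 1 "
        f"WHERE portal_url = ? AND original_id NOT IN ({placeholders})",
        [portal_url, *seen_ids],
    )
```

Because rows are only flagged, a dataset the portal re-publishes can be revived by clearing is_stale on the next sync instead of being re-created.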

The exact harvest behavior depends on the portal client:

  • CKAN clients can use modified-since filters and adaptive page sizing
  • DCAT udata clients stream paginated JSON-LD catalog pages and resolve multilingual fields according to the configured language

The CKAN client uses adaptive page size reduction to handle portals that truncate or timeout on large responses:

  • Initial page size: 1000 rows
  • On Timeout or NetworkError: quarters the page size (1000 → 250 → 62 → 15 → 10)
  • Minimum page size: 10 rows
  • On other errors (rate limits, client errors): no reduction, error propagated normally

This converges faster than halving and handles portals with resource-heavy datasets at specific offsets.
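The quartering schedule reduces to a one-line helper; this is a sketch, and the real client also inspects the error type before deciding to reduce:

```python
MIN_PAGE_SIZE = 10

def next_page_size(current: int) -> int:
    """Quarter the page size after a timeout or network error,
    never dropping below the 10-row minimum."""
    return max(current // 4, MIN_PAGE_SIZE)
```

Starting from 1000 rows, repeated reductions produce 250, 62, 15, then the 10-row floor, matching the sequence above.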