
Ceres

Harvest-first toolkit for open data portals

Ceres harvests metadata from open data portals and keeps a local catalog in sync over time. Embeddings, semantic search, and exports sit on top as optional layers.

Named after the Roman goddess of harvest and agriculture.

Harvesting Is The Hard Part

Open data work usually breaks on synchronization, portal quirks, and stale records before search even starts.

Portals Are Fragmented

CKAN and DCAT portals expose different capabilities, languages, and reliability profiles.

Embedding Should Be Optional

Many teams need a trustworthy harvested catalog first, then decide later whether to add local or hosted embeddings.

Ceres is designed around that order of work: harvest first, embed later if useful, search once the catalog is ready.

Open Data Galaxy: ML-generated visualization
The published Open Data Index is one downstream product of the pipeline: harvest and normalize first, then optionally embed, search, export, and visualize.

Harvest First

Stream metadata from CKAN and DCAT udata portals into PostgreSQL, track sync history, detect stale datasets, and keep the catalog current even when embeddings are disabled.
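As a rough illustration of stale detection, the sketch below flags records that the latest sync did not see and that have been missing for longer than a grace period. The function name `mark_stale`, the `last_seen` and `stale` fields, and the seven-day grace period are assumptions made for this example, not Ceres's actual schema or defaults.

```python
from datetime import datetime, timedelta, timezone


def mark_stale(catalog, seen_ids, now, grace=timedelta(days=7)):
    """Flag records missing from the latest sync for longer than `grace`.

    `catalog` maps a dataset id to a dict with `last_seen` and `stale` keys
    (hypothetical field names). Returns the sorted ids newly marked stale.
    """
    stale = []
    for dataset_id, record in catalog.items():
        if dataset_id in seen_ids:
            # The portal still lists this dataset: refresh it.
            record["last_seen"] = now
            record["stale"] = False
        elif now - record["last_seen"] > grace:
            # Missing for longer than the grace period: mark, do not delete.
            record["stale"] = True
            stale.append(dataset_id)
    return sorted(stale)


now = datetime(2025, 1, 15, tzinfo=timezone.utc)
catalog = {
    "air-quality": {"last_seen": now - timedelta(days=1), "stale": False},
    "bus-stops": {"last_seen": now - timedelta(days=30), "stale": False},
    "budget-2019": {"last_seen": now - timedelta(days=2), "stale": False},
}
stale_ids = mark_stale(catalog, seen_ids={"air-quality"}, now=now)
print(stale_ids)  # ['bus-stops']
```

Marking rather than deleting keeps the record available if the portal was only temporarily unreachable or the dataset reappears in a later sync.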

Decoupled Pipeline

Harvesting and embedding are separate services. Run metadata-only syncs, backfill embeddings later, or switch providers without re-harvesting your sources.
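One way to picture the decoupling: an embedding pass only needs the records already in the catalog, so it can skip anything embedded with the current model and re-embed everything when the model changes. The sketch below is a minimal in-memory version of that idea. The `backfill` function, the `embedding_model` field, the model names, and `toy_embed` (a stand-in for a real provider call) are all hypothetical.

```python
def backfill(records, model, embed):
    """Embed every record not yet embedded with `model`; return updated ids."""
    updated = []
    for record in records:
        if record.get("embedding_model") == model:
            continue  # already embedded with this model, nothing to do
        record["embedding"] = embed(record["title"])
        record["embedding_model"] = model
        updated.append(record["id"])
    return updated


def toy_embed(text):
    # Stand-in for a real embedding call; returns a tiny deterministic vector.
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


records = [
    {"id": "a", "title": "Air quality", "embedding_model": "model-a",
     "embedding": [0.0, 0.0]},
    {"id": "b", "title": "Bus stops"},
    {"id": "c", "title": "City budget"},
]

first_pass = backfill(records, "model-a", toy_embed)   # only the gaps
second_pass = backfill(records, "model-b", toy_embed)  # provider switch
print(first_pass, second_pass)  # ['b', 'c'] ['a', 'b', 'c']
```

Neither pass touches the portals: the harvested titles are read from the catalog, which is what makes a provider switch cheap.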

Local Embeddings By Default

When you do want vectors, Ollama gives you a local, zero-cost path. Gemini and OpenAI remain supported, but the project no longer assumes cloud embeddings are required.

Operations Layer

  • CLI: harvest, embed, search, export, stats
  • API: Axum server with Swagger UI
  • Jobs: database-backed harvest queue with retries
  • Exports: JSON, JSONL, CSV, and curated Parquet
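To make "harvest queue with retries" concrete, here is a small in-memory sketch of retry bookkeeping with exponential backoff. It is not the Ceres implementation: the job fields, the `run_job` and `backoff_seconds` helpers, the attempt limit, and the delay values are assumptions chosen for illustration, and a real queue would persist this state in the database.

```python
def backoff_seconds(attempt, base=30, cap=3600):
    """Exponential backoff: 30s, 60s, 120s, ... capped at one hour."""
    return min(cap, base * 2 ** (attempt - 1))


def run_job(job, handler, max_attempts=3):
    """Run one attempt of `job` and record the outcome on the job dict."""
    job["attempts"] += 1
    try:
        handler(job["portal"])
    except Exception as exc:
        job["last_error"] = str(exc)
        if job["attempts"] >= max_attempts:
            job["status"] = "failed"   # give up, keep the error for inspection
        else:
            job["status"] = "pending"  # eligible for another attempt
            job["retry_in"] = backoff_seconds(job["attempts"])
    else:
        job["status"] = "done"
    return job


calls = {"n": 0}


def flaky_harvest(portal):
    # Simulated portal that times out twice before responding.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("portal timed out")


job = {"portal": "https://example.org", "attempts": 0, "status": "pending"}
delays = []
while job["status"] == "pending":
    run_job(job, flaky_harvest)
    if job["status"] == "pending":
        delays.append(job["retry_in"])

print(job["status"], job["attempts"], delays)  # done 3 [30, 60]
```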

Ceres Architecture Diagram

Step 1: Harvest

Use ceres harvest to populate the catalog, optionally in metadata-only mode. Incremental sync, delta detection, and stale marking reduce churn and keep memory bounded.
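A common way to implement delta detection is to hash a canonical form of each dataset's metadata and compare it with the hash stored from the previous sync. The sketch below shows that general technique. Whether Ceres computes its deltas exactly this way is not stated here, and `content_hash` and `classify` are hypothetical names.

```python
import hashlib
import json


def content_hash(metadata):
    """SHA-256 of the metadata serialized with sorted keys, so that key
    order in the portal's response does not register as a change."""
    canonical = json.dumps(
        metadata, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def classify(stored_hashes, incoming):
    """Split incoming datasets into created, updated, and unchanged ids,
    updating `stored_hashes` in place for the next sync."""
    created, updated, unchanged = [], [], []
    for dataset_id, metadata in incoming.items():
        digest = content_hash(metadata)
        previous = stored_hashes.get(dataset_id)
        if previous is None:
            created.append(dataset_id)
        elif previous != digest:
            updated.append(dataset_id)
        else:
            unchanged.append(dataset_id)
        stored_hashes[dataset_id] = digest
    return created, updated, unchanged


stored = {}
first = classify(stored, {
    "a": {"title": "A", "tags": ["x"]},
    "b": {"title": "B"},
})
second = classify(stored, {
    "a": {"tags": ["x"], "title": "A"},  # same content, different key order
    "b": {"title": "B v2"},              # changed
    "c": {"title": "C"},                 # new
})
print(first)   # (['a', 'b'], [], [])
print(second)  # (['c'], ['b'], ['a'])
```

Only the created and updated ids need to be written or re-embedded, which is where the reduction in churn comes from.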

Step 2: Embed

Run ceres embed when you want vectors. Ollama is the preferred local option; hosted providers are still available for teams that want them.

Step 3: Search Or Export

Search, API access, HuggingFace exports, and downstream analytics all build on the same harvested catalog.

Now (v0.3.5)

Harvest-first sync pipeline, CKAN plus DCAT udata support, standalone embedding passes, local Ollama support, job-driven API operations, and export flows for downstream publishing.

Next (v0.4.0)

Scale and ecosystem tuning: HNSW optimization, more portal client coverage, stronger multi-tenant operations, and better downstream dataset publishing ergonomics.