Ceres

Harvest-first toolkit for open data portals

Ceres harvests metadata from open data portals and keeps a local catalog in sync over time. Embeddings, semantic search, and exports sit on top as optional layers.

Named after the Roman goddess of harvest and agriculture.

Why Ceres

Harvesting Is The Hard Part

Open data work usually breaks on synchronization, portal quirks, and stale records before search even starts.

Portals Are Fragmented

CKAN and DCAT portals expose different capabilities, languages, and reliability profiles.

Embedding Should Be Optional

Many teams need a trustworthy harvested catalog first, then decide later whether to add local or hosted embeddings.

Ceres is designed around that order of work: harvest first, embed later if useful, search once the catalog is ready.

Figure: Open Data Galaxy (ML-generated visualization)

The published Open Data Index is one downstream product of the pipeline: harvest and normalize first, then optionally embed, search, export, and visualize.


Features and Scale

Harvest First

Stream metadata from CKAN, DCAT udata, and SPARQL-backed DCAT portals into PostgreSQL, track sync history, detect stale datasets, and keep the catalog current even when embeddings are disabled.

Decoupled Pipeline

Harvesting and embedding are separate services. Run metadata-only syncs, backfill embeddings later, or switch providers without re-harvesting your sources.
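One way to picture the decoupling is a backfill check that looks only at rows already in the catalog, so changing the embedding model never triggers a new harvest. The sketch below is illustrative rather than Ceres's actual schema: the `CatalogRow` fields, `needs_embedding`, and `backfill_queue` are hypothetical names, and the model names are example values only.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class CatalogRow:
    """Hypothetical view of one harvested dataset (not the real Ceres schema)."""
    dataset_id: str
    content_hash: str                      # hash of the harvested metadata
    embedded_hash: Optional[str] = None    # hash the stored vector was computed from
    embedding_model: Optional[str] = None  # model that produced the stored vector


def needs_embedding(row: CatalogRow, target_model: str) -> bool:
    """True when the row has no vector, a vector from another model, or a stale vector."""
    if row.embedded_hash is None:
        return True
    if row.embedding_model != target_model:
        return True
    return row.embedded_hash != row.content_hash


def backfill_queue(rows: Iterable[CatalogRow], target_model: str) -> List[str]:
    """Return the dataset ids that an embedding pass should process."""
    return [row.dataset_id for row in rows if needs_embedding(row, target_model)]


rows = [
    CatalogRow("a", "h1"),                              # harvested, never embedded
    CatalogRow("b", "h2", "h2", "nomic-embed-text"),    # up to date
    CatalogRow("c", "h3", "h0", "nomic-embed-text"),    # metadata changed since embedding
]
pending_same_model = backfill_queue(rows, "nomic-embed-text")
pending_new_model = backfill_queue(rows, "text-embedding-3-small")
```

With the same model, only the never-embedded and changed rows are queued. With a different model, every row is queued, and the harvested metadata itself is left untouched.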

Local Embeddings By Default

When you do want vectors, Ollama gives you a local path with no per-request API fees. Gemini and OpenAI remain supported, but the project no longer assumes cloud embeddings are required.

Operations Layer

CLI: harvest, embed, search, export, stats
API: Axum server with Swagger UI
Jobs: database-backed harvest queue with retries
Exports: JSON, JSONL, CSV, and curated Parquet
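To make "database-backed harvest queue with retries" concrete, here is a minimal sketch of retry bookkeeping with exponential backoff. It is an assumption about how such a queue could behave, not a description of Ceres's job table: `HarvestJob`, `record_failure`, `is_due`, the 30-second base delay, and the five-attempt limit are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class HarvestJob:
    """Hypothetical queue row; in a real system this would live in a database table."""
    portal_url: str
    attempts: int = 0
    status: str = "pending"               # "pending" or "failed"
    run_after: Optional[datetime] = None  # earliest time the job may run again


def record_failure(job: HarvestJob, now: datetime, max_attempts: int = 5,
                   base_delay: timedelta = timedelta(seconds=30)) -> HarvestJob:
    """Count a failed attempt and either reschedule with backoff or give up."""
    job.attempts += 1
    if job.attempts >= max_attempts:
        job.status = "failed"
        job.run_after = None
    else:
        # The delay doubles with each failed attempt: 30s, 60s, 120s, ...
        job.status = "pending"
        job.run_after = now + base_delay * 2 ** (job.attempts - 1)
    return job


def is_due(job: HarvestJob, now: datetime) -> bool:
    """A worker would only pick up pending jobs whose backoff window has passed."""
    return job.status == "pending" and (job.run_after is None or job.run_after <= now)


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
job = HarvestJob(portal_url="https://example.org/ckan")
record_failure(job, now)
first_retry_at = job.run_after
record_failure(job, now)
second_retry_at = job.run_after
```

Keeping `attempts` and `run_after` on the job row means a restarted worker can resume the schedule without any in-memory state.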

Architecture

Ceres Architecture Diagram

Operational Model

Step 1: Harvest

Use ceres harvest to populate the catalog, optionally in metadata-only mode. Incremental sync, delta detection, and stale marking reduce churn and keep memory bounded.
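Delta detection and stale marking can be sketched with a content hash per dataset. This is a minimal illustration under stated assumptions, not the actual Ceres implementation: `content_hash` and `plan_sync` are hypothetical helpers, and the stored hashes are assumed to be available as a plain mapping from dataset id to hash.

```python
import hashlib
import json
from typing import Dict, Iterable, List, Tuple


def content_hash(metadata: dict) -> str:
    """Hash a canonical JSON form so key order does not cause false deltas."""
    canonical = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def plan_sync(stored_hashes: Dict[str, str],
              harvested: Iterable[Tuple[str, dict]]) -> Dict[str, List[str]]:
    """Classify harvested records against the stored catalog.

    `harvested` is consumed one record at a time, so only dataset ids
    (not full metadata) are held in memory.
    """
    plan: Dict[str, List[str]] = {"created": [], "updated": [], "unchanged": [], "stale": []}
    seen = set()
    for dataset_id, metadata in harvested:
        seen.add(dataset_id)
        previous = stored_hashes.get(dataset_id)
        if previous is None:
            plan["created"].append(dataset_id)
        elif previous != content_hash(metadata):
            plan["updated"].append(dataset_id)
        else:
            plan["unchanged"].append(dataset_id)
    # Anything in the catalog that the portal no longer lists is marked stale.
    plan["stale"] = sorted(set(stored_hashes) - seen)
    return plan


stored = {
    "air-quality": content_hash({"title": "Air quality", "license": "CC-BY"}),
    "bus-stops": content_hash({"title": "Bus stops"}),
    "old-census": content_hash({"title": "Census 2001"}),
}
portal_records = iter([
    ("air-quality", {"license": "CC-BY", "title": "Air quality"}),  # same content, new key order
    ("bus-stops", {"title": "Bus stops (2024)"}),                   # title changed
    ("bike-lanes", {"title": "Bike lanes"}),                        # new dataset
])
plan = plan_sync(stored, portal_records)
```

Only the `created` and `updated` ids would need to be written, which is where the reduction in churn comes from in this sketch.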

Step 2: Embed

Run ceres embed when you want vectors. Ollama is the preferred local option; hosted providers are still available for teams that want them.

Step 3: Search Or Export

Search, API access, Hugging Face exports, and downstream analytics all build on the same harvested catalog.

Roadmap

Now (v0.4.0)

Performance at scale and ecosystem consolidation: tuned HNSW vector index, dynamic ef_search, SPARQL-backed DCAT harvesting (e.g. data.europa.eu), more resilient batch embedding under provider failures, and internal refactors for maintainability.
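As an illustration of what "more resilient batch embedding under provider failures" can mean, the sketch below splits a failing batch in half and retries each half, so one bad input costs a single record rather than the whole batch. It is one possible strategy, not necessarily the one Ceres uses. `embed_resilient` and `fake_provider` are hypothetical, and `embed_batch` stands in for any provider call.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def embed_resilient(texts: Sequence[str],
                    embed_batch: Callable[[Sequence[str]], List[Vector]]
                    ) -> List[Optional[Vector]]:
    """Embed texts in order; inputs that keep failing on their own yield None."""
    if not texts:
        return []
    try:
        return list(embed_batch(texts))
    except Exception:
        if len(texts) == 1:
            return [None]  # isolate the single failing input
        mid = len(texts) // 2
        # Retry each half separately so healthy inputs still get vectors.
        return (embed_resilient(texts[:mid], embed_batch)
                + embed_resilient(texts[mid:], embed_batch))


def fake_provider(batch: Sequence[str]) -> List[Vector]:
    """Stand-in provider for the example: rejects any batch containing 'bad'."""
    if "bad" in batch:
        raise RuntimeError("provider rejected batch")
    return [[float(len(text))] for text in batch]


vectors = embed_resilient(["a", "bad", "c"], fake_provider)
```

A production version would likely separate transient errors (worth retrying as-is) from permanent ones (worth splitting). This sketch treats every exception the same way.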

Next (v0.5.0)

Broader portal coverage (Socrata), multi-tenant operations, webhooks, and a Parquet export endpoint on the REST API.