Ceres harvests metadata from CKAN open data portals and indexes them with vector embeddings, enabling semantic search across fragmented data sources.
Named after the Roman goddess of harvest and agriculture.
Keyword Match Fails
“Public transport” won’t find datasets tagged as “mobility data” or “bus schedules”.
Fragmented Ecosystems
Italy alone has more than 20 regional portals, each with a completely different interface.
No Cross-Querying
There is no way to run a single query across Milano, Roma, and Napoli datasets at once.
Ceres solves this by creating a unified semantic index. You search once, across all harvested portals, by meaning!
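To illustrate what searching "by meaning" involves, here is a minimal sketch of ranking datasets by cosine similarity between embedding vectors. The three-dimensional vectors, the dataset titles, and the `search` helper are all made up for illustration; real embeddings come from a model and have hundreds of dimensions, and this is not Ceres's actual code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy index: hand-made vectors standing in for real embeddings.
INDEX = {
    "Bus schedules (Milano)": [0.9, 0.1, 0.0],
    "Mobility data (Roma)": [0.8, 0.3, 0.1],
    "Air quality sensors (Napoli)": [0.1, 0.9, 0.2],
}


def search(query_vector, index, top_k=2):
    """Return the titles of the top_k entries most similar to the query."""
    scored = [(cosine_similarity(query_vector, vector), title)
              for title, vector in index.items()]
    scored.sort(reverse=True)
    return [title for _, title in scored[:top_k]]


# Pretend this vector is the embedding of the query "public transport".
query = [1.0, 0.2, 0.0]
results = search(query, INDEX)
```

Neither result title contains the words "public transport"; both rank highly only because their vectors point in a similar direction to the query vector.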

Fetch datasets from any CKAN-compatible portal, including multilingual ones. Harvest multiple portals simultaneously, configured in portals.toml. A recoverable, fault-tolerant job queue is built in.
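As a rough sketch of what harvesting a CKAN portal looks like, the snippet below pages through the standard CKAN Action API endpoint package_search using its start and rows parameters. The helper names `ckan_page_fetcher` and `iter_datasets` are hypothetical, and the sketch leaves out the retry and job-queue logic a real harvester needs; it is not how Ceres itself is implemented.

```python
import json
import urllib.parse
import urllib.request


def ckan_page_fetcher(base_url):
    """Build a function that fetches one page of package_search results."""
    def fetch_page(start, rows):
        query = urllib.parse.urlencode({"start": start, "rows": rows})
        url = f"{base_url.rstrip('/')}/api/3/action/package_search?{query}"
        with urllib.request.urlopen(url, timeout=30) as response:
            # CKAN wraps the payload as {"success": ..., "result": {...}}.
            return json.load(response)["result"]
    return fetch_page


def iter_datasets(fetch_page, page_size=100):
    """Yield datasets one at a time, requesting pages lazily."""
    start = 0
    while True:
        result = fetch_page(start, page_size)
        datasets = result["results"]
        if not datasets:
            break
        yield from datasets
        start += len(datasets)
        # "count" is the total number of matching datasets on the portal.
        if start >= result["count"]:
            break
```

Passing `ckan_page_fetcher("https://example.org")` (a placeholder URL) to `iter_datasets` would walk a whole portal one page at a time. Because `iter_datasets` is a generator, only one page of results is held in memory at any moment.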
A memory-efficient streaming pipeline handles massive portals (100k+ datasets). Built-in Smart Delta Detection skips embedding unchanged entities, yielding 99.8% API cost savings per run.
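One common way to detect unchanged records is to hash the text that would be embedded and compare it with the hash stored from the previous run. The sketch below assumes that approach and assumes the hashed text is the CKAN title and notes fields; the function names and the in-memory `stored_hashes` dictionary are hypothetical, and Ceres may detect changes differently.

```python
import hashlib


def content_hash(dataset):
    """SHA-256 of the text fields that would be sent to the embedding API."""
    text = f"{dataset.get('title', '')}\n{dataset.get('notes', '')}"
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def select_changed(datasets, stored_hashes):
    """Yield only new or modified datasets, updating stored_hashes in place."""
    for dataset in datasets:
        digest = content_hash(dataset)
        if stored_hashes.get(dataset["id"]) != digest:
            stored_hashes[dataset["id"]] = digest
            yield dataset
```

On a repeat harvest where almost nothing has changed, almost every dataset is filtered out before any embedding request is made, which is where the cost savings come from.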
Currently powers an index of 354,000+ datasets across 25 pre-verified portals (from Australia to Italy to HDX). Everything is exposed through an Axum REST API documented with OpenAPI.
pgvector
Multi-portal streaming harvests, Gemini / OpenAI embeddings, advanced CLI, PostgreSQL + pgvector, Axum REST API, CSV/JSON/Parquet export.
Scale and ecosystem tuning: production HNSW index tuning, multi-tenancy support, local embeddings via Ollama, schema-level search, Socrata/DCAT-AP expansion.