Ceres

Semantic search engine for open data portals

Ceres harvests metadata from CKAN open data portals and indexes them with vector embeddings, enabling semantic search across fragmented data sources.

Named after the Roman goddess of harvest and agriculture.

Keyword Match Fails

“Public transport” won’t find datasets tagged as “mobility data” or “bus schedules”.

Fragmented Ecosystems

Italy alone has more than 20 regional portals, each with a different interface.

No Cross-Querying

There is no way to run a single query across the Milano, Roma, and Napoli datasets at once.

Ceres solves this by building a unified semantic index: you search once, by meaning, across all harvested portals.
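
To make "search by meaning" concrete, here is a minimal sketch of ranking indexed datasets by cosine similarity between a query embedding and each dataset embedding. It is an illustration under assumptions, not Ceres's actual code: the `IndexedDataset` struct and the `rank` function are hypothetical names, the vectors are expected to come from an embedding model, and a real deployment would more likely let pgvector do the nearest-neighbour search inside PostgreSQL.

```rust
/// Hypothetical in-memory record; the field names are illustrative only.
struct IndexedDataset {
    portal: &'static str,
    title: &'static str,
    embedding: Vec<f32>,
}

/// Cosine similarity of two vectors.
/// Returns 0.0 when the lengths differ or either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Score every dataset against the query and keep the `top_k` best matches,
/// regardless of which portal each dataset came from.
fn rank<'a>(
    query: &[f32],
    index: &'a [IndexedDataset],
    top_k: usize,
) -> Vec<(&'a IndexedDataset, f32)> {
    let mut scored: Vec<(&IndexedDataset, f32)> = index
        .iter()
        .map(|d| (d, cosine_similarity(query, &d.embedding)))
        .collect();
    // Highest similarity first.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(top_k);
    scored
}
```

Because every portal's datasets live in the same vector space, a single call to `rank` covers all of them; the `portal` field only tells you where each hit came from.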

Open Data Galaxy — ML-generated visualization
354,000+ datasets (about 270,000 after deduplication) from 22 portals, embedded with all-MiniLM-L6-v2, projected to 3D with UMAP, and clustered with HDBSCAN. Each color marks a portal; nearby points are semantically similar.

Enterprise Harvester

Fetch datasets from any CKAN-compatible portal, including multilingual ones. Harvest multiple portals simultaneously from a portals.toml configuration. A recoverable, fault-tolerant job queue is built in.
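
As a rough illustration of what harvesting a CKAN portal involves, the sketch below builds the paged request URLs for CKAN's `package_search` action, which accepts `rows` and `start` parameters. Whether Ceres uses this particular endpoint is an assumption, and `package_search_pages` is a hypothetical helper; the sketch only builds URLs and does not perform any HTTP requests.

```rust
/// Build one `package_search` URL per page so that the pages together
/// cover `total` datasets. `page_size` must be greater than zero.
fn package_search_pages(base_url: &str, total: usize, page_size: usize) -> Vec<String> {
    assert!(page_size > 0, "page_size must be positive");
    // Tolerate a trailing slash on the portal URL.
    let base = base_url.trim_end_matches('/');
    (0..total)
        .step_by(page_size)
        .map(|start| {
            format!("{base}/api/3/action/package_search?rows={page_size}&start={start}")
        })
        .collect()
}
```

Fetching one page at a time like this keeps memory use bounded by the page size rather than by the size of the portal.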

Streaming & Delta Detection

A memory-efficient streaming pipeline handles very large portals (100k+ datasets). Built-in Smart Delta Detection skips re-embedding unchanged datasets, which saves 99.8% of API costs per run.
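
One common way to implement this kind of delta detection is to fingerprint the text that feeds the embedding and skip the embedding call when the fingerprint has not changed since the previous harvest. The sketch below assumes that approach. The function names are hypothetical, and the choice of title plus description as the hashed fields is a guess rather than Ceres's documented behaviour.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Fingerprint of the fields assumed to feed the embedding.
/// `DefaultHasher` keeps the sketch dependency-free, but its output is not
/// guaranteed stable across Rust releases; a persisted index would want a
/// fixed algorithm such as SHA-256 instead.
fn content_hash(title: &str, description: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    title.hash(&mut hasher);
    description.hash(&mut hasher);
    hasher.finish()
}

/// Returns true when the dataset is new or its content changed since the
/// last call, and records the latest fingerprint either way.
fn needs_embedding(
    seen: &mut HashMap<String, u64>,
    id: &str,
    title: &str,
    description: &str,
) -> bool {
    let hash = content_hash(title, description);
    match seen.insert(id.to_string(), hash) {
        Some(previous) if previous == hash => false, // unchanged: skip the API call
        _ => true,                                   // new or modified: embed again
    }
}
```

On a repeat harvest where almost nothing has changed, nearly every dataset takes the `false` branch, which is where the savings on embedding calls come from.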

Production-Ready Scale

Currently powers an index of 354,000+ datasets across 25 pre-verified portals (from Australia to Italy to HDX), all exposed through an Axum REST API with OpenAPI documentation.

Tech Stack & Extensibility

  • Core: Rust (async with Tokio)
  • Database: PostgreSQL 16+ with pgvector
  • Embeddings: Pluggable Backend (Google Gemini, OpenAI)
  • Exports: JSON, CSV, JSON Lines, Parquet

Ceres Architecture Diagram

Now (v0.3.0)

Multi-portal streaming harvests, Gemini / OpenAI embeddings, Advanced CLI, PostgreSQL + pgvector, Axum REST API, CSV/JSON/Parquet export capabilities.

Next (v0.4.0)

Scale and ecosystem tuning: production HNSW index tuning, multi-tenancy support, local embeddings via Ollama, schema-level search, Socrata/DCAT-AP expansion.