BatasDB: hybrid search across millions of Philippine legal documents
How we built a production-grade legal research database with hybrid BM25 + vector retrieval, citation-grounded AI answers, and 3.2 million indexed chunks.
The problem
Philippine legal research lives in a fragmented mess of government PDFs, Supreme Court e-libraries with broken search, and paid subscription services that lawyers in smaller practices can’t justify. A simple question — “what are the latest cases on probationary employment?” — can take an hour of source hunting.
The legal community wanted a single searchable database with natural-language queries and trustworthy citations. The existing tools either covered too narrow a slice (just statutes, no case law) or returned ranked results that were obviously wrong on technical queries (citation strings, acronyms, exact phrases).
What we built
BatasDB indexes the full body of Philippine statutes, case law, executive orders, and administrative regulations — roughly 3.2 million searchable chunks. Lawyers query in natural language. The system returns ranked results with citations, plus an AI-synthesized answer when appropriate, with every claim linked to a source.
Architecture decisions that mattered
Hybrid retrieval from day one. Pure semantic search broke on the kind of exact-string queries lawyers actually type (citation numbers, statute references). Pure keyword search missed paraphrased intent. The system combines both and uses rank-based fusion to merge the results.
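To make the fusion step concrete, here is a minimal sketch of reciprocal rank fusion (RRF), one common rank-based method. It is an assumption that the production system uses RRF specifically. The function name, the constant k=60, and the chunk IDs are illustrative and not taken from the BatasDB codebase.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several best-first lists of chunk IDs into one ranked list.

    Each list contributes 1 / (k + rank) to a chunk's score, so only rank
    positions matter and the raw BM25 and vector scores never need to be
    put on a common scale. k=60 is a conventional default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results from the two retrievers for the same query.
bm25_hits = ["statute-12", "case-7", "case-3"]
vector_hits = ["case-3", "case-7", "order-5"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)  # ['case-3', 'case-7', 'statute-12', 'order-5']
```

Chunks that both retrievers return rise to the top, while a chunk found by only one retriever still survives in the merged list. That is the behavior a mix of exact-string and paraphrased queries needs.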
Postgres with vector indexing instead of a dedicated vector DB. A single managed Postgres instance handles the relational data and the vector search side by side. Avoiding a second database means fewer moving parts, simpler backups, and transactional consistency between document metadata and embeddings.
Citation extraction on every AI answer. The first version of the chatbot occasionally invented citations. The current pipeline forces the model to map every claim back to a specific source, then verifies each mapping before the answer reaches the user. Unsupported claims are dropped. The pattern is covered in our chatbot hallucination explainer.
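The verify-then-drop step could look roughly like the sketch below. It assumes the model returns each claim together with a cited chunk ID and a supporting quote, and that verification means checking the quote actually appears in the cited chunk. Both are assumptions for illustration, since the text above does not specify the check. The `Claim` type, `verify_claims`, and the sample text are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str       # sentence that would be shown to the user
    source_id: str  # chunk the model says supports the sentence
    quote: str      # passage the model copied from that chunk


def _normalize(s: str) -> str:
    # Collapse whitespace and lowercase so formatting noise does not fail a match.
    return re.sub(r"\s+", " ", s).strip().lower()


def verify_claims(claims, chunks):
    """Return only the claims whose quote is found in the chunk they cite."""
    kept = []
    for claim in claims:
        source_text = chunks.get(claim.source_id)
        if source_text is None:
            continue  # cited chunk was never retrieved: treat as invented
        quote = _normalize(claim.quote)
        if quote and quote in _normalize(source_text):
            kept.append(claim)
    return kept


# Invented sample data, not real legal text.
chunks = {"chunk-1": "The notice must be filed within thirty days of receipt."}
claims = [
    Claim("Notices are due within thirty days.", "chunk-1", "filed within  thirty days"),
    Claim("Late filings are always excused.", "chunk-1", "late filings are excused"),
    Claim("A fee applies.", "chunk-9", "a fee applies"),
]

supported = verify_claims(claims, chunks)
print([c.text for c in supported])
```

Only the first claim survives: the second quotes text that is not in the chunk, and the third cites a chunk that does not exist. A substring check is deliberately strict. A real pipeline might add fuzzy matching or an entailment check, at the cost of letting some unsupported claims through.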
FastAPI for streaming. Lawyers asking research questions want to see answers appear progressively. FastAPI’s async support and server-sent events (SSE) made streaming straightforward to implement. Rails would have worked, but with more friction.
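As a rough illustration of the streaming shape, the sketch below builds SSE frames from an async token stream using only the standard library, so it runs on its own. In a FastAPI app, an async generator like `answer_stream` could be passed to `StreamingResponse` with `media_type="text/event-stream"`. The event names, the payload shape, and the stand-in token source are assumptions, not BatasDB's actual protocol.

```python
import asyncio
import json
from typing import AsyncIterator


def sse_event(data: dict, event: str = "message") -> str:
    """Format one SSE frame: an event line, a data line, then a blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


async def fake_llm_tokens() -> AsyncIterator[str]:
    # Stand-in for a hosted LLM's streaming API (hypothetical).
    for token in ["Hello ", "world"]:
        await asyncio.sleep(0)  # yield control, as a real network call would
        yield token


async def answer_stream(tokens: AsyncIterator[str]) -> AsyncIterator[str]:
    """Wrap each token in an SSE frame and finish with a 'done' event."""
    async for token in tokens:
        yield sse_event({"token": token}, event="token")
    yield sse_event({}, event="done")


async def collect() -> list:
    return [frame async for frame in answer_stream(fake_llm_tokens())]


frames = asyncio.run(collect())
print("".join(frames))
```

Sending an explicit final event lets the client distinguish a completed answer from a dropped connection.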
Eval harness as a first-class artifact. We maintain 240 hand-labeled queries, each with its expected source chunks. The harness runs on every retrieval-pipeline change and tracks recall@10, recall@20, and mean reciprocal rank (MRR). Without it, quality regressions would have shipped unnoticed.
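The metrics named above are simple to compute once each query has a labeled set of expected chunks. The following is a minimal sketch under that assumption. The function names, the `(query, expected_ids)` case format, and the toy search function are hypothetical and are not the project's actual harness.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunk IDs that appear in the top k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(cases, search, ks=(10, 20)):
    """Average recall@k and MRR over labeled (query, expected_ids) cases."""
    totals = {f"recall@{k}": 0.0 for k in ks}
    totals["mrr"] = 0.0
    for query, expected in cases:
        retrieved = search(query)
        for k in ks:
            totals[f"recall@{k}"] += recall_at_k(retrieved, expected, k)
        totals["mrr"] += reciprocal_rank(retrieved, expected)
    n = len(cases) or 1  # avoid division by zero on an empty label set
    return {name: total / n for name, total in totals.items()}


# Toy stand-in for the retrieval pipeline under test.
canned_results = {"q1": ["x", "a", "y"], "q2": ["c", "z"]}
cases = [("q1", ["a", "b"]), ("q2", ["c"])]

report = evaluate(cases, lambda q: canned_results[q])
print(report)
```

Running the same `evaluate` call before and after a pipeline change, and comparing the two reports, is what turns a change such as an embedding-model swap into a measurable result.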
What we learned
Vector search alone is a trap for technical domains. Any corpus with exact identifiers, acronyms, or rare-but-precise terms needs hybrid retrieval. We learned this on legal documents but the pattern applies to medical, regulatory, scientific, and product-catalog domains equally.
Quality is invisible without an eval. We tried swapping our default embedding model for a more expensive, supposedly better one, and the eval showed it was marginally worse on our corpus. Without measurement, we’d have shipped the regression and paid more for it.
RAG citation hallucination is the silent killer. Users trust footnotes implicitly. A chatbot that confidently cites the wrong source is more dangerous than one that admits it doesn’t know. Structured verification is non-negotiable.
Stack
A Python backend for the async-heavy retrieval pipeline, Postgres with both lexical and vector indexes in a single database, and a mix of hosted LLMs depending on query type. We keep infrastructure simple (one database, one provider per concern) so the system stays easy to operate.
Visit the live product at batasdb.ph.