Building a Production Legal AI Database in the Philippines: BatasDB Case Study

BatasDB is a production legal AI database built in the Philippines for Philippine law firms and in-house legal teams. It indexes the full body of Philippine statutes, case law, executive orders, and administrative regulations, and lets lawyers ask questions in plain English and get answers grounded in real Philippine legal sources.

The problem

Philippine legal research lives in a fragmented mess of government PDFs, Supreme Court e-libraries with broken search, and paid subscription services that lawyers in smaller practices can’t justify. Even a simple question, such as “what are the latest cases on probationary employment?”, can take an hour of source hunting.

The legal community wanted a single searchable database with natural-language queries and trustworthy citations. The existing tools either covered too narrow a slice (just statutes, no case law) or returned ranked results that were obviously wrong on technical queries (citation strings, acronyms, exact phrases).

What we built

BatasDB indexes the full body of Philippine statutes, case law, executive orders, and administrative regulations. Lawyers query in natural language. The system returns ranked results with citations, plus an AI-synthesized answer when appropriate, with every claim linked to a source.

Architecture decisions that mattered

Hybrid retrieval from day one. Pure semantic search broke on the kind of exact-string queries lawyers actually type (citation numbers, statute references). Pure keyword search missed paraphrased intent. The system combines both and uses rank-based fusion to merge the results.
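
To make the merge step concrete, here is a minimal sketch of one common rank-based fusion method, reciprocal rank fusion (RRF). It assumes the keyword and semantic searches each return an ordered list of document IDs. The function name, the constant k=60, and the document IDs are illustrative assumptions, not BatasDB's actual code or settings.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    so only positions matter, never the raw scores of each retriever.
    k=60 is a commonly used default, assumed here for illustration.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the two retrievers.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
semantic_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

Because the fusion uses ranks only, the keyword and vector scores never need to be normalized against each other, which is the main reason rank-based fusion is attractive for hybrid search.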

Postgres with vector indexing instead of a dedicated vector DB. A single managed Postgres instance handles the relational data and the vector search side by side. Avoiding a second database means fewer moving parts, simpler backups, and transactional consistency between document metadata and embeddings.
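
As a rough illustration of what "side by side" can look like, the sketch below keeps the text, a full-text index, and an embedding on the same row. It assumes the pgvector extension (version 0.5.0 or later for HNSW indexes) and PostgreSQL 12 or later for generated columns. The table name, column names, and the 768-dimension embedding are hypothetical, and the article does not say which vector extension BatasDB uses.

```python
# SQL kept as Python constants so a backend could pass them to its driver.
# Assumption: pgvector supplies the `vector` type and the `<=>` operator.
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
    id          bigserial PRIMARY KEY,
    document_id bigint NOT NULL,
    body        text NOT NULL,
    -- Lexical side: tsvector maintained by Postgres itself.
    tsv         tsvector GENERATED ALWAYS AS (to_tsvector('english', body)) STORED,
    -- Semantic side: embedding stored on the same row (dimension is assumed).
    embedding   vector(768)
);

CREATE INDEX chunks_tsv_idx ON chunks USING gin (tsv);
CREATE INDEX chunks_embedding_idx ON chunks USING hnsw (embedding vector_cosine_ops);
"""

# Keyword search ranked by ts_rank; parameters use psycopg-style placeholders.
LEXICAL_SQL = """
SELECT id FROM chunks
WHERE tsv @@ plainto_tsquery('english', %(query)s)
ORDER BY ts_rank(tsv, plainto_tsquery('english', %(query)s)) DESC
LIMIT %(limit)s;
"""

# Nearest neighbours by cosine distance (the pgvector `<=>` operator).
VECTOR_SQL = """
SELECT id FROM chunks
ORDER BY embedding <=> %(query_embedding)s::vector
LIMIT %(limit)s;
"""


def to_vector_literal(values):
    """Format a list of floats as pgvector's text form, e.g. '[0.5,1.0]'."""
    return "[" + ",".join(str(float(v)) for v in values) + "]"
```

With this layout, a chunk's text and its embedding are written in one INSERT, so they commit or roll back together, which is the transactional consistency the paragraph above refers to.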

Citation extraction on every AI answer. The first version of the chatbot occasionally invented citations. The current pipeline forces the model to map every claim back to a specific source, then verifies each mapping before the answer reaches the user. Unsupported claims are dropped. The pattern is covered in our chatbot hallucination explainer.
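
The verification step can be sketched as follows. This assumes the model returns each claim with a source ID and a verbatim supporting quote, and that a claim counts as supported only if that quote appears in the cited chunk. The Claim structure, the quote-matching rule, and the sample text are assumptions made for illustration. The article does not describe how BatasDB checks each mapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str       # sentence the model wants to assert
    source_id: str  # chunk the model says supports it
    quote: str      # span the model says it copied from that chunk


def _normalize(s):
    """Lowercase and collapse whitespace so formatting differences don't matter."""
    return " ".join(s.lower().split())


def verify_claims(claims, retrieved):
    """Keep only claims whose quote really appears in the cited chunk.

    `retrieved` maps chunk IDs to chunk text for this query only, so a
    citation to anything the retriever did not return is rejected.
    """
    supported = []
    for claim in claims:
        source_text = retrieved.get(claim.source_id)
        if source_text is None:
            continue  # cites a chunk that was never retrieved
        if not claim.quote.strip():
            continue  # no evidence offered
        if _normalize(claim.quote) in _normalize(source_text):
            supported.append(claim)
    return supported


# Made-up chunk text and claims, purely for demonstration.
retrieved = {"chunk-1": "Sample source text: the notice period is thirty days."}
claims = [
    Claim("The notice period is thirty days.", "chunk-1", "notice period is thirty days"),
    Claim("The penalty is doubled.", "chunk-1", "the penalty is doubled"),
    Claim("An appeal is allowed.", "chunk-9", "an appeal is allowed"),
]

kept = verify_claims(claims, retrieved)
print([c.text for c in kept])
```

Exact-quote matching is deliberately strict: it will drop some true but paraphrased claims, which fits the stated preference for omitting a claim over showing an unsupported one.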

FastAPI for streaming. Lawyers asking research questions want to see answers appear progressively. FastAPI’s async support pairs naturally with server-sent events (SSE), which kept the streaming code clean. Rails would have worked, but with more friction.
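
A minimal sketch of the streaming side, using only the standard library: an async generator that turns tokens into SSE frames as they arrive. The event names, the JSON payload shape, and the fake token source are hypothetical stand-ins, not BatasDB's actual protocol.

```python
import asyncio
import json


def sse_event(data, event=None):
    """Format one server-sent event frame (terminated by a blank line)."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"


async def stream_answer(tokens):
    """Yield one SSE frame per token, then a final 'done' event."""
    async for token in tokens:
        yield sse_event({"token": token})
    yield sse_event({}, event="done")


async def fake_llm_tokens():
    # Stand-in for a hosted LLM's streaming response (hypothetical).
    for token in ["Hybrid ", "retrieval ", "works."]:
        await asyncio.sleep(0)  # hand control back to the event loop
        yield token


async def collect():
    return [frame async for frame in stream_answer(fake_llm_tokens())]


frames = asyncio.run(collect())
print("".join(frames))
```

In a FastAPI route, a generator like stream_answer can be returned wrapped in fastapi.responses.StreamingResponse with media_type="text/event-stream", so the browser receives each frame as soon as it is yielded.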

Eval harness as a first-class artifact. The harness holds 240 hand-labeled queries, each with its expected source chunks. It runs on every retrieval-pipeline change and tracks recall@10, recall@20, and mean reciprocal rank (MRR). Without it, we’d have shipped quality regressions without noticing.
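
The two metric families are simple to compute. The sketch below assumes each labeled case is a query plus the chunk IDs a correct retrieval should surface, and that the search function returns an ordered list of chunk IDs. The function names and the toy data are illustrative, not the actual harness.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the expected chunks that appear in the top k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first expected chunk, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(cases, search, ks=(10, 20)):
    """Average recall@k and MRR over (query, expected_chunk_ids) pairs."""
    if not cases:
        raise ValueError("evaluation set is empty")
    totals = {f"recall@{k}": 0.0 for k in ks}
    totals["mrr"] = 0.0
    for query, relevant in cases:
        retrieved = search(query)
        for k in ks:
            totals[f"recall@{k}"] += recall_at_k(retrieved, relevant, k)
        totals["mrr"] += reciprocal_rank(retrieved, relevant)
    return {name: total / len(cases) for name, total in totals.items()}


# Toy labeled set and a fake search function, for demonstration only.
cases = [("q1", ["c1", "c2"]), ("q2", ["c9"])]
fake_index = {"q1": ["c1", "x", "c2"], "q2": ["y", "c9"]}

report = evaluate(cases, fake_index.get, ks=(1, 2))
print(report)
```

Running the same function before and after a pipeline change, on the same labeled set, gives a direct comparison, which is how a change like an embedding-model swap can be judged on numbers.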

What we learned

Vector search alone is a trap for technical domains. Any corpus with exact identifiers, acronyms, or rare-but-precise terms needs hybrid retrieval. We learned this on legal documents but the pattern applies to medical, regulatory, scientific, and product-catalog domains equally.

Quality is invisible without an eval. We tried swapping our default embedding model for a more expensive, supposedly better one, and the eval showed it was marginally worse on our corpus. Without measurement, we’d have shipped the regression and paid more for it.

Citation hallucination in retrieval-augmented generation (RAG) is the silent killer. Users trust footnotes implicitly. A chatbot that confidently cites the wrong source is more dangerous than one that admits it doesn’t know. Structured verification is non-negotiable.

Who BatasDB is for

Small and mid-sized Philippine law firms, in-house counsel at Philippine companies, and legal aid organizations that need fast, citation-backed legal research without the cost of an enterprise subscription. The architecture also generalizes: if you’re a Philippine business with a private corpus of documents (contracts, policy archives, regulatory filings) and you want a private internal search-and-answer tool with the same rigor, the same approach applies.

Stack

A Python backend for the async-heavy retrieval pipeline, Postgres with both lexical and vector indexes in a single database, and a mix of hosted LLMs depending on query type. We keep infrastructure simple, with one database and one provider per concern, so the system stays operable.

Visit the live product at batasdb.ph.