RAG Chatbot Development

Retrieval-augmented chatbots that read your documents and cite their sources. Built with hybrid retrieval, evaluated continuously, deployed to your cloud.

Why RAG, and where it actually helps

The first time someone deploys a chatbot directly on a vanilla LLM and watches it confidently invent product features, refund policies, or legal citations, they discover why retrieval matters. RAG is the standard fix: ground the model in your actual content and have it refuse to answer outside of that content.

Where RAG shines: documentation chatbots, customer support over a knowledge base, internal search across SharePoint or Confluence, research assistants over scientific or legal corpora. Where RAG struggles: open-ended creative tasks, math, and queries that require multi-hop reasoning across many documents.

What we build into every RAG system

Hybrid retrieval

BM25 (lexical) plus vector search via pgvector or a standalone HNSW index (semantic), combined with reciprocal rank fusion or a learned re-ranker. We learned this the hard way on BatasDB: lawyers search for exact citation strings ("G.R. No. 123456") that pure vector search embarrassingly misses. Our writeup on hybrid search has the implementation details.
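To make the fusion step concrete, here is a minimal sketch of reciprocal rank fusion over two ranked lists. It is an illustration, not the BatasDB implementation: the document IDs are made up, and k=60 is a commonly used default rather than a value we have tuned here.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranked list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, not a tuned value.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results: the lexical list ranks the exact citation match first,
# the vector list ranks it third.
bm25_hits = ["case_gr_123456", "case_b", "case_c"]
vector_hits = ["case_b", "case_d", "case_gr_123456"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

Documents that appear in both lists rise to the top, so an exact citation match found only by BM25's top ranks still survives fusion even when the vector ranking buries it.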

Citation extraction

Every claim in the answer maps back to a source chunk. The UI shows citations the user can verify. Claims without supporting retrieval are filtered or flagged.
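One simple way to flag unsupported claims can be sketched as follows. It assumes the model is prompted to end each sentence with bracketed chunk numbers such as [1] placed before the final punctuation. That marker format, the naive sentence splitter, and the function name are all assumptions for illustration. The check only confirms that a citation exists and points at a retrieved chunk; it does not verify that the chunk actually supports the claim.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")


def flag_unsupported(answer, retrieved_ids):
    """Return (sentence, cited_ids, supported) for each sentence in the answer.

    A sentence counts as supported only if it cites at least one chunk and
    every cited ID is among the retrieved chunk IDs.
    """
    results = []
    # Naive split: sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if not sentence:
            continue
        cited = [int(n) for n in CITATION.findall(sentence)]
        supported = bool(cited) and all(c in retrieved_ids for c in cited)
        results.append((sentence, cited, supported))
    return results


# Hypothetical answer: chunks 1 and 2 were retrieved, chunk 3 was not.
answer = (
    "Refunds are issued within 30 days [1]. "
    "Shipping is free worldwide. "
    "Returns need a receipt [3]."
)
checked = flag_unsupported(answer, retrieved_ids={1, 2})
```

Sentences marked unsupported can then be dropped from the answer or shown with a warning in the UI.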

Eval harness

A labeled set of representative queries with expected source chunks and answer quality scores. Runs on every prompt change, every embedding model swap, every chunking strategy change. RAG quality is one of the easiest things to silently regress and one of the easiest to catch with an eval.
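The retrieval half of such a harness can be sketched in a few lines. The example below computes recall at K, one common retrieval metric and an assumption about which metric to use, against a labeled set. The queries, chunk IDs, and the stub retriever are hypothetical stand-ins for a real pipeline, and answer-quality scoring is left out.

```python
def recall_at_k(retrieved, expected, k):
    """Fraction of expected chunk IDs that appear in the top-k retrieved IDs."""
    if not expected:
        return 1.0
    top_k = set(retrieved[:k])
    return len(top_k & set(expected)) / len(expected)


def run_eval(labeled_set, retrieve, k=5):
    """Average recall@k over all labeled cases, using the given retriever."""
    scores = [
        recall_at_k(retrieve(case["query"]), case["expected_chunks"], k)
        for case in labeled_set
    ]
    return sum(scores) / len(scores)


# Hypothetical labeled cases.
LABELED = [
    {"query": "refund window", "expected_chunks": ["policy-3"]},
    {"query": "shipping regions", "expected_chunks": ["ship-1", "ship-2"]},
]


def fake_retrieve(query):
    """Stub retriever returning canned rankings; replace with the real one."""
    canned = {
        "refund window": ["policy-3", "policy-1"],
        "shipping regions": ["ship-1", "faq-9"],
    }
    return canned[query]


mean_recall = run_eval(LABELED, fake_retrieve, k=2)
```

Running this in CI and failing the build when the average drops below an agreed threshold is one way to catch a regression before it ships.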

Chunking that respects structure

Naive 512-token chunks tear sentences in half. We chunk on semantic boundaries (sections, paragraphs, list items) and store both the chunk and its parent document metadata. Retrieval matches on the chunk; the stored parent metadata gives the LLM enough surrounding context to answer well.
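As a rough illustration of structure-aware chunking, the sketch below splits a document on blank lines and attaches the parent document ID and the nearest heading to every chunk. It assumes headings start with "#", which will not hold for every corpus, and the field names are hypothetical.

```python
import re


def chunk_by_structure(doc_id, text):
    """Split text into paragraph chunks that carry parent metadata.

    Assumes blocks are separated by blank lines and headings start with '#'.
    """
    chunks = []
    section = ""  # nearest preceding heading, empty until one is seen
    for block in re.split(r"\n\s*\n", text.strip()):
        block = block.strip()
        if not block:
            continue
        if block.startswith("#"):
            # Headings become metadata for later chunks, not chunks themselves.
            section = block.lstrip("#").strip()
            continue
        chunks.append(
            {
                "text": block,
                "doc_id": doc_id,
                "section": section,
                "position": len(chunks),
            }
        )
    return chunks


sample = (
    "# Refunds\n\n"
    "Refunds are issued within 30 days.\n\n"
    "A receipt is required.\n\n"
    "# Shipping\n\n"
    "We ship worldwide."
)
chunks = chunk_by_structure("policy.md", sample)
```

Because each chunk keeps its section and position, the prompt builder can pull in neighboring chunks or the section title when a single paragraph is too thin to answer from.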

Refusal and fallback

When retrieval scores are low, the chatbot says so and either offers to escalate to a human or suggests reformulating. A chatbot that knows when to shut up is far better than one that guesses.
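The refusal gate can be sketched as a threshold check in front of generation. The 0.35 cutoff, the score scale (higher is better), and the function names are assumptions for illustration; in practice the threshold would be calibrated against the eval set.

```python
REFUSAL_MESSAGE = (
    "I couldn't find this in the documentation. "
    "Try rephrasing the question, or I can hand you over to a person."
)


def answer_or_refuse(hits, generate, min_score=0.35):
    """Generate an answer only when at least one hit clears the threshold.

    hits: list of (chunk_text, score) pairs, higher score meaning more relevant.
    generate: callable that takes the list of chunk texts and returns an answer.
    min_score: hypothetical cutoff; calibrate it on labeled queries.
    """
    confident = [text for text, score in hits if score >= min_score]
    if not confident:
        return {"answered": False, "message": REFUSAL_MESSAGE}
    return {"answered": True, "message": generate(confident)}


# Stub generator standing in for the real LLM call.
def stub_generate(context):
    return f"Based on {len(context)} passage(s)."


low = answer_or_refuse([("Unrelated text", 0.12)], stub_generate)
high = answer_or_refuse([("Refunds are issued within 30 days.", 0.81)], stub_generate)
```

Keeping the gate outside the prompt means the refusal does not depend on the model following instructions.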

Real example: BatasDB

BatasDB is our legal database product. It indexes the entire body of Philippine statutes and case law and lets lawyers query in natural language with citations. The hybrid search architecture, citation extraction, and eval methodology we use on client projects all came from production lessons learned there. See the case study and the writeup linked below for technical detail.

Related reading

Frequently asked questions

What is RAG and why does it matter?

Retrieval-Augmented Generation. Instead of letting the LLM answer from training data (where it can hallucinate), you retrieve relevant passages from your own documents at query time and pass them to the model as context. The model is instructed to answer only from what it was given. This is the single biggest unlock for trustworthy AI chatbots over a knowledge base.

Why not just use a vector database and call it RAG?

Because pure vector search is bad at queries with specific identifiers, exact phrases, or rare terms. Pure keyword search is bad at semantic similarity. Real production RAG uses hybrid retrieval, BM25 plus vector, with a re-ranker on top. The implementation pattern is in our hybrid-search writeup.

How do you measure RAG quality?

Eval harness. We build a labeled set of representative queries and expected sources, run them on every change, and track retrieval-at-K plus answer faithfulness. Without an eval harness, RAG quality drifts silently. We don't ship without one.

How long does RAG chatbot development take?

A working RAG chatbot over a single document corpus takes 3 to 6 weeks. Adding citations, channel deployments, multi-source retrieval, and an eval harness pushes it to 8 to 12 weeks. Maintaining and tuning it is ongoing: RAG is not fire-and-forget.

Can you deploy RAG on our private cloud?

Yes. We use a managed Postgres with vector indexing on DigitalOcean or Supabase. The LLM can be OpenAI, Anthropic, or a self-hosted open-weight model if data residency matters. The application deploys to DigitalOcean, AWS, or Render.

Need RAG done right?

Tell us about the corpus and the use case. We'll tell you whether RAG is the answer and what the architecture should look like.
