Building RAG that doesn't lie

Most RAG demos look great until you ask a question the docs don't answer. Then the model fills the gap with confident nonsense. This is a note on the parts that matter for production.

Chunking is a retrieval decision

Chunk size is not a formatting choice; it shapes recall. Too large and the embedding blurs across topics. Too small and you lose the context that makes a passage useful. Start with semantic boundaries (headings, paragraphs), not a fixed token count.
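
One way to sketch boundary-first chunking, assuming plain text where paragraphs are separated by blank lines. The function name chunk_by_paragraphs and the max_chars budget are illustrative choices, not a recommendation. Heading-aware splitting would follow the same pattern.

```python
import re


def chunk_by_paragraphs(text: str, max_chars: int = 800) -> list[str]:
    """Group whole paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole rather than cut
    mid-thought. (max_chars is a hypothetical budget; tune it on your data.)
    """
    # Paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the chunk here.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


doc = "Intro paragraph.\n\nDetails about setup.\n\nA separate topic entirely."
chunks = chunk_by_paragraphs(doc, max_chars=40)
```

With this toy input the first two paragraphs share a chunk and the third gets its own, because a split only ever happens between paragraphs.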

Rerank before you trust top-k

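Vector search gives you candidates, not answers. Running a cross-encoder reranker over the top 20 candidates usually produces a better ordering than raw cosine similarity, because the cross-encoder reads the query and the passage together instead of comparing two independently computed embeddings. The extra latency is usually worth it.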

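# Score every (query, doc) pair, then sort on the score alone so tied scores never compare docs.
scores = reranker.predict([(query, doc) for doc in candidates])
ranked = [doc for _, doc in sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)]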

The eval harness is the product

If you can't measure hallucination, you can't fix it. Build a small labeled set of (question, expected answer, must-cite source) triples, and score every change against it. A RAG system without an eval loop is a vibe, not a system.
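
A minimal sketch of such a harness, under stated assumptions: EvalCase, RagResponse, evaluate, and fake_rag are hypothetical names, the example questions and source IDs are made up, and substring matching is a crude stand-in for whatever correctness check fits your domain.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    question: str
    expected_answer: str
    must_cite: str  # ID of the source the answer has to cite


@dataclass
class RagResponse:
    answer: str
    cited_sources: list[str]


def evaluate(cases: list[EvalCase], ask: Callable[[str], RagResponse]) -> dict[str, float]:
    """Return the fraction of cases answered correctly and the fraction citing the required source."""
    if not cases:
        raise ValueError("need at least one eval case")
    correct = cited = 0
    for case in cases:
        response = ask(case.question)
        # Crude correctness check: the expected answer appears in the response text.
        if case.expected_answer.lower() in response.answer.lower():
            correct += 1
        if case.must_cite in response.cited_sources:
            cited += 1
    total = len(cases)
    return {"answer_accuracy": correct / total, "citation_rate": cited / total}


def fake_rag(question: str) -> RagResponse:
    # Stand-in for the real pipeline so the sketch runs on its own.
    if "refund" in question:
        return RagResponse("Refunds are issued within 30 days.", ["policy.md"])
    return RagResponse("Rotate keys in the dashboard.", ["blog-post.md"])


cases = [
    EvalCase("What is the refund window?", "30 days", "policy.md"),
    EvalCase("How do I rotate API keys?", "dashboard", "security.md"),
]
report = evaluate(cases, fake_rag)
```

Tracking the two numbers separately matters: the second made-up case gets the right words from the wrong source, which a single accuracy figure would hide.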

Ship the eval harness before you ship the chatbot.