Building RAG that doesn't lie
Most RAG demos look great until you ask a question the docs don't answer. Then the model fills the gap with confident nonsense. This is a note on the parts that matter for production.
Chunking is a retrieval decision
Chunk size is not a formatting choice; it shapes recall. Too large, and the embedding blurs across topics. Too small, and you lose the context that makes a passage useful. Start with semantic boundaries (headings, paragraphs), not a fixed token count.
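Here is a minimal sketch of boundary-first chunking, assuming plain-text input where paragraphs are separated by blank lines. The function name chunk_by_paragraph and the max_chars budget are illustrative, not from any particular library. It packs whole paragraphs into a chunk until the budget would be exceeded, and never cuts a paragraph in half.

```python
import re


def chunk_by_paragraph(text: str, max_chars: int = 800) -> list[str]:
    """Pack whole paragraphs into chunks of roughly max_chars characters.

    A single paragraph longer than max_chars is kept intact as its own
    chunk; this sketch never splits inside a paragraph.
    """
    # Paragraph boundary = one or more blank lines.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

The budget is in characters only to keep the sketch dependency-free. A real pipeline would more likely count tokens with the embedding model's tokenizer and treat headings as hard boundaries.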
Rerank before you trust top-k
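Vector search gives you candidates, not answers. Running a cross-encoder reranker over the top 20 candidates usually beats ranking by raw cosine similarity, because the cross-encoder reads the query and the passage together instead of comparing two embeddings that were computed independently. The extra latency is usually worth it.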
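# Score each (query, doc) pair, then sort on the score alone so that tied scores never fall back to comparing the docs themselves.
scores = reranker.predict([(query, doc) for doc in candidates])
ranked = [doc for _, doc in sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)]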
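The same step can be wrapped in a small function that also trims the list after reranking. This is a sketch under assumptions: rerank, score_pairs, and keep are hypothetical names, and word_overlap is a toy stand-in scorer so the example runs without a model. In practice score_pairs would be something like the predict method of a cross-encoder.

```python
from typing import Callable, Sequence

# Hypothetical scorer interface: a batch of (query, doc) pairs in, one score per pair out.
ScoreFn = Callable[[Sequence[tuple[str, str]]], Sequence[float]]


def rerank(query: str, candidates: Sequence[str], score_pairs: ScoreFn, keep: int = 5) -> list[str]:
    """Return the `keep` highest-scoring candidates, best first."""
    scores = score_pairs([(query, doc) for doc in candidates])
    # Sort indices by score so the documents themselves are never compared.
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:keep]]


def word_overlap(pairs: Sequence[tuple[str, str]]) -> list[float]:
    """Toy scorer for illustration only: count of shared lowercase words."""
    return [float(len(set(q.lower().split()) & set(d.lower().split()))) for q, d in pairs]


docs = ["billing and invoices", "how to reset your password", "password policy"]
top = rerank("reset password", docs, word_overlap, keep=2)
```

Keeping the scorer as a parameter makes it easy to compare rerankers against each other without touching the retrieval code.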
The eval harness is the product
If you can't measure hallucination, you can't fix it. Build a small labeled set of (question, expected answer, must-cite source) triples, and score every change against it. A RAG system without an eval loop is a vibe, not a system.
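A harness like this can start very small. The sketch below assumes your pipeline can be called as a function that takes a question and returns the answer text plus a list of cited source IDs. EvalCase, run_eval, and that signature are all hypothetical, and the substring check is a deliberately crude stand-in for whatever answer-matching you end up trusting.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    question: str
    expected_answer: str
    must_cite: str  # ID of the source the answer has to cite


# Assumed pipeline signature: question -> (answer text, cited source IDs).
AnswerFn = Callable[[str], tuple[str, list[str]]]


def run_eval(cases: list[EvalCase], answer_fn: AnswerFn) -> dict[str, float]:
    """Score a pipeline on answer correctness and required citations."""
    if not cases:
        raise ValueError("need at least one eval case")

    answer_hits = 0
    citation_hits = 0
    for case in cases:
        answer, cited = answer_fn(case.question)
        # Crude correctness check: expected answer appears in the response.
        if case.expected_answer.lower() in answer.lower():
            answer_hits += 1
        # The response must cite the source that actually contains the answer.
        if case.must_cite in cited:
            citation_hits += 1

    total = len(cases)
    return {
        "answer_accuracy": answer_hits / total,
        "citation_rate": citation_hits / total,
    }
```

Tracking the two numbers separately is a design choice: a correct answer with the wrong citation and a cited answer that is wrong are different failures, and they usually point to different fixes.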
Ship the eval harness before you ship the chatbot.