RAG (retrieval-augmented generation) sounds straightforward: embed your documents, store them in a vector database, retrieve relevant chunks at query time, and pass them to an LLM as context. In practice, there are at least a dozen ways to get this wrong in production.
Chunking strategy matters more than the model
Your retrieval quality is largely determined by how you chunk your source documents. If chunks are too small, each one lacks enough context to be useful. If they are too large, you burn context-window budget on irrelevant content, and each embedding has to represent several topics at once, so similarity scores become less discriminative.
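from datetime import datetime, timezone

# Newer LangChain releases also expose this class from `langchain_text_splitters`.
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Recursive splitting respects document structure: it tries paragraph breaks
# first, then line breaks, sentences, words, and finally single characters.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,  # Measured in characters, because length_function=len
    chunk_overlap=100,  # Overlap prevents context loss at chunk boundaries
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=len,
)

# `documents`, `source_id`, and `tenant_id` are assumed to be defined by the caller.
chunks = splitter.split_documents(documents)

# Add metadata to every chunk for filtering and attribution
indexed_at = datetime.now(timezone.utc).isoformat()  # One timezone-aware timestamp per run
for chunk in chunks:
    chunk.metadata.update({
        "source_id": source_id,
        "tenant_id": tenant_id,
        "indexed_at": indexed_at,
    })

Hybrid retrieval outperforms pure semantic search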
Dense retrieval (embedding similarity) can miss exact keyword matches. A product name, an error code, a person's name: these often retrieve poorly from embedding search because the embedding space captures meaning rather than exact tokens. Combining dense retrieval with BM25 sparse retrieval typically produces better results across diverse query types.
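To make the idea concrete, here is a minimal sketch in plain Python. It is not a production retriever. It assumes a toy whitespace tokenizer and a hand-written BM25 scorer, and it hard-codes the dense ranking as a stand-in for what an embedding search might return. The fusion step uses reciprocal rank fusion (RRF), which is one common way to merge rankings and is our choice for illustration; the function names, the sample documents, and the `k=60` constant are all assumptions.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Toy tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def bm25_ranking(query: str, docs: dict[str, str], k1: float = 1.5, b: float = 0.75) -> list[str]:
    """Rank doc ids by BM25 score, dropping docs that share no term with the query."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / n_docs
    # Document frequency: in how many docs each term appears.
    doc_freq = Counter(term for t in tokens.values() for term in set(t))

    scores: dict[str, float] = {}
    for doc_id, doc_tokens in tokens.items():
        term_freq = Counter(doc_tokens)
        score = 0.0
        for term in tokenize(query):
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf + k1 * (1 - b + b * len(doc_tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each doc earns 1 / (k + rank) from every list it appears in."""
    fused: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


docs = {
    "d1": "reset your password from the account settings page",
    "d2": "error E4012 means the payment gateway timed out",
    "d3": "troubleshooting failed payments and declined cards",
}
query = "what does error E4012 mean"

sparse = bm25_ranking(query, docs)
# Hypothetical dense ranking: an embedding search that prefers the generic
# payments article over the one containing the exact error code.
dense = ["d3", "d2", "d1"]

fused = reciprocal_rank_fusion([dense, sparse])
print(fused)
```

Because RRF works on ranks rather than raw scores, it sidesteps the problem that BM25 scores and cosine similarities live on different scales. In this toy case the document containing the literal error code moves to the top once the sparse ranking is included.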
Evaluating your RAG pipeline
The question most teams skip: how do you know your RAG is actually working? The minimum viable evaluation: build a test set of 20–30 question/answer pairs from your knowledge base. For each pair, check two things: (1) did the retrieved context contain the answer, and (2) did the LLM use that context correctly. These can fail independently.
Retrieval accuracy and generation accuracy are separate concerns. A perfectly accurate retrieval system paired with a poorly prompted LLM will still give bad answers — and vice versa. Measure them independently.
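A rough sketch of what that split measurement could look like follows. It assumes each test case has a short expected answer string, and it uses case-insensitive substring matching as a crude proxy for "the context contained the answer" and "the model answered correctly"; a real pipeline would likely need something stricter. `EvalCase`, `evaluate`, and the two stub functions are hypothetical names, and `retrieve` and `generate` stand in for your own retriever and LLM call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    question: str
    expected: str  # Short answer string we expect to see


def evaluate(
    cases: list[EvalCase],
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str, list[str]], str],
) -> dict[str, float]:
    """Score retrieval and generation separately over a small test set."""
    if not cases:
        raise ValueError("need at least one test case")

    retrieval_hits = 0
    correct = 0
    correct_given_hit = 0
    for case in cases:
        needle = case.expected.lower()
        chunks = retrieve(case.question)
        # (1) Did the retrieved context contain the answer?
        hit = any(needle in chunk.lower() for chunk in chunks)
        # (2) Did the model produce the answer from that context?
        answer = generate(case.question, chunks)
        ok = needle in answer.lower()

        retrieval_hits += int(hit)
        correct += int(ok)
        correct_given_hit += int(hit and ok)

    return {
        "retrieval_hit_rate": retrieval_hits / len(cases),
        "answer_accuracy": correct / len(cases),
        # Isolates the generation step: accuracy only where retrieval succeeded.
        "answer_accuracy_given_hit": (
            correct_given_hit / retrieval_hits if retrieval_hits else 0.0
        ),
    }


# Stub retriever and generator, for illustration only.
def fake_retrieve(question: str) -> list[str]:
    if "port" in question:
        return ["The API listens on port 8443."]
    return ["Office hours are 9 to 5."]


def fake_generate(question: str, chunks: list[str]) -> str:
    return chunks[0] if "port" in question else "I don't know."


cases = [
    EvalCase("Which port does the API use?", "8443"),
    EvalCase("Who owns billing?", "payments team"),
]
report = evaluate(cases, fake_retrieve, fake_generate)
print(report)
```

A low `retrieval_hit_rate` points at chunking or search, while a low `answer_accuracy_given_hit` points at the prompt or the model, which is why reporting them separately is more useful than a single end-to-end score.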
Prompt engineering for RAG
SYSTEM_PROMPT = """You are a helpful assistant for {company_name}.
Answer questions based ONLY on the provided context.
If the context doesn't contain enough information to answer confidently,
say so clearly; do not speculate or make things up.
Always cite which part of the context you used."""


def build_prompt(query: str, context_chunks: list[str], company_name: str) -> list[dict]:
    # Separate chunks with a visible delimiter so the model can tell them apart
    context = "\n\n---\n\n".join(context_chunks)
    return [
        # Fill the {company_name} placeholder; otherwise it reaches the model literally
        {"role": "system", "content": SYSTEM_PROMPT.format(company_name=company_name)},
        {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}",
        },
    ]

Auravon AI
Engineering Studio