AI Solutions | 15 min read | April 28, 2026

How to Build a RAG Chatbot with LangChain and Pinecone (Production Guide)

Most RAG tutorials stop at 'it works in the notebook.' Here is how to build a retrieval-augmented chatbot that holds up in production — chunking strategy, hybrid retrieval, prompt engineering for RAG, and how to evaluate accuracy before shipping.

Tags: RAG, LangChain, Vector DB, OpenAI
RAG (retrieval-augmented generation) sounds straightforward: embed your documents, store them in a vector database, retrieve relevant chunks at query time, and pass them to an LLM as context. In practice, there are at least a dozen ways to get this wrong in production.
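
To make the four steps concrete, here is a minimal, dependency-free sketch of the embed, store, retrieve, and pass-as-context flow. Everything in it is a stand-in chosen for illustration: `embed` is a toy bag-of-words counter rather than a real embedding model, `InMemoryIndex` is a hypothetical substitute for a vector database such as Pinecone, and the sample chunks are invented. The LLM call itself is left out.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words term count."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryIndex:
    """Hypothetical stand-in for a vector database."""

    def __init__(self) -> None:
        self._items: list[tuple[str, Counter]] = []

    def upsert(self, chunk: str) -> None:
        # Store the chunk text next to its vector
        self._items.append((chunk, embed(chunk)))

    def query(self, text: str, top_k: int = 2) -> list[str]:
        # Rank stored chunks by similarity to the query vector
        q = embed(text)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:top_k]]


index = InMemoryIndex()
for chunk in [
    "Refunds are processed within five business days.",
    "Our office is closed on public holidays.",
    "Password resets require a verified email address.",
]:
    index.upsert(chunk)

retrieved = index.query("How long do refunds take?", top_k=1)

# The retrieved chunks become the context handed to the LLM (call omitted)
context = "\n\n".join(retrieved)
```

The shape stays the same when the toy pieces are swapped for a real embedding model and a hosted index. What changes in production is everything around it, which is what the rest of this guide covers.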

Chunking strategy matters more than the model

Your retrieval quality is largely determined by how you chunk your source documents. If chunks are too small, each one lacks enough context to be useful on its own. If chunks are too large, you spend context-window budget on irrelevant content, and each embedding blends several topics, so similarity scores become less discriminative.

python
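from datetime import datetime, timezone

# Newer LangChain releases expose this class from the
# langchain_text_splitters package instead
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Recursive splitting respects document structure: it tries paragraph
# breaks first, then line breaks, sentences, words, and finally characters
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,     # Measured in characters because length_function=len
    chunk_overlap=100,  # Overlap prevents context loss at chunk boundaries
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=len,
)

# `documents` is assumed to be a list of LangChain Document objects loaded earlier
chunks = splitter.split_documents(documents)

# Add metadata to every chunk for filtering and attribution.
# `source_id` and `tenant_id` are assumed to be supplied by the ingestion job.
indexed_at = datetime.now(timezone.utc).isoformat()  # utcnow() is deprecated
for chunk in chunks:
    chunk.metadata.update({
        "source_id": source_id,
        "tenant_id": tenant_id,
        "indexed_at": indexed_at,
    })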

Hybrid retrieval outperforms pure semantic search

Dense retrieval (embedding similarity) misses exact keyword matches. A product name, an error code, a person's name — these often retrieve poorly from embedding search because the semantic space doesn't capture lexical specificity. Combining dense retrieval with BM25 sparse retrieval consistently produces better results across diverse query types.
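
One simple way to combine the two result lists is reciprocal rank fusion (RRF). The article does not prescribe a fusion method, so treat this as one option among several, not the recommended one. The sketch assumes each retriever already returns chunk IDs ordered best-first. The chunk IDs are hypothetical, and `k=60` is a commonly used default, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several best-first ranked lists of chunk IDs into one list.

    Each list contributes 1 / (k + rank) to an ID's score, with rank
    starting at 1, so IDs that rank well in several lists rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


# Hypothetical results from the two retrievers, best match first
dense_hits = ["chunk-7", "chunk-2", "chunk-9"]   # embedding similarity
sparse_hits = ["chunk-4", "chunk-7", "chunk-2"]  # BM25 keyword match

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

RRF uses only rank positions, so it sidesteps the problem that BM25 scores and cosine similarities live on different scales. A weighted sum of normalized scores is a reasonable alternative when you want to tune how much each retriever contributes.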

Evaluating your RAG pipeline

The question most teams skip: how do you know your RAG is actually working? The minimum viable evaluation: build a test set of 20–30 question/answer pairs from your knowledge base. For each pair, check two things: (1) did the retrieved context contain the answer, and (2) did the LLM use that context correctly. These can fail independently.

Retrieval accuracy and generation accuracy are separate concerns. A perfectly accurate retrieval system paired with a poorly prompted LLM will still give bad answers — and vice versa. Measure them independently.
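
A minimal harness for the two checks might look like the sketch below. It makes several assumptions: `retrieve` and `generate` are hypothetical callables that wrap your own retriever and LLM call, the test cases and stub functions are invented, and correctness is approximated by a case-insensitive substring match, which is crude and only suits short factual answers.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    question: str
    expected_answer: str  # short string a correct answer must contain


def evaluate(
    cases: list[EvalCase],
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str, list[str]], str],
) -> dict[str, float]:
    """Score retrieval and generation separately over a test set."""
    if not cases:
        raise ValueError("The test set is empty")
    retrieval_hits = 0
    generation_hits = 0
    for case in cases:
        chunks = retrieve(case.question)
        needle = case.expected_answer.lower()
        # (1) Did the retrieved context contain the answer?
        if any(needle in chunk.lower() for chunk in chunks):
            retrieval_hits += 1
        # (2) Did the generated answer contain it?
        if needle in generate(case.question, chunks).lower():
            generation_hits += 1
    return {
        "retrieval_hit_rate": retrieval_hits / len(cases),
        "answer_accuracy": generation_hits / len(cases),
    }


# Stub retriever and generator so the harness runs without any services
def fake_retrieve(question: str) -> list[str]:
    return ["Refunds are processed within five business days."]


def fake_generate(question: str, chunks: list[str]) -> str:
    if "refunds" in question.lower():
        return "Refunds take five business days."
    return "I don't know."


cases = [
    EvalCase("How long do refunds take?", "five business days"),
    EvalCase("Which plan includes SSO?", "Enterprise"),
]
report = evaluate(cases, fake_retrieve, fake_generate)
```

Reporting the two numbers side by side shows where to spend effort. A low retrieval hit rate points at chunking or retrieval, while a high hit rate with low answer accuracy points at the prompt or the model.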

Prompt engineering for RAG

python
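SYSTEM_PROMPT = """You are a helpful assistant for {company_name}.
Answer questions based ONLY on the provided context.
If the context doesn't contain enough information to answer confidently,
say so clearly; do not speculate or make things up.
Always cite which part of the context you used, by its bracketed number."""

def build_prompt(query: str, context_chunks: list[str], company_name: str) -> list[dict]:
    # Number each chunk so the model has something concrete to cite, e.g. "[2]"
    context = "\n\n---\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(context_chunks, start=1)
    )
    return [
        # Fill the {company_name} placeholder; otherwise the literal
        # "{company_name}" would be sent to the model
        {"role": "system", "content": SYSTEM_PROMPT.format(company_name=company_name)},
        {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}"
        },
    ]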

Auravon AI

Engineering Studio
