Retrieval-Augmented Generation (RAG) is the architecture behind almost every production AI chatbot we ship. It is not the only option -- fine-tuning and pure prompt engineering have their place -- but for the 80% of use cases where a product needs to answer questions grounded in a private knowledge base, RAG is faster to ship, cheaper to maintain, and easier to update.

What RAG Actually Does

A pure LLM has two problems for product use cases: its knowledge cuts off at training time, and it cannot access your proprietary data. RAG solves both by adding a retrieval step before generation:

  1. User submits a query.
  2. The query is converted to an embedding (a vector representation).
  3. Similar embeddings are retrieved from your vector store.
  4. The retrieved chunks are injected into the LLM prompt as context.
  5. The LLM generates an answer grounded in that context.

Simple in principle. The complexity is in steps 2, 3, and 4 -- and most production failures happen there.
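The retrieval core of steps 2-4 can be sketched with an in-memory store and brute-force cosine similarity. This is a minimal illustration, not a production pattern: `query_vec` is assumed to come from an embedding model, and a real system would use an ANN index instead of a linear scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, store, k=3):
    """store: list of (vector, chunk_text) pairs; returns top-k chunks."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]

def build_prompt(question, chunks):
    """Step 4: inject retrieved chunks into the LLM prompt as context."""
    context = "\n\n".join(chunks)
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"
```

Everything past this point -- chunking, the vector store, re-ranking -- is about making that `retrieve` call return the right chunks.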

Chunking: The Problem Everyone Underestimates

How you split your source documents determines the ceiling on your chatbot's accuracy. Get chunking wrong and no amount of prompt engineering saves you.

Naive chunking (do not use in production)

Splitting by fixed character count (chunk_size=512) is fast to implement and consistently mediocre. It cuts sentences mid-thought, severs context between chunks, and produces irrelevant retrievals.

Semantic chunking (use this)

Split on natural boundaries: paragraphs, sections, or -- for structured data -- rows. For long documents with headings, preserve the heading in every child chunk so the model knows the section context even when a chunk is retrieved in isolation.

# Example: heading-aware chunking
def chunk_with_context(text, max_tokens=400):
    # split_by_heading yields (heading, body) pairs, one per section
    sections = split_by_heading(text)
    chunks = []
    for heading, body in sections:
        # split_by_paragraph packs paragraphs up to the token budget
        for para in split_by_paragraph(body, max_tokens):
            # prepend the heading so each chunk carries its section context
            chunks.append(f"{heading}\n\n{para}")
    return chunks
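The two helpers are left abstract above. One minimal way to implement them, assuming markdown-style `#` headings and approximating tokens as whitespace-separated words (swap in a real tokenizer for accurate budgets):

```python
import re

def split_by_heading(text):
    """Split markdown-ish text into (heading, body) pairs on '#' headings."""
    parts = re.split(r"(?m)^(#+ .+)$", text)
    # re.split keeps the captured headings; parts[0] is any preamble
    pairs = iter(parts[1:])
    return [(h.strip(), b.strip()) for h, b in zip(pairs, pairs)]

def split_by_paragraph(body, max_tokens):
    """Greedily pack paragraphs into chunks under an approximate token budget."""
    chunks, current, count = [], [], 0
    for para in body.split("\n\n"):
        words = len(para.split())
        if current and count + words > max_tokens:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```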

Vector Stores: Choosing the Right Database

For most MVPs, you do not need a dedicated vector database. Here is our decision tree:

  • Under 100k vectors, Supabase/pgvector: Zero additional infrastructure. Built into the database you already have. Fast enough for most B2B products at launch.
  • 100k-10M vectors, Pinecone or Weaviate: Managed services with fast approximate nearest-neighbor (ANN) search. Worth the monthly cost above ~500k vectors.
  • 10M+ vectors: Self-hosted Weaviate or Qdrant on dedicated compute. You will know when you are here.
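At the pgvector tier, retrieval is a single SQL query using the `<=>` cosine-distance operator. A sketch that builds that query -- the `documents` table and `embedding` column names are placeholders for your own schema:

```python
def pgvector_topk_sql(table="documents", column="embedding", k=3):
    """Build a top-k query for pgvector; <=> is cosine distance,
    so 1 - distance gives a similarity score."""
    return (
        f"SELECT id, content, 1 - ({column} <=> %(query)s::vector) AS score "
        f"FROM {table} "
        f"ORDER BY {column} <=> %(query)s::vector "
        f"LIMIT {k}"
    )
```

Execute it with your Postgres client of choice, passing the query embedding as the `query` parameter.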

Re-ranking: The Accuracy Multiplier

Retrieval gives you the top-k most similar chunks. Similarity is not the same as relevance. A re-ranking step scores each retrieved chunk against the query using a cross-encoder model -- a slower but much more accurate comparison than embedding similarity alone.

In our deployments, adding a Cohere re-ranker on top of pgvector retrieval consistently improves answer accuracy by 15-30% at a modest latency cost (an additional 50-80ms per query).
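The re-ranking step itself is model-agnostic. A sketch with the cross-encoder abstracted behind a `score_fn(query, chunk) -> float` callable -- in practice that would be a call to a hosted re-rank API or a local cross-encoder, not the toy scorer shown here:

```python
def rerank(query, chunks, score_fn, top_n=3):
    """Re-order retrieved chunks by cross-encoder relevance score.

    score_fn(query, chunk) -> float stands in for a cross-encoder call;
    higher means more relevant. Keeps only the top_n best chunks.
    """
    scored = sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)
    return scored[:top_n]
```

Note the pattern: retrieve a generous top-k (say 20) cheaply with embeddings, then spend the expensive cross-encoder calls only on those candidates.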

Common Production Failures

Hallucination on out-of-scope queries

When the retrieved context does not contain the answer, a poorly prompted LLM will fabricate one. Fix: instruct the model explicitly to say "I don't know" when context is insufficient, and add a confidence threshold check on retrieval similarity scores.
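The threshold check is a one-liner worth making explicit. A sketch, assuming retrieval results carry a cosine-similarity `score` and a cutoff of 0.75 -- the right value is corpus-dependent and should be tuned against your test set:

```python
def grounded_context(results, min_score=0.75):
    """Keep only retrieval hits above a similarity threshold.

    Returns None when nothing clears the bar, so the caller can answer
    "I don't know" instead of prompting the LLM with weak context.
    """
    kept = [r for r in results if r["score"] >= min_score]
    return kept or None
```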

Stale knowledge base

Your chatbot is only as current as the last time you re-indexed your documents. Build automated ingestion pipelines from day one -- not a manual "upload documents" UI that gets forgotten after launch.

Context window overflow

Retrieving too many chunks bloats your prompt, increases cost, and paradoxically reduces quality as the model's attention dilutes. Start with top-3 chunks; increase only with evidence that more context improves answers.
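When you do grow beyond top-3, cap the context by token budget rather than chunk count. A sketch that approximates tokens as ~4 characters each -- substitute your model's real tokenizer for accurate counts:

```python
def fit_context(ranked_chunks, budget_tokens=1200):
    """Take chunks in relevance order until an approximate token budget is hit.

    Always keeps at least one chunk so the prompt is never empty.
    """
    kept, used = [], 0
    for chunk in ranked_chunks:
        cost = max(1, len(chunk) // 4)  # rough chars-to-tokens heuristic
        if kept and used + cost > budget_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept
```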

The Stack We Ship With

  • Embeddings: OpenAI text-embedding-3-small (cost-effective) or text-embedding-3-large (higher accuracy)
  • Vector store: Supabase pgvector (MVPs), Pinecone (scale)
  • Re-ranker: Cohere Rerank v3
  • LLM: Claude Sonnet (best quality/cost for conversational use cases)
  • Orchestration: LangChain LCEL for chains, raw API calls for simple retrieval

RAG done right is not glamorous engineering. It is careful data pipeline work, obsessive evaluation on a golden test set, and incremental improvements measured with real user queries. The teams that win with it are the ones willing to do the unglamorous parts.