Every product team building an AI feature eventually faces this question. The instinct is usually to fine-tune -- it sounds more rigorous, more "custom," more like you have built something proprietary. The reality is that fine-tuning is the right answer far less often than the question is asked.
What Each Technique Actually Does
RAG (Retrieval-Augmented Generation)
RAG leaves the base model unchanged. It adds a retrieval step that fetches relevant documents from your knowledge base and injects them into the prompt before generation. The model uses that injected context to answer.
Think of it as giving the model a textbook to look things up in. The model's reasoning ability is unchanged; only its access to information expands.
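Mechanically, a minimal RAG loop is just retrieve-then-prompt. The sketch below is illustrative only: it scores documents by naive keyword overlap over an in-memory list (real systems use embedding search over a vector store), and the assembled prompt would then be sent to whatever model API you use.

```python
import re

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by words shared with the query (toy scoring for illustration)."""
    q_words = set(re.findall(r"\w+", query.lower()))
    def score(doc: str) -> int:
        return len(q_words & set(re.findall(r"\w+", doc.lower())))
    return sorted(docs, key=score, reverse=True)[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Inject the retrieved context into the prompt ahead of the question."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "Refunds are available within 30 days of purchase.",
    "Our office is closed on public holidays.",
    "Enterprise plans include priority support.",
]
prompt = build_prompt("What is the refund window after purchase?", docs)
print(prompt)
```

Note that the model's weights are never touched: changing what the system knows is just a matter of editing `docs`.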
Fine-Tuning
Fine-tuning adjusts the weights of the model itself using your training data. It changes how the model reasons, responds, and formats output -- not just what information it has access to.
Think of it as teaching the model new skills or instilling new habits of response.
The Decision Framework
We run every client through a five-question framework before recommending an approach.
Question 1: Is the problem about knowledge or behaviour?
If your AI feature needs to answer questions based on your company's proprietary data (docs, products, policies), that is a knowledge problem. RAG solves knowledge problems.
If your feature needs to respond in a very specific style, follow complex domain-specific reasoning patterns, or perform a task the base model is bad at (medical coding, legal clause classification), that is a behaviour problem. Fine-tuning solves behaviour problems.
Question 2: How often does the underlying data change?
RAG is updated by updating your knowledge base -- immediate, no retraining. Fine-tuning is updated by running another training job -- expensive, slow, and requires re-evaluation.
If your knowledge base changes weekly or monthly, RAG is the only practical option. Fine-tuning on frequently changing data is a maintenance trap.
Question 3: Do you have high-quality labelled training data?
Fine-tuning requires labelled examples: input-output pairs that demonstrate the correct behaviour. Getting 500-1,000 high-quality examples is not trivial. Getting 10,000 is a project in itself.
RAG requires no labelled data -- just your source documents in a searchable format.
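To make "labelled examples" concrete: supervised fine-tuning data is typically a file of input-output pairs, one per line. The chat-message JSONL shape below is one common format (the legal-clause labels are invented for illustration); check your provider's documentation for the exact schema it expects.

```python
import json

# Each training example pairs an input with the exact output we want
# the fine-tuned model to learn to produce.
examples = [
    {"messages": [
        {"role": "user", "content": "Classify this clause: 'Either party may terminate with 30 days notice.'"},
        {"role": "assistant", "content": "termination"},
    ]},
    {"messages": [
        {"role": "user", "content": "Classify this clause: 'This agreement is governed by the laws of Delaware.'"},
        {"role": "assistant", "content": "governing_law"},
    ]},
]

# Serialise to JSONL: one JSON object per line.
jsonl = "\n".join(json.dumps(ex) for ex in examples)
print(jsonl)
```

Multiply this by 500-1,000 examples, each reviewed for correctness and consistency, and the data-collection cost of fine-tuning becomes clear.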
Question 4: What is your time-to-ship requirement?
RAG can be prototyped in a day and shipped in two weeks. Fine-tuning takes weeks of data preparation, training, evaluation, and iteration before it is production-ready.
For a 90-day MVP, fine-tuning is almost never in scope unless it is the core product -- not a feature of the product.
Question 5: Does the model need to handle out-of-distribution inputs reliably?
Fine-tuned models can be brittle -- they perform well on inputs that look like their training data and can degrade unexpectedly on inputs that do not. RAG degrades more gracefully: if no relevant context is retrieved, it says so.
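That fallback behaviour is cheap to build into a RAG pipeline: if nothing in the knowledge base clears a relevance threshold, refuse to answer rather than guess. A sketch, again with toy keyword scoring and an arbitrary threshold (both are illustrative assumptions):

```python
import re

def answer(query: str, docs: list[str], min_overlap: int = 2) -> str:
    """Return a grounded answer, or an explicit refusal when retrieval finds nothing relevant."""
    q_words = set(re.findall(r"\w+", query.lower()))
    def score(doc: str) -> int:
        return len(q_words & set(re.findall(r"\w+", doc.lower())))
    best = max(docs, key=score)
    if score(best) < min_overlap:
        return "No relevant context found -- cannot answer."
    return f"Based on: {best}"

docs = ["Refunds are available within 30 days of purchase."]
print(answer("How many days for a refund after purchase?", docs))
print(answer("What colour is the logo?", docs))
```

The second query falls outside the knowledge base, and the system says so instead of degrading silently.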
When Fine-Tuning Is the Right Answer
Fine-tuning makes sense when:
- You have a narrow, well-defined task with consistent input-output patterns
- The base model consistently fails at this task (not just occasionally)
- You have or can generate high-quality training data at scale
- Latency is critical and you need a smaller, faster model that performs like a large one on your specific task
- You need consistent output formatting that prompt engineering cannot reliably enforce
Real examples from our work: a fine-tuned model for ICD-10 medical code suggestion (narrow task, large labelled dataset, quality-critical); a fine-tuned model for extracting structured data from a specific legal document format (well-defined schema, consistent input type).
The Hybrid Approach
The most powerful production systems often combine both: a fine-tuned model that has learned to reason well about your domain, grounded with RAG for up-to-date factual retrieval. Fine-tuning handles the reasoning style; RAG handles the facts.
But this is not a starting point. Start with RAG on a capable base model (Claude Sonnet, GPT-4o). Measure. If behaviour -- not knowledge -- turns out to be the bottleneck, then add fine-tuning.
Default to RAG. Reach for fine-tuning only when RAG has demonstrably failed to solve the problem. In our experience across 13 products, that happens about 20% of the time.