Why Fine-Tune? The Decision Framework
Learn when to fine-tune an LLM vs. use RAG vs. prompt engineering. A practical decision framework with real cost comparisons.
Here’s a question that trips up most teams: you’ve got an LLM that’s almost good enough for your use case. The output format is inconsistent. The tone doesn’t match your brand. It hallucinates on domain-specific terms. Do you fine-tune, build a RAG pipeline, or write a better prompt?
Most people guess wrong. And “guess wrong” means burning weeks of engineering time on a solution that doesn’t move the needle.
What You’ll Learn
By the end of this lesson, you’ll have a clear decision framework for when to fine-tune, when to use RAG, and when prompt engineering is all you need.
Objectives
- Evaluate the three approaches (prompt engineering, RAG, fine-tuning) against specific criteria
- Identify which problems fine-tuning actually solves vs. which it doesn’t
How This Course Works
This is an 8-lesson course that takes you from zero to a working fine-tuned model. Each lesson builds on the previous one. You’ll need a Google account (for Colab) and comfort with Python — but no ML experience. We explain everything from first principles.
The Three Approaches
| Approach | What It Changes | Time | Cost | Best For |
|---|---|---|---|---|
| Prompt Engineering | How you ask | Hours | ~$0 | Quick wins, flexible tasks |
| RAG | What the model knows | Days-weeks | $70-1,000/mo | Dynamic knowledge, real-time data |
| Fine-Tuning | How the model behaves | Weeks | High upfront, lower inference | Style, format, classification |
That table is a starting point. But the real decision is messier. Let’s look at each one.
Prompt Engineering: When It’s Enough
Prompt engineering is your first move. Always. It’s free, it’s fast, and for many tasks, it’s all you need.
Good prompt engineering handles:
- Few-shot examples for output formatting
- System prompts for tone and persona
- Chain-of-thought for reasoning tasks
- JSON mode for structured output
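To make the first two techniques concrete, here's a minimal sketch of assembling a few-shot prompt that nudges a model toward consistent JSON output. The example data and prompt wording are illustrative, not tied to any specific API:

```python
# Sketch: building a few-shot prompt for consistent JSON output.
# The labels and instruction text below are made up for illustration.

FEW_SHOT = [
    ("I love this product!", '{"sentiment": "positive"}'),
    ("Arrived broken, very disappointed.", '{"sentiment": "negative"}'),
]

def build_prompt(user_input: str) -> str:
    """Combine a system instruction, few-shot pairs, and the new input."""
    lines = ['Classify sentiment. Respond with JSON only: {"sentiment": ...}']
    for text, label in FEW_SHOT:
        lines.append(f"Input: {text}\nOutput: {label}")
    lines.append(f"Input: {user_input}\nOutput:")
    return "\n\n".join(lines)

prompt = build_prompt("The checkout flow keeps crashing.")
```

The pattern is simple: instruction, a couple of worked examples, then the new input. It's free and fast to iterate on, which is exactly why it's the first move.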
Where prompt engineering breaks down:
- Consistency at scale — the model “forgets” instructions on the 500th call
- Token costs — long system prompts eat your budget on every API call
- Latency — 2,000 tokens of prompt context adds ~1s of processing
- Style drift — the model gradually reverts to its default voice
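The token-cost point is worth putting in numbers. A back-of-envelope calculation (the per-token price here is an assumption for illustration, not a quote from any provider):

```python
# Back-of-envelope cost of a long system prompt re-sent on every call.
# $3 per 1M input tokens is an assumed price for illustration only.
PRICE_PER_TOKEN = 3.00 / 1_000_000

system_prompt_tokens = 2_000      # long instructions, repeated each call
calls_per_month = 1_000_000

monthly_overhead = system_prompt_tokens * calls_per_month * PRICE_PER_TOKEN
print(f"${monthly_overhead:,.0f}/month just for the system prompt")
```

At a million calls a month, a 2,000-token system prompt alone costs thousands of dollars — overhead a fine-tuned model can eliminate by baking those instructions into the weights.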
✅ Quick Check: If your model gives inconsistent JSON formatting across 1,000 API calls despite detailed instructions, is that a prompt engineering problem or a fine-tuning problem? (Fine-tuning. Inconsistent behavior at scale is exactly what fine-tuning fixes — it bakes the desired format into the model weights so you don’t need to re-explain it every call.)
RAG: When Knowledge Is the Problem
RAG injects documents into the prompt at query time. It answers the question: “How do I get the model to know about my stuff?”
RAG wins when:
- Data updates frequently (product catalog, regulations, news)
- You need source attribution (“According to policy document 4.2…”)
- Your knowledge base is too large to fit in the context window (below ~200K tokens, just using long context may be simpler)
- You’re building Q&A over documents
RAG loses when:
- The problem is behavior, not knowledge
- You need the model to consistently format output
- Classification accuracy matters more than knowledge breadth
- Latency is critical (RAG adds retrieval + embedding time)
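The core RAG loop — retrieve relevant text, then inject it into the prompt — can be sketched in a few lines. Real systems use embedding similarity and a vector store; this toy version uses word overlap, and the documents are made up for illustration:

```python
# Minimal RAG sketch: retrieve the most relevant document, then inject
# it into the prompt. Production systems replace retrieve() with
# embedding similarity search; word overlap stands in for it here.

DOCS = {
    "returns": "Items may be returned within 30 days with a receipt.",
    "shipping": "Standard shipping takes 3-5 business days.",
}

def retrieve(query: str) -> str:
    """Pick the document sharing the most words with the query."""
    q = set(query.lower().split())
    return max(DOCS.values(), key=lambda d: len(q & set(d.lower().split())))

def build_rag_prompt(query: str) -> str:
    context = retrieve(query)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Note what this does and doesn't change: the model now *sees* your documents at query time, but nothing about how it writes, formats, or classifies has changed. That's the knowledge/behavior split in action.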
Fine-Tuning: When Behavior Is the Problem
Fine-tuning changes the model’s weights. It doesn’t add new knowledge in the way you might expect — it changes how the model behaves.
Fine-tuning wins when:
- You need consistent output format/schema (always JSON, always specific fields)
- You want specialized style (brand voice, medical terminology, legal tone)
- Classification tasks (sentiment, intent routing, content moderation)
- You can replace a large model with a smaller fine-tuned one (cost + latency savings)
- Privacy requires an on-premise model (no data leaves your servers)
Fine-tuning loses when:
- Your data changes frequently (that’s RAG’s job)
- You need general-purpose flexibility (fine-tuning narrows capability)
- You have fewer than 50-100 high-quality examples
- The problem is solvable with a better prompt (test this first!)
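Those 50-100+ high-quality examples usually live in a JSONL file, one training example per line. This sketch uses the chat-style `messages` format popularized by OpenAI's fine-tuning API — a common convention, but check your provider's docs for the exact schema:

```python
import json

# Sketch: a fine-tuning dataset in JSONL chat format (one example per
# line). Field names follow the OpenAI-style convention; the example
# content below is invented for illustration.

examples = [
    {"messages": [
        {"role": "system", "content": "You are a support agent. Reply in brand voice."},
        {"role": "user", "content": "My order is late."},
        {"role": "assistant", "content": "Yikes, that's on us! Let's sort it out right now."},
    ]},
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Sanity check: every line must be valid JSON with a "messages" list.
with open("train.jsonl") as f:
    for line in f:
        assert isinstance(json.loads(line)["messages"], list)
```

Quality beats quantity here: a few hundred carefully written examples of ideal responses typically outperform thousands of noisy ones.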
✅ Quick Check: You want your LLM to always respond in your company’s brand voice — informal, witty, uses specific industry jargon. Is this a fine-tuning or RAG problem? (Fine-tuning. Brand voice is behavior, not knowledge. You’d fine-tune on examples of ideal responses in your brand voice. RAG can’t teach tone.)
The Hybrid Approach: 2026 Best Practice
The smartest teams don’t choose one approach — they combine them:
- Fine-tune for style, behavior, and format
- RAG for up-to-date factual knowledge
- Prompt engineering for task-specific instructions
Real example: A customer support system that:
- Fine-tunes Mistral 7B on 2,000 examples of ideal support responses (tone, format)
- Uses RAG to pull product documentation and pricing at query time
- System prompt specifies the current promotion or seasonal context
This hybrid approach is why a fine-tuned 7B model with RAG often beats a prompted 70B model — lower cost, lower latency, better task performance.
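The three layers of that hybrid compose into a single, short prompt. Here's a sketch of just the composition logic — `retrieve()` is a placeholder for a real retriever, and the fine-tuned model (which supplies tone and format) sits outside this snippet:

```python
# Sketch of the hybrid pattern: one prompt assembled from three layers.
# retrieve() is a stand-in for a real retriever (embeddings + vector
# search); the response text and promotion are invented for illustration.

def retrieve(query: str) -> str:
    # Placeholder: a real system would do embedding similarity search.
    return "Pro plan: $29/mo, includes priority support."

def build_hybrid_prompt(query: str, promotion: str) -> str:
    system = f"Current promotion: {promotion}"   # prompt engineering layer
    context = retrieve(query)                    # RAG layer
    # The fine-tuned model handles tone/format, so the prompt stays short.
    return f"{system}\n\nContext: {context}\n\nCustomer: {query}"

prompt = build_hybrid_prompt("How much is the Pro plan?", "20% off annual plans")
```

Because the fine-tuned model already knows the brand voice and output format, the runtime prompt only carries what actually changes: fresh knowledge and the current context.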
The Numbers That Matter
| Metric | Prompt Eng. | RAG | Fine-Tuning |
|---|---|---|---|
| Setup time | Hours | 1-2 weeks | 2-4 weeks |
| Ongoing cost | High (long prompts) | Medium (embeddings + retrieval) | Low (shorter prompts, smaller model) |
| Upfront cost | ~$0 | $500-5,000 (infrastructure) | $50-5,000 (compute) |
| Latency | Baseline + prompt tokens | +200-500ms (retrieval) | Same or faster (smaller model) |
| Knowledge freshness | Real-time | Near real-time | Static (needs retraining) |
| Behavior consistency | Medium | Medium | High |
Key Takeaways
- Fine-tuning changes behavior (style, format, classification). RAG changes knowledge. Don’t confuse them.
- Always try prompt engineering first — if a better prompt solves it, don’t fine-tune
- The hybrid approach (fine-tune for behavior + RAG for knowledge) is the 2026 standard
- Fine-tuned 3B-7B models regularly outperform 70B+ base models on specific tasks
- Fine-tuning costs are dropping fast — QLoRA on a free Colab GPU is now possible
Up Next
Now that you know when to fine-tune, let’s learn how. In the next lesson, we’ll break down the methods: Supervised Fine-Tuning, RLHF, and DPO — what each does, how they differ, and when you’d pick one over another.