Fine-Tuning Methods: SFT, RLHF, and DPO
Understand the three main fine-tuning methods: Supervised Fine-Tuning, RLHF, and DPO. Learn what each does, when to use it, and how the modern training pipeline works.
In the last lesson, you learned when fine-tuning makes sense. Now the question is: how do you fine-tune? There are three main methods, and they’re often used together in a pipeline. Understanding each one helps you pick the right approach for your task — and avoid overcomplicating things.
What You’ll Learn
By the end of this lesson, you’ll understand the three core fine-tuning methods (SFT, RLHF, DPO), how they differ, and how they combine in the modern training pipeline.
The Modern Training Pipeline
Here’s how most production LLMs are built in 2026:
Pre-Training → SFT → DPO (or RLHF)
Pre-Training: Train on trillions of tokens of internet text. Creates a base model that can predict the next token. This is what companies like Meta, Mistral, and Google do. You won’t do this — it costs millions.
SFT (Supervised Fine-Tuning): Train on instruction-response pairs. The base model learns to follow instructions and produce useful outputs. This is the step you’ll most likely do.
DPO (or RLHF): Align the model with human preferences. The model learns which responses are better or worse. Optional but powerful for quality.
Most practical fine-tuning starts and ends at the SFT step. DPO is the next level when you need polish.
✅ Quick Check: If you fine-tune a base model (like Llama 3.2 base) directly with DPO, what happens? (It doesn’t work well. DPO trains preferences — “this response is better than that one.” But a base model barely follows instructions at all. You need SFT first to teach instruction-following, then DPO to refine which responses are preferred.)
Supervised Fine-Tuning (SFT)
SFT is the bread and butter. You give the model examples of ideal input-output pairs, and it learns to mimic that behavior.
What Your Training Data Looks Like
```json
{
  "messages": [
    {"role": "system", "content": "You are a medical coding assistant."},
    {"role": "user", "content": "Patient presents with acute bronchitis, prescribed amoxicillin."},
    {"role": "assistant", "content": "ICD-10: J20.9 (Acute bronchitis, unspecified)\nCPT: 99213 (Office visit, established patient)\nRx: Amoxicillin 500mg TID x 10 days"}
  ]
}
```
Each example teaches the model: “When you see this kind of input, respond like this.” After training on hundreds or thousands of examples, the model generalizes the pattern.
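Before training, it pays to sanity-check every example against the chat format shown above. Here's a minimal sketch of such a validator — the function name and the specific rules it checks are illustrative assumptions, not part of any training library:

```python
# Sanity-check one SFT training example in the chat-messages format.
# Rules here are illustrative: valid roles, non-empty content, and an
# assistant message last (that's the output the model learns to produce).

VALID_ROLES = {"system", "user", "assistant"}

def validate_sft_example(example: dict) -> list[str]:
    """Return a list of problems found in one training example ([] if clean)."""
    problems = []
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    for i, msg in enumerate(messages):
        if msg.get("role") not in VALID_ROLES:
            problems.append(f"message {i}: unknown role {msg.get('role')!r}")
        if not (msg.get("content") or "").strip():
            problems.append(f"message {i}: empty content")
    if messages[-1].get("role") != "assistant":
        problems.append("last message should be the assistant's target response")
    return problems
```

Running this over your whole JSONL file before a training run catches the silent killers: empty completions, malformed roles, and examples with no assistant turn to learn from.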
How Many Examples Do You Need?
| Task | Minimum | Sweet Spot |
|---|---|---|
| Style/tone transfer | 50-100 | 200-500 |
| Classification | 200-500 | 1,000-2,000 |
| Instruction following | 500-1,000 | 2,000-5,000 |
| Domain specialization | 1,000+ | 5,000-10,000 |
And here’s the part that surprises people: 1,000 high-quality examples consistently outperform 10,000 mediocre ones. Data quality beats quantity. We’ll cover how to build quality datasets in Lesson 5.
When SFT Is All You Need
For most practical tasks — customer support classification, structured data extraction, format standardization — SFT alone gets you 90% of the way. You don’t need RLHF or DPO. You need clean data and a solid training run.
RLHF: Reinforcement Learning from Human Feedback
RLHF is how ChatGPT became ChatGPT. It’s the alignment step that makes models helpful, harmless, and honest. But it’s also complex, expensive, and — for most teams — overkill.
How RLHF Works
- Train a reward model: Collect human preference data (which response is better?) and train a separate model that scores responses
- Run PPO: Use Proximal Policy Optimization (an RL algorithm) to adjust the LLM weights to maximize the reward model’s score
- Iterate: Repeat, collecting more human feedback, refining the reward model
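The reward model in step 1 is typically trained with a Bradley-Terry-style loss on preference pairs: it should score the chosen response higher than the rejected one. A minimal numeric sketch (the function name is hypothetical; real implementations apply this per batch over model outputs):

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry loss for reward-model training:
    -log(sigmoid(r_chosen - r_rejected)).
    Small when the reward model already scores the preferred response higher;
    large when it disagrees with the human annotators."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The wider the correct margin, the smaller the loss:
loss_agree = preference_loss(2.0, 0.0)     # reward model agrees with humans
loss_disagree = preference_loss(0.0, 2.0)  # reward model disagrees
```

PPO then uses this trained scorer as its reward signal — which is exactly the moving part DPO eliminates.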
Why Most Teams Skip RLHF
- Requires training two models (reward model + PPO policy)
- PPO is notoriously unstable (sensitive to hyperparameters)
- Needs expensive human annotators for preference data
- The full pipeline requires significant ML engineering expertise
RLHF built GPT-4, Claude, and Llama 2-Chat. But those teams have hundreds of ML engineers and millions in compute budget. For everyone else, there’s DPO.
✅ Quick Check: What does the reward model in RLHF actually do? (It takes a prompt and a response, then outputs a scalar score indicating how “good” the response is. It’s trained on human preference data — pairs where annotators said “Response A is better than Response B.” The reward model learns to predict which responses humans prefer.)
DPO: The Practical Alternative
DPO (Direct Preference Optimization) arrived in 2023 and quickly became the standard for preference-based fine-tuning. It gives you most of RLHF’s benefits without the complexity.
How DPO Works
Instead of training a reward model and running PPO, DPO directly optimizes on preference pairs:
```json
{
  "prompt": "Explain quantum computing to a 10-year-old",
  "chosen": "Imagine a magic coin that can be heads AND tails at the same time...",
  "rejected": "Quantum computing leverages superposition and entanglement principles..."
}
```
The model learns: “Generate responses more like the chosen one, less like the rejected one.” No reward model. No RL. Standard supervised learning tools.
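Concretely, the DPO objective compares how much the policy has shifted toward the chosen response relative to a frozen copy of the SFT model (the "reference"). A minimal sketch on one preference pair — the function name is hypothetical, and the inputs stand in for the sequence log-probabilities a real trainer would compute:

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss on one preference pair.

    pi_* are log-probs of the full responses under the policy being trained;
    ref_* are log-probs under the frozen reference (usually the SFT model).
    Minimizing pushes the policy to raise the chosen response's likelihood
    relative to the rejected one, anchored to the reference by beta."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Notice it's an ordinary differentiable loss over log-probabilities — no reward model to train, no RL rollouts — which is why DPO runs on the same tooling as SFT.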
SFT + DPO: The 2026 Standard
The combination is powerful:
- SFT teaches the model what to do (follow instructions, produce the right format)
- DPO teaches the model how to do it well (prefer clear over technical, concise over verbose)
OpenAI now offers DPO fine-tuning via their API. You can run SFT + DPO on GPT-4o-mini without any GPU at all.
DPO vs. RLHF: The Comparison
| | RLHF | DPO |
|---|---|---|
| Needs reward model | Yes (separate model) | No |
| Training algorithm | PPO (RL) | Standard supervised learning |
| Stability | Sensitive to hyperparameters | More stable |
| Data required | Human preference pairs + RL episodes | Human preference pairs only |
| Compute cost | 2-3x SFT | ~1.5x SFT |
| Quality | Slightly better at scale | Matches or beats with limited compute |
| Who uses it | OpenAI, Anthropic, Meta | Everyone else (and OpenAI) |
Picking Your Method
| Your Situation | Method |
|---|---|
| “I need consistent output format” | SFT only |
| “I need domain-specific classification” | SFT only |
| “I want the model to sound like my brand” | SFT + DPO |
| “I need the model to avoid certain response patterns” | DPO (chosen vs. rejected) |
| “I’m building a general-purpose assistant” | SFT + DPO |
| “I have unlimited budget and ML engineers” | SFT + RLHF |
| “I want to use OpenAI’s API” | SFT → DPO (via API) |
For this course, we’ll focus on SFT with QLoRA — the approach that gives you the most bang for your compute dollar. In Lesson 5, we’ll also cover how to create DPO preference pairs.
Key Takeaways
- The modern training pipeline is Pre-Training → SFT → DPO (or RLHF)
- SFT teaches instruction-following via input-output example pairs — it’s the most practical method
- RLHF is powerful but complex — requires reward model + PPO, mostly used by frontier labs
- DPO achieves similar results to RLHF with standard supervised learning tools — no reward model needed
- For most tasks, SFT alone gets you 90% of the way. Add DPO when you need preference-level polish.
- Quality of training data matters more than quantity: 1,000 great examples > 10,000 mediocre ones
Up Next
You know the methods. But fine-tuning a 7B model normally requires 100+ GB of VRAM — hardware that costs $50,000. In the next lesson, you’ll learn how LoRA and QLoRA make it possible on a $0 Colab GPU.