Fine-Tuning Methods: SFT, RLHF, and DPO
Understand the three main fine-tuning methods: Supervised Fine-Tuning, RLHF, and DPO. Learn what each does, when to use it, and how the modern training pipeline works.
In the last lesson, you learned when fine-tuning makes sense. Now the question is: how do you fine-tune? There are three main methods, and they’re often used together in a pipeline. Understanding each one helps you pick the right approach for your task — and avoid overcomplicating things.
What You’ll Learn
By the end of this lesson, you’ll understand the three core fine-tuning methods (SFT, RLHF, DPO), how they differ, and how they combine in the modern training pipeline.
The Modern Training Pipeline
Here’s how most production LLMs are built in 2026:
Pre-Training → SFT → DPO (or RLHF)
Pre-Training: Train on trillions of tokens of internet text. Creates a base model that can predict the next token. This is what companies like Meta, Mistral, and Google do. You won’t do this — it costs millions.
SFT (Supervised Fine-Tuning): Train on instruction-response pairs. The base model learns to follow instructions and produce useful outputs. This is the step you’ll most likely do.
DPO (or RLHF): Align the model with human preferences. The model learns which responses are better or worse. Optional but powerful for quality.
Most practical fine-tuning starts and ends at the SFT step. DPO is the next level when you need polish.
✅ Quick Check: If you fine-tune a base model (like Llama 3.2 base) directly with DPO, what happens? (It doesn’t work well. DPO trains preferences — “this response is better than that one.” But a base model barely follows instructions at all. You need SFT first to teach instruction-following, then DPO to refine which responses are preferred.)
Supervised Fine-Tuning (SFT)
SFT is the bread and butter. You give the model examples of ideal input-output pairs, and it learns to mimic that behavior.
What Your Training Data Looks Like
```json
{
  "messages": [
    {"role": "system", "content": "You are a medical coding assistant."},
    {"role": "user", "content": "Patient presents with acute bronchitis, prescribed amoxicillin."},
    {"role": "assistant", "content": "ICD-10: J20.9 (Acute bronchitis, unspecified)\nCPT: 99213 (Office visit, established patient)\nRx: Amoxicillin 500mg TID x 10 days"}
  ]
}
```
Each example teaches the model: “When you see this kind of input, respond like this.” After training on hundreds or thousands of examples, the model generalizes the pattern.
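Before training, it pays to sanity-check every example against the chat format shown above. Here's a minimal sketch of such a validator — the function name and the specific rules it checks are illustrative assumptions, not part of any training library:

```python
# Sanity-check one SFT training example in the chat-messages format.
# Rules here are illustrative: valid roles, non-empty content, and an
# assistant message last (that's the output the model learns to produce).

VALID_ROLES = {"system", "user", "assistant"}

def validate_sft_example(example: dict) -> list[str]:
    """Return a list of problems found in one training example ([] if clean)."""
    problems = []
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]
    for i, msg in enumerate(messages):
        if msg.get("role") not in VALID_ROLES:
            problems.append(f"message {i}: unknown role {msg.get('role')!r}")
        if not (msg.get("content") or "").strip():
            problems.append(f"message {i}: empty content")
    if messages[-1].get("role") != "assistant":
        problems.append("last message should be the assistant's target response")
    return problems
```

Running this over your whole JSONL file before a training run catches the silent killers: empty completions, malformed roles, and examples with no assistant turn to learn from.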
How Many Examples Do You Need?
| Task | Minimum | Sweet Spot |
|---|---|---|
| Style/tone transfer | 50-100 | 200-500 |
| Classification | 200-500 | 1,000-2,000 |
| Instruction following | 500-1,000 | 2,000-5,000 |
| Domain specialization | 1,000+ | 5,000-10,000 |
And here’s the part that surprises people: 1,000 high-quality examples consistently outperform 10,000 mediocre ones. Data quality beats quantity. We’ll cover how to build quality datasets in Lesson 5.
When SFT Is All You Need
For most practical tasks — customer support classification, structured data extraction, format standardization — SFT alone gets you 90% of the way. You don’t need RLHF or DPO. You need clean data and a solid training run.
RLHF: Reinforcement Learning from Human Feedback
RLHF is how ChatGPT became ChatGPT. It’s the alignment step that makes models helpful, harmless, and honest. But it’s also complex, expensive, and — for most teams — overkill.
How RLHF Works
- Train a reward model: Collect human preference data (which response is better?) and train a separate model that scores responses
- Run PPO: Use Proximal Policy Optimization (an RL algorithm) to adjust the LLM weights to maximize the reward model’s score
- Iterate: Repeat, collecting more human feedback, refining the reward model
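The reward model in step 1 is typically trained with a Bradley-Terry-style loss on preference pairs: it should score the chosen response higher than the rejected one. A minimal numeric sketch (the function name is hypothetical; real implementations apply this per batch over model outputs):

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry loss for reward-model training:
    -log(sigmoid(r_chosen - r_rejected)).
    Small when the reward model already scores the preferred response higher;
    large when it disagrees with the human annotators."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The wider the correct margin, the smaller the loss:
loss_agree = preference_loss(2.0, 0.0)     # reward model agrees with humans
loss_disagree = preference_loss(0.0, 2.0)  # reward model disagrees
```

PPO then uses this trained scorer as its reward signal — which is exactly the moving part DPO eliminates.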
Why Most Teams Skip RLHF
- Requires training two models (reward model + PPO policy)
- PPO is notoriously unstable (sensitive to hyperparameters)
- Needs expensive human annotators for preference data
- The full pipeline requires significant ML engineering expertise
RLHF built GPT-4, Claude, and Llama 2-Chat. But those teams have hundreds of ML engineers and millions in compute budget. For everyone else, there’s DPO.
✅ Quick Check: What does the reward model in RLHF actually do? (It takes a prompt and a response, then outputs a scalar score indicating how “good” the response is. It’s trained on human preference data — pairs where annotators said “Response A is better than Response B.” The reward model learns to predict which responses humans prefer.)
DPO: The Practical Alternative
DPO (Direct Preference Optimization) arrived in 2023 and quickly became the standard for preference-based fine-tuning. It gives you most of RLHF’s benefits without the complexity.
How DPO Works
Instead of training a reward model and running PPO, DPO directly optimizes on preference pairs:
```json
{
  "prompt": "Explain quantum computing to a 10-year-old",
  "chosen": "Imagine a magic coin that can be heads AND tails at the same time...",
  "rejected": "Quantum computing leverages superposition and entanglement principles..."
}
```
The model learns: “Generate responses more like the chosen one, less like the rejected one.” No reward model. No RL. Standard supervised learning tools.
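Concretely, the DPO objective compares how much the policy has shifted toward the chosen response relative to a frozen copy of the SFT model (the "reference"). A minimal sketch on one preference pair — the function name is hypothetical, and the inputs stand in for the sequence log-probabilities a real trainer would compute:

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss on one preference pair.

    pi_* are log-probs of the full responses under the policy being trained;
    ref_* are log-probs under the frozen reference (usually the SFT model).
    Minimizing pushes the policy to raise the chosen response's likelihood
    relative to the rejected one, anchored to the reference by beta."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Notice it's an ordinary differentiable loss over log-probabilities — no reward model to train, no RL rollouts — which is why DPO runs on the same tooling as SFT.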
SFT + DPO: The 2026 Standard
The combination is powerful:
- SFT teaches the model what to do (follow instructions, produce the right format)
- DPO teaches the model how to do it well (prefer clear over technical, concise over verbose)
OpenAI now offers DPO fine-tuning via their API. You can run SFT + DPO on GPT-4o-mini without any GPU at all.
DPO vs. RLHF: The Comparison
| | RLHF | DPO |
|---|---|---|
| Needs reward model | Yes (separate model) | No |
| Training algorithm | PPO (RL) | Standard supervised learning |
| Stability | Sensitive to hyperparameters | More stable |
| Data required | Human preference pairs + RL episodes | Human preference pairs only |
| Compute cost | 2-3x SFT | ~1.5x SFT |
| Quality | Slightly better at scale | Matches or beats with limited compute |
| Who uses it | OpenAI, Anthropic, Meta | Everyone else (and OpenAI) |
Picking Your Method
| Your Situation | Method |
|---|---|
| “I need consistent output format” | SFT only |
| “I need domain-specific classification” | SFT only |
| “I want the model to sound like my brand” | SFT + DPO |
| “I need the model to avoid certain response patterns” | DPO (chosen vs. rejected) |
| “I’m building a general-purpose assistant” | SFT + DPO |
| “I have unlimited budget and ML engineers” | SFT + RLHF |
| “I want to use OpenAI’s API” | SFT → DPO (via API) |
For this course, we’ll focus on SFT with QLoRA — the approach that gives you the most bang for your compute dollar. In Lesson 5, we’ll also cover how to create DPO preference pairs.
Key Takeaways
- The modern training pipeline is Pre-Training → SFT → DPO (or RLHF)
- SFT teaches instruction-following via input-output example pairs — it’s the most practical method
- RLHF is powerful but complex — requires reward model + PPO, mostly used by frontier labs
- DPO achieves similar results to RLHF with standard supervised learning tools — no reward model needed
- For most tasks, SFT alone gets you 90% of the way. Add DPO when you need preference-level polish.
- Quality of training data matters more than quantity: 1,000 great examples > 10,000 mediocre ones
Up Next
You know the methods. But fine-tuning a 7B model normally requires 100+ GB of VRAM — hardware that costs $50,000. In the next lesson, you’ll learn how LoRA and QLoRA make it possible on a $0 Colab GPU.