Evaluation & Iteration
How to measure whether your fine-tuned model is actually better: held-out test sets, automated metrics, LLM-as-judge, and A/B comparisons.
🔄 In the last lesson, you trained a model and watched the loss go down. Satisfying. But here’s what that number doesn’t tell you: is the model actually better at the task you care about? Loss is a proxy — a useful one, but not the final answer.
Evaluation is where you find out if your fine-tuning actually worked. And where you figure out what to fix if it didn’t.
What You’ll Learn
By the end of this lesson, you’ll have an evaluation framework — automated metrics, LLM-as-judge setup, and a comparison protocol for deciding whether your fine-tuned model beats the baseline.
The Evaluation Stack
There’s no single metric that tells you “your model is good.” You need a stack:
| Level | What It Measures | When to Use |
|---|---|---|
| Training loss | Did the model learn from the data? | During training (sanity check) |
| Held-out metrics | Does it generalize beyond training data? | After every training run |
| LLM-as-judge | Is the output actually good? | After promising runs |
| Human evaluation | Does a person agree? | Before deployment |
| Production metrics | Does it work in the real world? | After deployment |
You don’t need all five for every experiment. But skipping straight from loss to production is how bad models reach users.
Level 1: Held-Out Test Set
You split your data 80/20 in Lesson 5. Now use that 20%.
For Classification Tasks
If your model classifies text (sentiment, intent, category), the metrics are straightforward:
```python
from sklearn.metrics import classification_report

# Run inference on the held-out test set
predictions = [generate(model, example["input"]) for example in test_set]

# Collect the ground-truth label for each test example
labels = [example["label"] for example in test_set]

# Compare predictions to ground truth
print(classification_report(
    labels,
    predictions,
    target_names=["Positive", "Negative", "Neutral"],
))
```
This gives you precision, recall, and F1 per category. Aim for F1 > 0.85 on your target categories.
For Generation Tasks
If your model generates text (responses, summaries, translations), automated metrics are harder. Common options:
| Metric | What It Measures | Good For |
|---|---|---|
| ROUGE | N-gram overlap with reference | Summaries, translations |
| BERTScore | Semantic similarity to reference | Any generation |
| Exact match | Does output exactly match expected? | Structured extraction |
| Format compliance | Does output follow the schema? | JSON, CSV, structured output |
For most practical fine-tuning, format compliance and BERTScore matter more than ROUGE. If you fine-tuned for JSON output, check: what percentage of test outputs are valid JSON? That’s your most actionable metric.
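The format-compliance check above is simple to automate. Here's a minimal sketch that computes the valid-JSON rate over a list of model output strings (`outputs` here is an assumed list of raw generations, not something from earlier code):

```python
import json

def json_validity_rate(outputs):
    """Fraction of model outputs that parse as valid JSON."""
    valid = 0
    for text in outputs:
        try:
            json.loads(text)
            valid += 1
        except json.JSONDecodeError:
            pass
    return valid / len(outputs) if outputs else 0.0

# Example: three outputs, two of which are valid JSON
rate = json_validity_rate(['{"a": 1}', 'not json', '[1, 2]'])
print(f"Format compliance: {rate:.0%}")  # Format compliance: 67%
```

Run it on the base model's outputs and the fine-tuned model's outputs over the same test set, and you have a before/after number you can track across training runs.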
✅ Quick Check: Your fine-tuned model produces valid JSON on 95% of test examples, up from 60% with the base model. Is the fine-tuning working? (Yes — that’s a significant improvement for a format consistency task. This is exactly what fine-tuning excels at. The remaining 5% failures tell you where to add more training examples.)
Level 2: LLM-as-Judge
Automated metrics catch structural issues. But they don’t tell you if a response is actually good. That’s where LLM-as-judge comes in.
The Setup
Use a strong model (GPT-4, Claude) to evaluate your fine-tuned model’s outputs:
judge_prompt = """
Rate the following AI assistant response on a scale of 1-5 for each criterion:
**Accuracy**: Is the information correct?
**Helpfulness**: Does it answer the question?
**Tone**: Is the tone appropriate (professional, friendly)?
**Format**: Does it follow the expected structure?
User question: {question}
Assistant response: {response}
Provide scores as JSON: {"accuracy": X, "helpfulness": X, "tone": X, "format": X}
"""
Run this on 50-100 test examples. The aggregate scores tell you where your model excels and where it falls short.
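Judge replies sometimes fail to parse, so the aggregation step should tolerate malformed JSON. A minimal sketch (the judge-calling code itself is omitted; `raw_judgments` is an assumed list of raw judge reply strings):

```python
import json
from collections import defaultdict

def aggregate_judge_scores(raw_judgments):
    """Average per-criterion scores from a list of judge JSON replies,
    silently skipping any reply that fails to parse."""
    totals, counts = defaultdict(float), defaultdict(int)
    for raw in raw_judgments:
        try:
            scores = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed judge reply; drop it
        for criterion, score in scores.items():
            totals[criterion] += score
            counts[criterion] += 1
    return {c: totals[c] / counts[c] for c in totals}

judgments = [
    '{"accuracy": 4, "helpfulness": 5, "tone": 3, "format": 5}',
    '{"accuracy": 5, "helpfulness": 4, "tone": 4, "format": 5}',
]
print(aggregate_judge_scores(judgments))
# {'accuracy': 4.5, 'helpfulness': 4.5, 'tone': 3.5, 'format': 5.0}
```

Averaging per criterion rather than per example is what surfaces a pattern like "accuracy is fine, tone is weak" — exactly the signal you need for the iteration loop later in this lesson.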
Side-by-Side Comparison
Even more useful: have the judge compare your fine-tuned model against the base model:
comparison_prompt = """
Two AI assistants answered the same question. Which response is better?
Question: {question}
Response A: {base_model_response}
Response B: {finetuned_response}
Which is better (A or B) and why? Consider accuracy, helpfulness, and tone.
"""
Randomize which model is A and B to avoid position bias. If the judge picks your fine-tuned model 70%+ of the time, you’re making progress.
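The randomization is easy to get wrong: you have to remember which slot each model landed in so you can map the judge's "A" or "B" verdict back. A sketch, where `judge` is an assumed callable (wrapping your GPT-4/Claude call) that returns `"A"` or `"B"`:

```python
import random

def compare_pair(question, base_resp, ft_resp, judge, rng=random):
    """Randomize which model appears as Response A, then map the
    judge's verdict back to 'base' or 'finetuned'."""
    if rng.random() < 0.5:
        a, b, a_is_ft = base_resp, ft_resp, False
    else:
        a, b, a_is_ft = ft_resp, base_resp, True
    verdict = judge(question, a, b)
    if verdict == "A":
        return "finetuned" if a_is_ft else "base"
    return "base" if a_is_ft else "finetuned"

# Toy judge (for illustration only) that always prefers the longer response
longer = lambda q, a, b: "A" if len(a) >= len(b) else "B"
wins = sum(
    compare_pair("q", "short", "a much longer reply", longer) == "finetuned"
    for _ in range(100)
)
print(f"Fine-tuned win rate: {wins}%")  # Fine-tuned win rate: 100%
```

The toy judge also illustrates verbosity bias in miniature: it picks the longer answer every time, no matter which slot it sits in.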
Calibrating the Judge
LLM judges aren’t perfect. They have biases:
- Verbosity bias — Longer responses tend to score higher
- Position bias — The first response tends to score higher
- Self-enhancement — GPT-4 rates GPT-4-style responses higher
Mitigation: Run 20 comparisons with humans too. If the judge and humans agree 80%+ of the time, the judge is reliable enough for your use case.
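The calibration check is just an agreement rate over paired verdicts. A minimal sketch, assuming you've collected the judge's and a human's winner ("A" or "B") for the same comparisons:

```python
def agreement_rate(judge_verdicts, human_verdicts):
    """Fraction of comparisons where judge and human pick the same winner."""
    assert len(judge_verdicts) == len(human_verdicts)
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)

judge = ["B", "A", "B", "B", "A"]
human = ["B", "A", "A", "B", "A"]
print(f"Judge-human agreement: {agreement_rate(judge, human):.0%}")  # 80%
```

If the number comes in well under 80%, tighten the judge prompt (clearer criteria, a few scored examples) before trusting the judge's aggregate scores.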
Level 3: The A/B Framework
For systematic comparison, run this protocol:
Step 1: Select 50 diverse test prompts (not cherry-picked)
Step 2: Generate responses from three sources:
- Base model (no fine-tuning)
- Fine-tuned model (your latest version)
- GPT-4/Claude (reference ceiling)
Step 3: Score each response (LLM-as-judge + human spot-check)
Step 4: Compare:
| Metric | Base Model | Fine-Tuned | GPT-4 Reference |
|---|---|---|---|
| Accuracy | 3.2 | 4.1 | 4.6 |
| Format compliance | 60% | 95% | 98% |
| Tone match | 2.8 | 4.3 | 3.5 |
| Avg judge score | 3.0 | 4.2 | 4.3 |
If your fine-tuned 3B model scores within 80-90% of GPT-4 on your specific task, that’s a major win — you’re getting near-GPT-4 quality at 10-100x lower cost and latency.
When Things Go Wrong
Common problems and fixes:
| Symptom | Likely Cause | Fix |
|---|---|---|
| Good on training, bad on test | Overfitting | Reduce epochs, increase data diversity, add dropout |
| Bad on everything | Insufficient data or wrong format | Check data quality, increase to 500+ examples |
| Great accuracy, wrong tone | Training data has right answers but wrong style | Add DPO preference pairs for tone |
| Model refuses to answer | Safety training from base model conflicting | Add examples where the model answers similar questions |
| Output cuts off mid-sentence | max_seq_length too short | Increase to 2048 or 4096 |
✅ Quick Check: Your model scores 4.5/5 on accuracy but 2.1/5 on tone in judge evaluation. You already have 1,000 SFT examples. What’s the next step? (Add DPO preference pairs. Create 200-300 pairs where the “chosen” response has the right tone and the “rejected” has the wrong tone. SFT teaches what to say; DPO teaches how to say it.)
The Iteration Loop
Fine-tuning is rarely one-and-done. Here’s the practical loop:
- Train → Evaluate → Identify weakest dimension
- Fix data → Add examples targeting the weak area
- Retrain → Evaluate again → Compare to previous version
- Repeat → Until evaluation metrics plateau
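The loop above can be sketched as a plateau-detecting driver. This is illustrative scaffolding, not a real training harness: `train` and `evaluate` are assumed callables you'd supply (`train()` returns a model, `evaluate(model)` returns a dict of per-dimension scores):

```python
def iterate(train, evaluate, max_rounds=5, min_gain=0.01):
    """Retrain until the average evaluation score plateaus.
    Returns the per-round average scores."""
    best = None
    history = []
    for round_num in range(1, max_rounds + 1):
        model = train()
        scores = evaluate(model)
        avg = sum(scores.values()) / len(scores)
        weakest = min(scores, key=scores.get)  # dimension to target with new data
        history.append(avg)
        print(f"Round {round_num}: avg={avg:.2f}, weakest dimension={weakest}")
        if best is not None and avg - best < min_gain:
            break  # plateau: the fix is better data, not more rounds
        best = avg
    return history
```

The `weakest` dimension printed each round is what drives the "fix data" step: add examples targeting that dimension before retraining.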
Most projects converge in 2-4 iterations. If you’re past iteration 5 with no improvement, the problem is usually data quality — not model size or hyperparameters.
Key Takeaways
- Training loss is a sanity check, not a quality measure — always evaluate on held-out data
- Use format compliance and BERTScore for generation tasks, F1 for classification
- LLM-as-judge (GPT-4/Claude) scales evaluation to hundreds of examples affordably
- Side-by-side comparisons (fine-tuned vs. base vs. reference) give the clearest signal
- Calibrate your judge against human ratings on a 20-example sample
- When metrics plateau, iterate on data quality — add examples targeting the weakest dimension
- 2-4 iterations is typical before a model is production-ready
Up Next
Your model passes evaluation. Now what? In the final lesson, you’ll learn production deployment — merging adapters, choosing a serving strategy, calculating costs, monitoring drift, and real-world SLM use cases that justify the whole pipeline.