Evaluation & Iteration
How to measure whether your fine-tuned model is actually better: held-out test sets, automated metrics, LLM-as-judge, and A/B comparisons.
🔄 In the last lesson, you trained a model and watched the loss go down. Satisfying. But here’s what that number doesn’t tell you: is the model actually better at the task you care about? Loss is a proxy — a useful one, but not the final answer.
Evaluation is where you find out if your fine-tuning actually worked. And where you figure out what to fix if it didn’t.
What You’ll Learn
By the end of this lesson, you’ll have an evaluation framework — automated metrics, LLM-as-judge setup, and a comparison protocol for deciding whether your fine-tuned model beats the baseline.
The Evaluation Stack
There’s no single metric that tells you “your model is good.” You need a stack:
| Level | What It Measures | When to Use |
|---|---|---|
| Training loss | Did the model learn from the data? | During training (sanity check) |
| Held-out metrics | Does it generalize beyond training data? | After every training run |
| LLM-as-judge | Is the output actually good? | After promising runs |
| Human evaluation | Does a person agree? | Before deployment |
| Production metrics | Does it work in the real world? | After deployment |
You don’t need all five for every experiment. But skipping straight from loss to production is how bad models reach users.
Level 1: Held-Out Test Set
You split your data 80/20 in Lesson 5. Now use that 20%.
For Classification Tasks
If your model classifies text (sentiment, intent, category), the metrics are straightforward:
```python
from sklearn.metrics import classification_report

# Run inference on the held-out test set
predictions = [generate(model, example["input"]) for example in test_set]

# Collect the ground-truth label for each test example
labels = [example["label"] for example in test_set]

# Compare predictions to ground truth
print(classification_report(
    labels,
    predictions,
    target_names=["Positive", "Negative", "Neutral"],
))
```
This gives you precision, recall, and F1 per category. Aim for F1 > 0.85 on your target categories.
For Generation Tasks
If your model generates text (responses, summaries, translations), automated metrics are harder. Common options:
| Metric | What It Measures | Good For |
|---|---|---|
| ROUGE | N-gram overlap with reference | Summaries, translations |
| BERTScore | Semantic similarity to reference | Any generation |
| Exact match | Does output exactly match expected? | Structured extraction |
| Format compliance | Does output follow the schema? | JSON, CSV, structured output |
For most practical fine-tuning, format compliance and BERTScore matter more than ROUGE. If you fine-tuned for JSON output, check: what percentage of test outputs are valid JSON? That’s your most actionable metric.
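The format-compliance check above is simple to automate. Here's a minimal sketch that computes the valid-JSON rate over a list of model output strings (`outputs` here is an assumed list of raw generations, not something from earlier code):

```python
import json

def json_validity_rate(outputs):
    """Fraction of model outputs that parse as valid JSON."""
    valid = 0
    for text in outputs:
        try:
            json.loads(text)
            valid += 1
        except json.JSONDecodeError:
            pass
    return valid / len(outputs) if outputs else 0.0

# Example: three outputs, two of which are valid JSON
rate = json_validity_rate(['{"a": 1}', 'not json', '[1, 2]'])
print(f"Format compliance: {rate:.0%}")  # Format compliance: 67%
```

Run it on the base model's outputs and the fine-tuned model's outputs over the same test set, and you have a before/after number you can track across training runs.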
✅ Quick Check: Your fine-tuned model produces valid JSON on 95% of test examples, up from 60% with the base model. Is the fine-tuning working? (Yes — that’s a significant improvement for a format consistency task. This is exactly what fine-tuning excels at. The remaining 5% failures tell you where to add more training examples.)
Level 2: LLM-as-Judge
Automated metrics catch structural issues. But they don’t tell you if a response is actually good. That’s where LLM-as-judge comes in.
The Setup
Use a strong model (GPT-4, Claude) to evaluate your fine-tuned model’s outputs:
judge_prompt = """
Rate the following AI assistant response on a scale of 1-5 for each criterion:
**Accuracy**: Is the information correct?
**Helpfulness**: Does it answer the question?
**Tone**: Is the tone appropriate (professional, friendly)?
**Format**: Does it follow the expected structure?
User question: {question}
Assistant response: {response}
Provide scores as JSON: {"accuracy": X, "helpfulness": X, "tone": X, "format": X}
"""
Run this on 50-100 test examples. The aggregate scores tell you where your model excels and where it falls short.
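Judge replies sometimes fail to parse, so the aggregation step should tolerate malformed JSON. A minimal sketch (the judge-calling code itself is omitted; `raw_judgments` is an assumed list of raw judge reply strings):

```python
import json
from collections import defaultdict

def aggregate_judge_scores(raw_judgments):
    """Average per-criterion scores from a list of judge JSON replies,
    silently skipping any reply that fails to parse."""
    totals, counts = defaultdict(float), defaultdict(int)
    for raw in raw_judgments:
        try:
            scores = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed judge reply; drop it
        for criterion, score in scores.items():
            totals[criterion] += score
            counts[criterion] += 1
    return {c: totals[c] / counts[c] for c in totals}

judgments = [
    '{"accuracy": 4, "helpfulness": 5, "tone": 3, "format": 5}',
    '{"accuracy": 5, "helpfulness": 4, "tone": 4, "format": 5}',
]
print(aggregate_judge_scores(judgments))
# {'accuracy': 4.5, 'helpfulness': 4.5, 'tone': 3.5, 'format': 5.0}
```

Averaging per criterion rather than per example is what surfaces a pattern like "accuracy is fine, tone is weak" — exactly the signal you need for the iteration loop later in this lesson.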
Side-by-Side Comparison
Even more useful: have the judge compare your fine-tuned model against the base model:
comparison_prompt = """
Two AI assistants answered the same question. Which response is better?
Question: {question}
Response A: {base_model_response}
Response B: {finetuned_response}
Which is better (A or B) and why? Consider accuracy, helpfulness, and tone.
"""
Randomize which model is A and B to avoid position bias. If the judge picks your fine-tuned model 70%+ of the time, you’re making progress.
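The randomization is easy to get wrong: you have to remember which slot each model landed in so you can map the judge's "A" or "B" verdict back. A sketch, where `judge` is an assumed callable (wrapping your GPT-4/Claude call) that returns `"A"` or `"B"`:

```python
import random

def compare_pair(question, base_resp, ft_resp, judge, rng=random):
    """Randomize which model appears as Response A, then map the
    judge's verdict back to 'base' or 'finetuned'."""
    if rng.random() < 0.5:
        a, b, a_is_ft = base_resp, ft_resp, False
    else:
        a, b, a_is_ft = ft_resp, base_resp, True
    verdict = judge(question, a, b)
    if verdict == "A":
        return "finetuned" if a_is_ft else "base"
    return "base" if a_is_ft else "finetuned"

# Toy judge (for illustration only) that always prefers the longer response
longer = lambda q, a, b: "A" if len(a) >= len(b) else "B"
wins = sum(
    compare_pair("q", "short", "a much longer reply", longer) == "finetuned"
    for _ in range(100)
)
print(f"Fine-tuned win rate: {wins}%")  # Fine-tuned win rate: 100%
```

The toy judge also illustrates verbosity bias in miniature: it picks the longer answer every time, no matter which slot it sits in.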
Calibrating the Judge
LLM judges aren’t perfect. They have biases:
- Verbosity bias — Longer responses tend to score higher
- Position bias — The first response tends to score higher
- Self-enhancement — GPT-4 rates GPT-4-style responses higher
Mitigation: Run 20 comparisons with humans too. If the judge and humans agree 80%+ of the time, the judge is reliable enough for your use case.
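The calibration check is just an agreement rate over paired verdicts. A minimal sketch, assuming you've collected the judge's and a human's winner ("A" or "B") for the same comparisons:

```python
def agreement_rate(judge_verdicts, human_verdicts):
    """Fraction of comparisons where judge and human pick the same winner."""
    assert len(judge_verdicts) == len(human_verdicts)
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)

judge = ["B", "A", "B", "B", "A"]
human = ["B", "A", "A", "B", "A"]
print(f"Judge-human agreement: {agreement_rate(judge, human):.0%}")  # 80%
```

If the number comes in well under 80%, tighten the judge prompt (clearer criteria, a few scored examples) before trusting the judge's aggregate scores.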
Level 3: The A/B Framework
For systematic comparison, run this protocol:
Step 1: Select 50 diverse test prompts (not cherry-picked)
Step 2: Generate responses from three sources:
- Base model (no fine-tuning)
- Fine-tuned model (your latest version)
- GPT-4/Claude (reference ceiling)
Step 3: Score each response (LLM-as-judge + human spot-check)
Step 4: Compare:
| Metric | Base Model | Fine-Tuned | GPT-4 Reference |
|---|---|---|---|
| Accuracy | 3.2 | 4.1 | 4.6 |
| Format compliance | 60% | 95% | 98% |
| Tone match | 2.8 | 4.3 | 3.5 |
| Avg judge score | 3.0 | 4.2 | 4.3 |
If your fine-tuned 3B model scores within 80-90% of GPT-4 on your specific task, that’s a major win — you’re getting near-GPT-4 quality at 10-100x lower cost and latency.
When Things Go Wrong
Common problems and fixes:
| Symptom | Likely Cause | Fix |
|---|---|---|
| Good on training, bad on test | Overfitting | Reduce epochs, increase data diversity, add dropout |
| Bad on everything | Insufficient data or wrong format | Check data quality, increase to 500+ examples |
| Great accuracy, wrong tone | Training data has right answers but wrong style | Add DPO preference pairs for tone |
| Model refuses to answer | Safety training from base model conflicting | Add examples where the model answers similar questions |
| Output cuts off mid-sentence | max_seq_length too short | Increase to 2048 or 4096 |
✅ Quick Check: Your model scores 4.5/5 on accuracy but 2.1/5 on tone in judge evaluation. You already have 1,000 SFT examples. What’s the next step? (Add DPO preference pairs. Create 200-300 pairs where the “chosen” response has the right tone and the “rejected” has the wrong tone. SFT teaches what to say; DPO teaches how to say it.)
The Iteration Loop
Fine-tuning is rarely one-and-done. Here’s the practical loop:
- Train → Evaluate → Identify weakest dimension
- Fix data → Add examples targeting the weak area
- Retrain → Evaluate again → Compare to previous version
- Repeat → Until evaluation metrics plateau
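The loop above can be sketched as a plateau-detecting driver. This is illustrative scaffolding, not a real training harness: `train` and `evaluate` are assumed callables you'd supply (`train()` returns a model, `evaluate(model)` returns a dict of per-dimension scores):

```python
def iterate(train, evaluate, max_rounds=5, min_gain=0.01):
    """Retrain until the average evaluation score plateaus.
    Returns the per-round average scores."""
    best = None
    history = []
    for round_num in range(1, max_rounds + 1):
        model = train()
        scores = evaluate(model)
        avg = sum(scores.values()) / len(scores)
        weakest = min(scores, key=scores.get)  # dimension to target with new data
        history.append(avg)
        print(f"Round {round_num}: avg={avg:.2f}, weakest dimension={weakest}")
        if best is not None and avg - best < min_gain:
            break  # plateau: the fix is better data, not more rounds
        best = avg
    return history
```

The `weakest` dimension printed each round is what drives the "fix data" step: add examples targeting that dimension before retraining.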
Most projects converge in 2-4 iterations. If you’re past iteration 5 with no improvement, the problem is usually data quality — not model size or hyperparameters.
Key Takeaways
- Training loss is a sanity check, not a quality measure — always evaluate on held-out data
- Use format compliance and BERTScore for generation tasks, F1 for classification
- LLM-as-judge (GPT-4/Claude) scales evaluation to hundreds of examples affordably
- Side-by-side comparisons (fine-tuned vs. base vs. reference) give the clearest signal
- Calibrate your judge against human ratings on a 20-example sample
- When metrics plateau, iterate on data quality — add examples targeting the weakest dimension
- 2-4 iterations is typical before a model is production-ready
Up Next
Your model passes evaluation. Now what? In the final lesson, you’ll learn production deployment — merging adapters, choosing a serving strategy, calculating costs, monitoring drift, and real-world SLM use cases that justify the whole pipeline.