Model Evaluation
How to measure whether your ML model is actually good — accuracy, precision, recall, F1 score, overfitting, and the bias-variance tradeoff.
Is Your Model Actually Good?
🔄 Lesson 4 covered the data pipeline — how data gets prepared and split. Now comes the critical question: once you’ve trained a model, how do you know it works?
The answer is surprisingly nuanced. A model that looks great on one metric can be terrible on another. And the most intuitive metric — accuracy — is often the most misleading.
Accuracy: The Obvious Metric (That Lies)
Accuracy = correct predictions / total predictions.
It seems logical: if the model gets 95 out of 100 predictions right, it’s 95% accurate. But accuracy breaks down with imbalanced data.
The class imbalance problem:
- 1,000 emails: 990 legitimate, 10 spam
- Model predicts every email is legitimate
- Accuracy: 990/1000 = 99%
- Spam caught: 0 out of 10
The model is 99% accurate and completely useless for its intended purpose. This is why ML practitioners rarely rely on accuracy alone.
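The numbers above are easy to reproduce in plain Python. This sketch uses the hypothetical 990/10 split from the example:

```python
# Hypothetical spam dataset from the example: 990 legitimate (0), 10 spam (1).
labels = [0] * 990 + [1] * 10

# A degenerate model that predicts "legitimate" for every email.
predictions = [0] * 1000

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)
spam_caught = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))

print(f"Accuracy: {accuracy:.0%}")       # 99% -- looks great
print(f"Spam caught: {spam_caught}/10")  # 0 -- completely useless
```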
Precision and Recall: The Metrics That Matter
Precision: When the model says “positive,” how often is it right?
Precision = True Positives / (True Positives + False Positives)
“Of all the emails I flagged as spam, what percentage actually were spam?”
Recall: Of all actual positives, how many did the model catch?
Recall = True Positives / (True Positives + False Negatives)
“Of all the actual spam emails, what percentage did I flag?”
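Both formulas reduce to counting confusion-matrix cells. Here is a minimal sketch in plain Python, using made-up labels for illustration:

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = positive)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy example: 4 actual spam emails; the model flags 5 emails, 3 of them correctly.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
p, r = precision_recall(y_true, y_pred)
print(f"precision={p}, recall={r}")  # precision = 3/5, recall = 3/4
```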
The tension: Increasing one typically decreases the other.
| Scenario | Prioritize | Why |
|---|---|---|
| Cancer screening | Recall | Missing cancer (false negative) is life-threatening |
| Spam filtering | Precision | Blocking real email (false positive) loses business |
| Fraud detection | Recall | Missing fraud costs thousands per incident |
| Search results | Precision | Irrelevant results (false positives) frustrate users |
✅ Quick Check: A model screens job applications. It has high recall (catches 95% of qualified candidates) but low precision (only 40% of flagged applications are actually qualified). What happens in practice?

HR reviews lots of unqualified candidates (60% of those flagged aren’t actually qualified), wasting review time. But few qualified candidates are missed (only 5% slip through). Is this acceptable? It depends on the hiring market: in a tight labor market where missing talent is costly, high recall is worth the extra review time; in an oversaturated market, higher precision (fewer wasted reviews) matters more.
F1 Score: Balancing Both
The F1 score is the harmonic mean of precision and recall — a single number that balances both:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
F1 is highest when both precision and recall are high. If either one is low, F1 drops. This makes it a useful single metric when you care about both catching positives (recall) and avoiding false alarms (precision).
When to use F1: Imbalanced datasets where accuracy is misleading. Most classification problems in practice.
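The harmonic mean's penalty for imbalance is easy to see in a quick sketch:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The arithmetic mean of 0.9 and 0.1 is 0.5, but F1 is only 0.18:
# one weak metric drags the score down.
print(f1_score(0.9, 0.1))
print(f1_score(0.8, 0.8))  # balanced inputs: F1 equals both
```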
Overfitting vs Underfitting
The most fundamental challenge in ML: building a model that learns real patterns without memorizing noise.
Overfitting (too complex):
- Model memorizes training data, including noise
- Training accuracy: very high
- Test accuracy: significantly lower
- The model “studied the answer key” instead of learning the material
Underfitting (too simple):
- Model is too simple to capture the real patterns
- Training accuracy: low
- Test accuracy: also low
- The model “didn’t study enough”
The sweet spot: A model complex enough to capture real patterns but simple enough to generalize to new data.
How to detect overfitting:
- Compare training accuracy to test accuracy — a large gap signals overfitting
- Use cross-validation — high variance across folds suggests instability
- Plot learning curves — if training accuracy keeps rising while test accuracy plateaus or drops, overfitting is occurring
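The train/test gap is easiest to see with a deliberately overfit model. In this toy sketch (made-up data, pure Python), a "memorizer" that stores every training point gets perfect training accuracy but falls apart on fresh data, while a simple rule generalizes:

```python
import random

random.seed(0)

def make_data(n):
    """Made-up task: label is 1 when x > 0.5, with 10% label noise."""
    data = []
    for _ in range(n):
        x = random.random()
        y = int(x > 0.5)
        if random.random() < 0.1:  # noise a good model should NOT learn
            y = 1 - y
        data.append((x, y))
    return data

train, test = make_data(200), make_data(200)

# Overfit model: memorizes every training point, noise included.
memory = dict(train)
def memorizer(x):
    return memory.get(x, 0)  # unseen inputs get a blind guess

# Appropriately simple model: the underlying threshold rule.
def threshold(x):
    return int(x > 0.5)

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

print(f"memorizer: train={accuracy(memorizer, train):.2f}, "
      f"test={accuracy(memorizer, test):.2f}")  # large gap: overfitting
print(f"threshold: train={accuracy(threshold, train):.2f}, "
      f"test={accuracy(threshold, test):.2f}")  # small gap: generalizes
```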
How to fix overfitting:
- Simplify the model (fewer parameters, shallower tree)
- Add regularization (penalties for complexity)
- Get more training data (harder to memorize larger datasets)
- Use dropout in neural networks (randomly disable neurons during training)
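The regularization fix can be made concrete with a toy one-dimensional ridge regression (made-up data, no intercept). The penalty term alpha * w**2 pulls the fitted slope toward zero as alpha grows, trading a little bias for less sensitivity to the particular training set:

```python
# Made-up 1D data roughly following y = x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.8]

def ridge_slope(xs, ys, alpha):
    """Closed-form minimizer of sum((y - w*x)^2) + alpha * w^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)

# Larger alpha (stronger penalty) shrinks the coefficient toward zero.
for alpha in (0.0, 1.0, 10.0):
    print(f"alpha={alpha}: slope={ridge_slope(xs, ys, alpha):.3f}")
```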
The Bias-Variance Tradeoff
This is the theoretical framework behind overfitting and underfitting:
Bias = how far off the model’s predictions are from reality (systematic error). High bias means the model is too simple — it underfits.
Variance = how much predictions change when you train on different data. High variance means the model is too sensitive to the specific training data — it overfits.
| | Low Bias | High Bias |
|---|---|---|
| Low Variance | Ideal model | Underfitting |
| High Variance | Overfitting | Worst case |
You can’t minimize both simultaneously — that’s the tradeoff. The goal is finding the complexity level where total error (bias + variance) is lowest.
✅ Quick Check: Model A scores 92% on training and 90% on testing. Model B scores 99% on training and 82% on testing. Which is better?

Model A: the 2-point gap between training and testing suggests it’s generalizing well. Model B’s 17-point gap reveals severe overfitting. Even though B has higher training accuracy, A will perform better on real-world data. Always evaluate on test data, never training data.
Key Takeaways
- Accuracy is misleading with imbalanced data — a model predicting the majority class gets high accuracy while being useless
- Precision (when I say positive, am I right?) and recall (of all positives, how many did I catch?) capture what accuracy misses
- F1 score balances precision and recall into one metric — use it for imbalanced classification
- Overfitting (memorizing training data) shows as a gap between training and test performance — fix with simpler models, regularization, or more data
- Underfitting (too simple) shows as poor performance on both training and test — fix with more complex models or better features
- The bias-variance tradeoff is the fundamental tension: too simple underfits, too complex overfits
Up Next
You understand algorithms, data pipelines, and evaluation. Lesson 6 covers the software tools — scikit-learn, PyTorch, TensorFlow, and the Python ecosystem that makes ML practical.