Model Evaluation
How to measure whether your ML model is actually good — accuracy, precision, recall, F1 score, overfitting, and the bias-variance tradeoff.
Is Your Model Actually Good?
🔄 Lesson 4 covered the data pipeline — how data gets prepared and split. Now comes the critical question: once you’ve trained a model, how do you know it works?
The answer is surprisingly nuanced. A model that looks great on one metric can be terrible on another. And the most intuitive metric — accuracy — is often the most misleading.
Accuracy: The Obvious Metric (That Lies)
Accuracy = correct predictions / total predictions.
It seems logical: if the model gets 95 out of 100 predictions right, it’s 95% accurate. But accuracy breaks down with imbalanced data.
The class imbalance problem:
- 1,000 emails: 990 legitimate, 10 spam
- Model predicts every email is legitimate
- Accuracy: 990/1000 = 99%
- Spam caught: 0 out of 10
The model is 99% accurate and completely useless for its intended purpose. This is why ML practitioners rarely rely on accuracy alone.
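The numbers above are easy to reproduce in plain Python. This sketch uses the hypothetical 990/10 split from the example:

```python
# Hypothetical spam dataset from the example: 990 legitimate (0), 10 spam (1).
labels = [0] * 990 + [1] * 10

# A degenerate model that predicts "legitimate" for every email.
predictions = [0] * 1000

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)
spam_caught = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))

print(f"Accuracy: {accuracy:.0%}")       # 99% -- looks great
print(f"Spam caught: {spam_caught}/10")  # 0 -- completely useless
```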
Precision and Recall: The Metrics That Matter
Precision: When the model says “positive,” how often is it right?
Precision = True Positives / (True Positives + False Positives)
“Of all the emails I flagged as spam, what percentage actually were spam?”
Recall: Of all actual positives, how many did the model catch?
Recall = True Positives / (True Positives + False Negatives)
“Of all the actual spam emails, what percentage did I flag?”
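Both formulas reduce to counting confusion-matrix cells. Here is a minimal sketch in plain Python, using made-up labels for illustration:

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = positive)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy example: 4 actual spam emails; the model flags 5 emails, 3 of them correctly.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
p, r = precision_recall(y_true, y_pred)
print(f"precision={p}, recall={r}")  # precision = 3/5, recall = 3/4
```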
The tension: Increasing one typically decreases the other.
| Scenario | Prioritize | Why |
|---|---|---|
| Cancer screening | Recall | Missing cancer (false negative) is life-threatening |
| Spam filtering | Precision | Blocking real email (false positive) loses business |
| Fraud detection | Recall | Missing fraud costs thousands per incident |
| Search results | Precision | Irrelevant results (false positives) frustrate users |
✅ Quick Check: A model screens job applications. It has high recall (catches 95% of qualified candidates) but low precision (only 40% of flagged applications are actually qualified). What happens in practice?

HR reviews lots of unqualified candidates (60% of those flagged aren’t actually qualified), wasting review time. But few qualified candidates are missed (only 5% slip through). Is this acceptable? It depends on the hiring market: in a tight labor market where missing talent is costly, high recall is worth the extra review time; in an oversaturated market, higher precision (fewer wasted reviews) matters more.
F1 Score: Balancing Both
The F1 score is the harmonic mean of precision and recall — a single number that balances both:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
F1 is highest when both precision and recall are high. If either one is low, F1 drops. This makes it a useful single metric when you care about both catching positives (recall) and avoiding false alarms (precision).
When to use F1: Imbalanced datasets where accuracy is misleading. Most classification problems in practice.
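The harmonic mean's penalty for imbalance is easy to see in a quick sketch:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The arithmetic mean of 0.9 and 0.1 is 0.5, but F1 is only 0.18:
# one weak metric drags the score down.
print(f1_score(0.9, 0.1))
print(f1_score(0.8, 0.8))  # balanced inputs: F1 equals both
```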
Overfitting vs Underfitting
The most fundamental challenge in ML: building a model that learns real patterns without memorizing noise.
Overfitting (too complex):
- Model memorizes training data, including noise
- Training accuracy: very high
- Test accuracy: significantly lower
- The model “studied the answer key” instead of learning the material
Underfitting (too simple):
- Model is too simple to capture the real patterns
- Training accuracy: low
- Test accuracy: also low
- The model “didn’t study enough”
The sweet spot: A model complex enough to capture real patterns but simple enough to generalize to new data.
How to detect overfitting:
- Compare training accuracy to test accuracy — a large gap signals overfitting
- Use cross-validation — high variance across folds suggests instability
- Plot learning curves — if training accuracy keeps rising while test accuracy plateaus or drops, overfitting is occurring
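The train/test gap is easiest to see with a deliberately overfit model. In this toy sketch (made-up data, pure Python), a "memorizer" that stores every training point gets perfect training accuracy but falls apart on fresh data, while a simple rule generalizes:

```python
import random

random.seed(0)

def make_data(n):
    """Made-up task: label is 1 when x > 0.5, with 10% label noise."""
    data = []
    for _ in range(n):
        x = random.random()
        y = int(x > 0.5)
        if random.random() < 0.1:  # noise a good model should NOT learn
            y = 1 - y
        data.append((x, y))
    return data

train, test = make_data(200), make_data(200)

# Overfit model: memorizes every training point, noise included.
memory = dict(train)
def memorizer(x):
    return memory.get(x, 0)  # unseen inputs get a blind guess

# Appropriately simple model: the underlying threshold rule.
def threshold(x):
    return int(x > 0.5)

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

print(f"memorizer: train={accuracy(memorizer, train):.2f}, "
      f"test={accuracy(memorizer, test):.2f}")  # large gap: overfitting
print(f"threshold: train={accuracy(threshold, train):.2f}, "
      f"test={accuracy(threshold, test):.2f}")  # small gap: generalizes
```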
How to fix overfitting:
- Simplify the model (fewer parameters, shallower tree)
- Add regularization (penalties for complexity)
- Get more training data (harder to memorize larger datasets)
- Use dropout in neural networks (randomly disable neurons during training)
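The regularization fix can be made concrete with a toy one-dimensional ridge regression (made-up data, no intercept). The penalty term alpha * w**2 pulls the fitted slope toward zero as alpha grows, trading a little bias for less sensitivity to the particular training set:

```python
# Made-up 1D data roughly following y = x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.8]

def ridge_slope(xs, ys, alpha):
    """Closed-form minimizer of sum((y - w*x)^2) + alpha * w^2."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)

# Larger alpha (stronger penalty) shrinks the coefficient toward zero.
for alpha in (0.0, 1.0, 10.0):
    print(f"alpha={alpha}: slope={ridge_slope(xs, ys, alpha):.3f}")
```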
The Bias-Variance Tradeoff
This is the theoretical framework behind overfitting and underfitting:
Bias = how far off the model’s predictions are from reality (systematic error). High bias means the model is too simple — it underfits.
Variance = how much predictions change when you train on different data. High variance means the model is too sensitive to the specific training data — it overfits.
| | Low Bias | High Bias |
|---|---|---|
| Low Variance | Ideal model | Underfitting |
| High Variance | Overfitting | Worst case |
You can’t minimize both simultaneously — that’s the tradeoff. The goal is finding the complexity level where total error (bias + variance) is lowest.
✅ Quick Check: Model A scores 92% on training and 90% on testing. Model B scores 99% on training and 82% on testing. Which is better?

Model A: the 2-point gap between training and testing suggests it’s generalizing well. Model B’s 17-point gap reveals severe overfitting. Even though B has higher training accuracy, A will perform better on real-world data. Always evaluate on test data, never training data.
Key Takeaways
- Accuracy is misleading with imbalanced data — a model predicting the majority class gets high accuracy while being useless
- Precision (when I say positive, am I right?) and recall (of all positives, how many did I catch?) capture what accuracy misses
- F1 score balances precision and recall into one metric — use it for imbalanced classification
- Overfitting (memorizing training data) shows as a gap between training and test performance — fix with simpler models, regularization, or more data
- Underfitting (too simple) shows as poor performance on both training and test — fix with more complex models or better features
- The bias-variance tradeoff is the fundamental tension: too simple underfits, too complex overfits
Up Next
You understand algorithms, data pipelines, and evaluation. Lesson 6 covers the software tools — scikit-learn, PyTorch, TensorFlow, and the Python ecosystem that makes ML practical.