Fighting Overfitting
Why neural networks memorize instead of learn — and the techniques that fix it: dropout, batch normalization, regularization, and data augmentation.
The Memorization Problem
🔄 Lesson 4 introduced the four main architectures — feedforward, CNN, RNN, and transformer. Each is powerful. But they all share a fundamental vulnerability: given enough capacity and training time, neural networks memorize training data instead of learning generalizable patterns.
This is overfitting — and it’s the single biggest challenge in deep learning practice.
Diagnosing the Problem
The diagnostic is simple: compare training performance to test performance.
| Training Accuracy | Test Accuracy | Gap | Diagnosis |
|---|---|---|---|
| 98% | 95% | 3% | Good — slight overfitting is normal |
| 99% | 78% | 21% | Severe overfitting — memorizing training data |
| 65% | 62% | 3% | Underfitting — model is too simple |
| 65% | 40% | 25% | Both underfitting and overfitting — broken setup |
A small gap (2-5%) is normal and expected. A large gap (15%+) means the model learned patterns specific to the training data that don’t generalize.
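The gap check above can be sketched as a tiny helper. This is an illustrative diagnostic only — the function name and the exact thresholds are assumptions, not part of the lesson, and the 65%/40% "both" case from the table would simply report as overfitting here:

```python
def diagnose(train_acc, test_acc, min_train_acc=0.70, max_gap=0.10):
    """Rough train/test-gap diagnosis. Thresholds are illustrative:
    a gap over ~10% suggests overfitting; low training accuracy with a
    small gap suggests underfitting."""
    gap = train_acc - test_acc
    if gap >= max_gap:
        return "overfitting"
    if train_acc < min_train_acc:
        return "underfitting"
    return "ok"

print(diagnose(0.99, 0.78))  # large gap: overfitting
print(diagnose(0.65, 0.62))  # small gap, low accuracy: underfitting
print(diagnose(0.98, 0.95))  # small gap, high accuracy: ok
```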
Dropout
What it does: During each training step, randomly “turn off” a percentage of neurons (typically 20-50%). Different neurons are disabled each time.
Why it works: Without dropout, neurons develop co-dependencies — a group of neurons might learn a pattern collaboratively, with each contributing a small piece. If one neuron is wrong, the whole group fails. Dropout forces each neuron to be independently useful because it can’t rely on any specific partner being present.
Analogy: A basketball team that practices with random players sitting out each game develops versatility. Every player learns to handle multiple positions. A team that always practices with the same lineup becomes fragile — lose one player and the whole system breaks.
Settings:
- 20-30% dropout for input layers
- 40-50% dropout for hidden layers
- Lower dropout for smaller networks (to avoid underfitting)
✅ Quick Check: You apply 50% dropout and training accuracy drops from 99% to 85%. Is this a problem? No — that’s expected. Dropout intentionally reduces training accuracy by making the model work with only half its neurons at each step. The metric that matters is test accuracy. If test accuracy improves from 78% to 88%, the dropout is working — the model generalizes better despite lower training performance. Always evaluate regularization techniques by their effect on test performance, not training performance.
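The mechanics are simple enough to sketch in plain Python. This shows "inverted" dropout — the common variant where surviving activations are scaled up during training so that inference needs no adjustment; the function name is illustrative:

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    """Inverted dropout: zero each activation with probability p during
    training, scaling survivors by 1/(1-p) so the expected value stays
    the same. At inference time the layer passes values through unchanged."""
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

random.seed(0)
out = dropout([1.0, 2.0, 3.0, 4.0], p=0.5)
# Each output is either 0.0 or double its input (1 / (1 - 0.5) = 2).
```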
Batch Normalization
What it does: Normalizes the inputs to each layer so they have a consistent mean and variance, then lets the network learn the optimal scale and shift.
Why it works: As training progresses, the distribution of values flowing between layers shifts — a phenomenon called “internal covariate shift.” Each layer constantly adjusts to the changing distribution from the layer before it, which slows training and can lead to instability. Batch normalization stabilizes these distributions.
Benefits:
- Training converges faster (often 2-3x speedup)
- Allows higher learning rates (more aggressive updates without instability)
- Provides a mild regularization effect (the batch statistics add noise)
Placement: The recommended order within a layer is: Linear/Conv → Batch Norm → Activation → Dropout.
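For one feature across a batch, the computation is just normalize-then-rescale. A minimal sketch (the learned parameters gamma and beta are fixed here for illustration; in a real layer they are trained, and running averages are kept for inference):

```python
from statistics import fmean

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a batch to zero mean and unit
    variance, then apply a learned scale (gamma) and shift (beta).
    eps avoids division by zero for near-constant features."""
    mean = fmean(batch)
    var = fmean((x - mean) ** 2 for x in batch)
    return [gamma * (x - mean) / (var + eps) ** 0.5 + beta for x in batch]

normed = batch_norm([2.0, 4.0, 6.0, 8.0])
# Output has mean ~0 and variance ~1, whatever scale the inputs had.
```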
L2 Regularization (Weight Decay)
What it does: Adds a penalty proportional to the size of the weights. Large weights get penalized more.
Why it works: Overfitting often involves large weight values — the model creates extreme, precise patterns that fit training data perfectly but are fragile. L2 regularization keeps weights small, favoring simpler, smoother functions that generalize better.
Analogy: A storyteller who must explain something using only simple words creates a more universally understandable explanation than one who uses jargon. The “simple words” constraint (small weights) forces generalizable solutions.
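The penalty itself is one line of arithmetic. A sketch, with an illustrative strength `lam` (frameworks usually fold this into the optimizer as a `weight_decay` setting):

```python
def l2_penalty(weights, lam=1e-4):
    """L2 penalty added to the loss: lam * sum(w^2). Its gradient
    contribution, 2 * lam * w, nudges every weight toward zero each
    step — hence the name 'weight decay'."""
    return lam * sum(w * w for w in weights)

extra_loss = l2_penalty([3.0, -4.0], lam=0.01)
# 0.01 * (9 + 16) = 0.25 — the larger weight (-4.0) contributes most.
```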
Data Augmentation
What it does: Creates new training examples by applying random transformations to existing ones.
For images:
- Horizontal flip (cat facing left → cat facing right)
- Random rotation (±10-15 degrees)
- Random crop (slightly different framing)
- Brightness/contrast adjustment
- Color jitter (slight color shifts)
For text:
- Synonym replacement (“happy” → “pleased”)
- Random insertion/deletion of words
- Back-translation (English → French → English)
For audio:
- Time stretching (slightly faster or slower)
- Pitch shifting
- Background noise injection
Why it works: Each transformation creates a valid new training example. A horizontally flipped cat is still a cat. The network sees more variety during training, which reduces its ability to memorize specific examples and forces it to learn general patterns.
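Two of the image transforms above, sketched on a tiny 2D grid of pixel values (function names and the jitter range are illustrative; real pipelines chain many such operations on full image tensors):

```python
import random

def hflip(image):
    """Horizontal flip: reverse each row of a 2D grid of pixel values."""
    return [row[::-1] for row in image]

def augment(image, rng=random):
    """Randomly apply label-preserving transforms: a 50% horizontal flip
    plus a small brightness shift, clamped to the valid [0, 1] range."""
    out = hflip(image) if rng.random() < 0.5 else [row[:] for row in image]
    shift = rng.uniform(-0.1, 0.1)
    return [[min(1.0, max(0.0, p + shift)) for p in row] for row in out]

img = [[0.1, 0.9], [0.5, 0.2]]
flipped = hflip(img)
aug = augment(img)  # a new, slightly different training example
```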
✅ Quick Check: Which of these augmentations would be WRONG for a dataset of handwritten digits? Horizontal flip — because a flipped 6 looks like a 9, and a flipped 7 doesn’t look like a valid digit. Augmentation must preserve the label. Rotation (small angles), slight scaling, and adding noise are all safe for digits. Always think about whether a transformation changes the meaning of the data.
Early Stopping
What it does: Monitor validation loss during training. When it starts increasing (even while training loss keeps decreasing), stop training.
Why it works: Increasing validation loss means the model has started memorizing training-specific patterns rather than learning general ones. Stopping at the right moment captures the best generalization performance.
In practice: Save a snapshot of the model weights whenever validation loss hits a new low. If validation loss hasn’t improved for a set number of epochs (the “patience” parameter, typically 5-10), stop and use the saved snapshot.
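The bookkeeping described above fits in a few lines. A sketch that operates on a precomputed list of per-epoch validation losses (in a real loop you would save the model weights at each new best, then restore them after stopping):

```python
def early_stop(val_losses, patience=3):
    """Track the best validation loss; stop once `patience` consecutive
    epochs pass without improvement. Returns (best_epoch, best_loss) —
    the snapshot you would restore."""
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # stop training; keep the epoch-`best_epoch` snapshot
    return best_epoch, best_loss

# Validation loss falls, then climbs: training halts, epoch 3 is kept.
losses = [1.0, 0.8, 0.6, 0.5, 0.55, 0.6, 0.7, 0.8, 0.9, 1.0]
best_epoch, best_loss = early_stop(losses, patience=3)
```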
Combining Techniques
These techniques aren’t alternatives — they’re typically used together:
| Technique | When to Use | Impact |
|---|---|---|
| Dropout | Almost always for dense layers | Strong regularization |
| Batch normalization | Almost always | Faster training, mild regularization |
| L2 regularization | When model has many parameters | Prevents extreme weights |
| Data augmentation | When dataset is small | Effectively increases data |
| Early stopping | Always | Catches optimal stopping point |
Recommended stack: Batch Norm + Dropout + Data Augmentation + Early Stopping covers most overfitting scenarios.
Key Takeaways
- Overfitting = large gap between training and test performance — the model memorized training data
- Dropout randomly disables neurons during training, forcing robust independent learning
- Batch normalization stabilizes layer inputs, speeds training 2-3x, and provides mild regularization
- L2 regularization penalizes large weights, favoring simpler generalizable models
- Data augmentation creates new training examples through transformations — effectively multiplying dataset size
- Early stopping monitors validation loss and halts training before overfitting deepens
- Combine techniques: Batch Norm + Dropout + Augmentation + Early Stopping is the standard stack
Up Next
Instead of training from scratch (which needs massive data and compute), what if you could start with a model that already knows how to see, read, or hear? Lesson 6 covers transfer learning — how pretrained models let you build with 1% of the data.