Transfer Learning
How pretrained models let you build deep learning systems with 1% of the data — feature extraction, fine-tuning, and when to use each approach.
Standing on the Shoulders of Giants
Lesson 5 covered overfitting — the challenge of training models that generalize. But what if you could skip most of the training entirely? What if someone already trained a model on 14 million images or 300 billion words, and you could use that as your starting point?
That’s transfer learning. And it’s how most deep learning gets done in practice.
The Core Idea
Training a deep learning model from scratch requires massive data and compute. ResNet was trained on ImageNet's 14 million images. GPT-4 was reportedly trained on trillions of tokens. BERT was trained on the entire English Wikipedia plus the BooksCorpus.
Transfer learning reuses these pretrained models for new tasks. Instead of learning to see (or read, or hear) from scratch, you start with a model that already can — then adapt it to your specific problem.
The result: 90-95% of custom model performance with 1% of the data and compute.
How It Works
A pretrained CNN trained on millions of images has already learned:
- Early layers: Edges, gradients, corners, basic textures
- Middle layers: Complex textures, patterns, shapes
- Deep layers: Object parts — ears, wheels, eyes, windows
These features are general — they apply to almost any visual task. A “circle” detector is useful whether you’re classifying dogs, medical scans, or manufacturing defects.
Transfer learning takes these pretrained features and applies them to your specific task by modifying only the top layers.
Approach 1: Feature Extraction
How it works:
- Take a pretrained model (e.g., ResNet trained on ImageNet)
- Remove the final classification layer
- Freeze all remaining layers — their weights don’t change
- Add a new classifier trained on your data
The pretrained layers act as a fixed feature extractor. Your data flows through them, producing rich feature representations. The new classifier learns to map those features to your specific categories.
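The four steps above can be sketched in plain Python. This is a conceptual toy, not a real framework API — `Layer`, `backbone`, and `head` are illustrative names standing in for whatever library you actually use:

```python
# Toy sketch of feature extraction. The key move is step 3:
# mark every pretrained layer as non-trainable so only the
# new classifier's weights change during training.

class Layer:
    def __init__(self, name):
        self.name = name
        self.trainable = True  # whether gradient updates touch this layer

# A "pretrained" backbone plus a fresh classification head (step 1-2).
backbone = [Layer("conv1"), Layer("conv2"), Layer("conv3")]
head = Layer("new_classifier")

# Step 3: freeze all pretrained layers — their weights don't change.
for layer in backbone:
    layer.trainable = False

# Step 4: the full model; training will only update the new head.
model = backbone + [head]
trainable = [layer.name for layer in model if layer.trainable]
print(trainable)  # → ['new_classifier']
```

In a real framework the same idea is usually one flag per layer (e.g. setting a layer's trainable/requires-gradient attribute to false) before compiling or building the optimizer.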
When to use: Small datasets (hundreds to a few thousand examples), or when your task is similar to what the model was originally trained on.
Advantages: Fast, low risk of overfitting, minimal compute needed.
Limitation: Can’t adapt the pretrained features to your domain — if your data is very different from the pretraining data (e.g., satellite imagery vs. natural photos), the fixed features might not capture what matters.
✅ Quick Check: A startup has 300 labeled chest X-rays and wants to detect pneumonia. Why is feature extraction (not training from scratch) the right approach? With 300 images, a CNN trained from scratch would overfit immediately — memorizing those 300 images rather than learning what pneumonia looks like. A pretrained model (like ResNet trained on ImageNet) already knows how to detect edges, textures, and shapes. The new classifier only needs to learn: “this combination of features = pneumonia.” That’s feasible with 300 examples because you’re training a small classifier on rich features, not an entire CNN from pixels.
Approach 2: Fine-Tuning
How it works:
- Take a pretrained model
- Add your new classifier (same as feature extraction)
- Unfreeze the top layers — allow their weights to update
- Train with a very low learning rate (10-100× lower than normal)
The early layers (general features like edges) stay frozen. The top layers get adapted to your specific data. The new classifier is trained from scratch.
When to use: Medium to large datasets (thousands to millions of examples), or when your domain differs from the pretraining domain.
Advantages: Higher accuracy than feature extraction because the model adapts its features to your specific task.
Risk: Catastrophic forgetting — if the learning rate is too high, the new training destroys the pretrained knowledge. Always use a low learning rate for fine-tuning.
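The fine-tuning recipe differs from feature extraction in two places — which layers unfreeze, and the learning rate. A minimal sketch, again with illustrative names rather than a real API:

```python
# Toy sketch of fine-tuning: unfreeze only the TOP layers,
# and scale the learning rate down to avoid catastrophic forgetting.

class Layer:
    def __init__(self, name):
        self.name = name
        self.trainable = False  # start fully frozen, as in feature extraction

layers = [Layer(n) for n in
          ["conv1", "conv2", "conv3", "conv4", "new_classifier"]]

# Unfreeze the top two pretrained layers plus the new head.
# Early layers (general edge/texture features) stay frozen.
for layer in layers[-3:]:
    layer.trainable = True

BASE_LR = 1e-3                 # a typical from-scratch learning rate
fine_tune_lr = BASE_LR / 100   # 10-100x lower for fine-tuning

print([layer.name for layer in layers if layer.trainable])
# → ['conv3', 'conv4', 'new_classifier']
print(fine_tune_lr)  # → 1e-05
```

How many layers to unfreeze is a hyperparameter: more data supports unfreezing more layers, while a small dataset argues for keeping most of the network frozen.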
Feature Extraction vs Fine-Tuning
| Factor | Feature Extraction | Fine-Tuning |
|---|---|---|
| Dataset size | Small (100-1,000) | Medium+ (1,000+) |
| Compute needed | Low (minutes) | Medium (hours) |
| Risk of overfitting | Low | Medium (manage with low LR) |
| Accuracy potential | Good | Better |
| Domain similarity to pretrained | High | Can be lower |
| Complexity | Simple | Moderate |
Rule of thumb: Start with feature extraction. If accuracy isn’t sufficient, try fine-tuning the top layers. Only fine-tune more layers if you have enough data to support it.
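That rule of thumb can be written down as a small decision helper. The function and its thresholds are hypothetical — they just encode the heuristic above, not any library's logic:

```python
def recommend_approach(n_examples, accuracy_sufficient):
    """Hypothetical helper encoding the rule of thumb:
    start with feature extraction; escalate to fine-tuning
    only if accuracy falls short AND the data can support it."""
    if accuracy_sufficient:
        return "feature extraction"
    if n_examples >= 1000:
        return "fine-tune top layers"
    return "feature extraction (collect more data before fine-tuning)"

# 300 X-rays, accuracy already acceptable: stay with feature extraction.
print(recommend_approach(300, True))   # → feature extraction
# 5,000 examples, accuracy falling short: worth fine-tuning.
print(recommend_approach(5000, False))  # → fine-tune top layers
```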
Common Pretrained Models
| Model | Type | Pretrained On | Use For |
|---|---|---|---|
| ResNet | CNN | ImageNet (14M images) | Image classification, feature extraction |
| VGG | CNN | ImageNet | Image classification (simpler architecture) |
| BERT | Transformer | Wikipedia + Books | Text classification, NER, Q&A |
| GPT-4 | Transformer | Web text (trillions of tokens) | Text generation, reasoning |
| ViT | Vision Transformer | ImageNet | Image classification (transformer-based) |
| Whisper | Transformer | 680K hours of audio | Speech-to-text |
Transfer Learning in Practice
Example: Medical imaging A hospital uses a ResNet pretrained on ImageNet, adds a new classifier, and fine-tunes on 5,000 labeled X-rays. Result: 94% accuracy on pneumonia detection — comparable to radiologist performance. Training from scratch on 5,000 images would produce ~60% accuracy at best.
Example: Sentiment analysis A company takes BERT (pretrained on general English text), adds a classification head, and fine-tunes on 20,000 customer reviews. Result: 92% accuracy across 5 sentiment categories. Training a language model from scratch would require millions of examples.
Example: Speech recognition An app uses OpenAI’s Whisper (pretrained on 680,000 hours of audio), fine-tunes on 1,000 hours of domain-specific recordings (medical dictation, legal transcription). Result: domain-specific accuracy that matches or exceeds general-purpose recognition.
✅ Quick Check: Why do early layers of a pretrained CNN (edge detectors, texture detectors) transfer well to almost any visual task, while later layers may not? Early layers learn universal visual features — every image has edges, textures, and basic shapes. These are useful regardless of domain. Later layers learn task-specific features — an ImageNet "dog ear detector" isn't useful for classifying skin lesions. This is why fine-tuning focuses on adapting the later layers while keeping early layers frozen.
Key Takeaways
- Transfer learning reuses pretrained models for new tasks — 90-95% of custom performance with 1% of the data
- Feature extraction: freeze pretrained layers, train only a new classifier — best for small datasets
- Fine-tuning: unfreeze top layers and adapt with a low learning rate — best for medium+ datasets
- Catastrophic forgetting occurs when fine-tuning with too high a learning rate — always use 10-100× lower than training from scratch
- Common pretrained models: ResNet (images), BERT (text), ViT (vision transformer), Whisper (audio)
- Start with feature extraction; escalate to fine-tuning only if accuracy needs improvement
Up Next
You understand how deep learning models are built, trained, and adapted. Lesson 7 shows where they’re deployed — real-world applications across industries, plus the hardware and frameworks that make it all work.