Transfer Learning
How pretrained models let you build deep learning systems with 1% of the data — feature extraction, fine-tuning, and when to use each approach.
Standing on the Shoulders of Giants
Lesson 5 covered overfitting — the challenge of training models that generalize. But what if you could skip most of the training entirely? What if someone already trained a model on 14 million images or 300 billion words, and you could use that as your starting point?
That’s transfer learning. And it’s how most deep learning gets done in practice.
The Core Idea
Training a deep learning model from scratch requires massive data and compute. ResNet was trained on ImageNet's 14 million images. GPT-4 was reportedly trained on trillions of tokens. BERT was trained on the entire English Wikipedia plus the BooksCorpus.
Transfer learning reuses these pretrained models for new tasks. Instead of learning to see (or read, or hear) from scratch, you start with a model that already can — then adapt it to your specific problem.
The result: 90-95% of custom model performance with 1% of the data and compute.
How It Works
A pretrained CNN trained on millions of images has already learned:
- Early layers: Edges, gradients, corners, basic textures
- Middle layers: Complex textures, patterns, shapes
- Deep layers: Object parts — ears, wheels, eyes, windows
These features are general — they apply to almost any visual task. A “circle” detector is useful whether you’re classifying dogs, medical scans, or manufacturing defects.
Transfer learning takes these pretrained features and applies them to your specific task by modifying only the top layers.
Approach 1: Feature Extraction
How it works:
- Take a pretrained model (e.g., ResNet trained on ImageNet)
- Remove the final classification layer
- Freeze all remaining layers — their weights don’t change
- Add a new classifier trained on your data
The pretrained layers act as a fixed feature extractor. Your data flows through them, producing rich feature representations. The new classifier learns to map those features to your specific categories.
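The four steps above can be sketched in plain Python. This is a conceptual toy, not a real framework API — `Layer`, `backbone`, and `head` are illustrative names standing in for whatever library you actually use:

```python
# Toy sketch of feature extraction. The key move is step 3:
# mark every pretrained layer as non-trainable so only the
# new classifier's weights change during training.

class Layer:
    def __init__(self, name):
        self.name = name
        self.trainable = True  # whether gradient updates touch this layer

# A "pretrained" backbone plus a fresh classification head (step 1-2).
backbone = [Layer("conv1"), Layer("conv2"), Layer("conv3")]
head = Layer("new_classifier")

# Step 3: freeze all pretrained layers — their weights don't change.
for layer in backbone:
    layer.trainable = False

# Step 4: the full model; training will only update the new head.
model = backbone + [head]
trainable = [layer.name for layer in model if layer.trainable]
print(trainable)  # → ['new_classifier']
```

In a real framework the same idea is usually one flag per layer (e.g. setting a layer's trainable/requires-gradient attribute to false) before compiling or building the optimizer.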
When to use: Small datasets (hundreds to a few thousand examples), or when your task is similar to what the model was originally trained on.
Advantages: Fast, low risk of overfitting, minimal compute needed.
Limitation: Can’t adapt the pretrained features to your domain — if your data is very different from the pretraining data (e.g., satellite imagery vs. natural photos), the fixed features might not capture what matters.
✅ Quick Check: A startup has 300 labeled chest X-rays and wants to detect pneumonia. Why is feature extraction (not training from scratch) the right approach? With 300 images, a CNN trained from scratch would overfit immediately — memorizing those 300 images rather than learning what pneumonia looks like. A pretrained model (like ResNet trained on ImageNet) already knows how to detect edges, textures, and shapes. The new classifier only needs to learn: “this combination of features = pneumonia.” That’s feasible with 300 examples because you’re training a small classifier on rich features, not an entire CNN from pixels.
Approach 2: Fine-Tuning
How it works:
- Take a pretrained model
- Add your new classifier (same as feature extraction)
- Unfreeze the top layers — allow their weights to update
- Train with a very low learning rate (10-100× lower than normal)
The early layers (general features like edges) stay frozen. The top layers get adapted to your specific data. The new classifier is trained from scratch.
When to use: Medium to large datasets (thousands to millions of examples), or when your domain differs from the pretraining domain.
Advantages: Higher accuracy than feature extraction because the model adapts its features to your specific task.
Risk: Catastrophic forgetting — if the learning rate is too high, the new training destroys the pretrained knowledge. Always use a low learning rate for fine-tuning.
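The fine-tuning recipe differs from feature extraction in two places — which layers unfreeze, and the learning rate. A minimal sketch, again with illustrative names rather than a real API:

```python
# Toy sketch of fine-tuning: unfreeze only the TOP layers,
# and scale the learning rate down to avoid catastrophic forgetting.

class Layer:
    def __init__(self, name):
        self.name = name
        self.trainable = False  # start fully frozen, as in feature extraction

layers = [Layer(n) for n in
          ["conv1", "conv2", "conv3", "conv4", "new_classifier"]]

# Unfreeze the top two pretrained layers plus the new head.
# Early layers (general edge/texture features) stay frozen.
for layer in layers[-3:]:
    layer.trainable = True

BASE_LR = 1e-3                 # a typical from-scratch learning rate
fine_tune_lr = BASE_LR / 100   # 10-100x lower for fine-tuning

print([layer.name for layer in layers if layer.trainable])
# → ['conv3', 'conv4', 'new_classifier']
print(fine_tune_lr)  # → 1e-05
```

How many layers to unfreeze is a hyperparameter: more data supports unfreezing more layers, while a small dataset argues for keeping most of the network frozen.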
Feature Extraction vs Fine-Tuning
| Factor | Feature Extraction | Fine-Tuning |
|---|---|---|
| Dataset size | Small (100-1,000) | Medium+ (1,000+) |
| Compute needed | Low (minutes) | Medium (hours) |
| Risk of overfitting | Low | Medium (manage with low LR) |
| Accuracy potential | Good | Better |
| Domain similarity to pretrained | High | Can be lower |
| Complexity | Simple | Moderate |
Rule of thumb: Start with feature extraction. If accuracy isn’t sufficient, try fine-tuning the top layers. Only fine-tune more layers if you have enough data to support it.
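That rule of thumb can be written down as a small decision helper. The function and its thresholds are hypothetical — they just encode the heuristic above, not any library's logic:

```python
def recommend_approach(n_examples, accuracy_sufficient):
    """Hypothetical helper encoding the rule of thumb:
    start with feature extraction; escalate to fine-tuning
    only if accuracy falls short AND the data can support it."""
    if accuracy_sufficient:
        return "feature extraction"
    if n_examples >= 1000:
        return "fine-tune top layers"
    return "feature extraction (collect more data before fine-tuning)"

# 300 X-rays, accuracy already acceptable: stay with feature extraction.
print(recommend_approach(300, True))   # → feature extraction
# 5,000 examples, accuracy falling short: worth fine-tuning.
print(recommend_approach(5000, False))  # → fine-tune top layers
```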
Common Pretrained Models
| Model | Type | Pretrained On | Use For |
|---|---|---|---|
| ResNet | CNN | ImageNet (14M images) | Image classification, feature extraction |
| VGG | CNN | ImageNet | Image classification (simpler architecture) |
| BERT | Transformer | Wikipedia + Books | Text classification, NER, Q&A |
| GPT-4 | Transformer | Web text (trillions of tokens) | Text generation, reasoning |
| ViT | Vision Transformer | ImageNet | Image classification (transformer-based) |
| Whisper | Transformer | 680K hours of audio | Speech-to-text |
Transfer Learning in Practice
Example: Medical imaging A hospital uses a ResNet pretrained on ImageNet, adds a new classifier, and fine-tunes on 5,000 labeled X-rays. Result: 94% accuracy on pneumonia detection — comparable to radiologist performance. Training from scratch on 5,000 images would produce ~60% accuracy at best.
Example: Sentiment analysis A company takes BERT (pretrained on general English text), adds a classification head, and fine-tunes on 20,000 customer reviews. Result: 92% accuracy across 5 sentiment categories. Training a language model from scratch would require millions of examples.
Example: Speech recognition An app uses OpenAI’s Whisper (pretrained on 680,000 hours of audio), fine-tunes on 1,000 hours of domain-specific recordings (medical dictation, legal transcription). Result: domain-specific accuracy that matches or exceeds general-purpose recognition.
✅ Quick Check: Why do early layers of a pretrained CNN (edge detectors, texture detectors) transfer well to almost any visual task, while later layers may not? Early layers learn universal visual features — every image has edges, textures, and basic shapes. These are useful regardless of domain. Later layers learn task-specific features — an ImageNet "dog ear detector" isn't useful for classifying skin lesions. This is why fine-tuning focuses on adapting the later layers while keeping early layers frozen.
Key Takeaways
- Transfer learning reuses pretrained models for new tasks — 90-95% of custom performance with 1% of the data
- Feature extraction: freeze pretrained layers, train only a new classifier — best for small datasets
- Fine-tuning: unfreeze top layers and adapt with a low learning rate — best for medium+ datasets
- Catastrophic forgetting occurs when fine-tuning with too high a learning rate — always use 10-100× lower than training from scratch
- Common pretrained models: ResNet (images), BERT (text), ViT (vision transformer), Whisper (audio)
- Start with feature extraction; escalate to fine-tuning only if accuracy needs improvement
Up Next
You understand how deep learning models are built, trained, and adapted. Lesson 7 shows where they’re deployed — real-world applications across industries, plus the hardware and frameworks that make it all work.