Convolutional Neural Networks
How CNNs detect visual patterns — convolutional filters, pooling layers, parameter sharing, and the architectures from LeNet to EfficientNet.
The Architecture That Sees Patterns
🔄 Lesson 2 showed that images are just grids of numbers and that CV models learn hierarchical features — edges combine into textures, textures into shapes, shapes into objects. But how? The answer is the convolutional neural network, the architecture that transformed computer vision from a research curiosity into an industry standard.
The Convolution Operation
A convolution slides a small filter (typically 3×3 or 5×5 pixels) across the image, computing a weighted sum at each position. The filter acts as a pattern detector.
Example: Edge Detection
A simple vertical edge detector might use this 3×3 filter:
[-1 0 1]
[-1 0 1]
[-1 0 1]
When this filter slides over the image, it produces high values where vertical edges exist (pixels go from dark on the left to bright on the right) and low values in uniform regions. The filter detects vertical edges regardless of where they appear.
This is the key insight: the same filter scans the entire image, reusing its weights at every position. One set of 9 numbers detects vertical edges everywhere in the image. This is parameter sharing — and it’s why CNNs need dramatically fewer parameters than feedforward networks.
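The sliding-filter operation is easy to sketch directly. This is a minimal NumPy implementation (valid padding, stride 1, not the text's own code) applied to the vertical edge filter above, using a toy image that is dark on the left and bright on the right:

```python
import numpy as np

# The 3x3 vertical edge filter from the text.
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]])

def convolve2d(image, kernel):
    """Slide the kernel over the image (valid padding, stride 1),
    computing a weighted sum at each position."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A 6x6 image: dark (0) in the left half, bright (1) in the right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

response = convolve2d(image, kernel)
# The response is high (3.0) in the columns where the window straddles
# the dark-to-bright boundary, and exactly 0 in uniform regions.
```

The same 9 weights produce the response at every position, which is the parameter sharing described above: production frameworks implement this far more efficiently, but the arithmetic is identical.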
CNN Building Blocks
A CNN stacks three types of layers:
Convolutional layers apply multiple filters to detect different patterns. Layer 1 might have 64 filters — one for vertical edges, one for horizontal edges, one for diagonal gradients, and so on. The output is a set of feature maps — one map per filter — showing where each pattern was detected.
Pooling layers reduce spatial dimensions. Max pooling takes the maximum value from each 2×2 region, cutting the height and width in half. This does two things: reduces computation for subsequent layers and makes the detection slightly invariant to exact position (a feature detected at pixel 10 or pixel 11 maps to the same pooled output).
Fully connected layers (at the end) take the flattened feature maps and produce the final classification output.
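Max pooling is also simple enough to sketch. This NumPy snippet (an illustration, not framework code) takes the maximum of each non-overlapping 2×2 region, halving height and width exactly as described:

```python
import numpy as np

def max_pool_2x2(fmap):
    """Take the max of each non-overlapping 2x2 region,
    halving the feature map's height and width."""
    h, w = fmap.shape
    return fmap[:h//2*2, :w//2*2].reshape(h//2, 2, w//2, 2).max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 5, 6, 2],
                 [1, 2, 3, 8]])

pooled = max_pool_2x2(fmap)
# 4x4 -> 2x2: each output cell keeps only the strongest
# activation in its 2x2 neighborhood: [[4, 2], [5, 8]]
```

Note how the top-left output is 4 whether that activation sat at position (0, 0) or (1, 0) of its region; this is the slight positional invariance mentioned above.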
✅ Quick Check: Why does a CNN with 64 filters in the first layer produce 64 feature maps? Each filter scans the entire image and produces one feature map — a 2D grid showing where that specific pattern was detected. 64 filters → 64 different patterns → 64 feature maps. A vertical edge filter produces a map highlighting vertical edges. A horizontal edge filter produces a different map highlighting horizontal edges. Together, these 64 maps capture 64 different aspects of the image’s visual structure.
Feature Hierarchies in Practice
As you stack convolutional layers, the features become progressively more complex:
| Depth | What Filters Detect | Example Patterns |
|---|---|---|
| Layer 1 | Simple edges and gradients | ─ │ ╲ ╱ |
| Layers 2-3 | Textures and corners | Grid patterns, circular edges, T-junctions |
| Layers 4-6 | Object parts | Eyes, wheels, fur texture, window frames |
| Layers 7+ | Full objects and scenes | Faces, cars, buildings, landscapes |
This hierarchical composition is automatic — the network discovers which features are useful through training, not through human engineering.
Landmark CNN Architectures
| Architecture | Year | Depth | Key Innovation | ImageNet Top-5 Error |
|---|---|---|---|---|
| LeNet-5 | 1998 | 5 | First practical CNN (digit recognition) | — |
| AlexNet | 2012 | 8 | GPU training, ReLU, dropout | 16.4% |
| VGG | 2014 | 16-19 | Uniform 3×3 filters throughout | 7.3% |
| GoogLeNet | 2014 | 22 | Inception modules (parallel filter sizes) | 6.7% |
| ResNet | 2015 | 50-152 | Skip connections (residual learning) | 3.6% |
| EfficientNet | 2019 | variable | Compound scaling (width × depth × resolution) | 2.9% |
The trend: deeper networks with architectural innovations that solve the training challenges of depth. AlexNet (2012) proved deep learning worked for vision. ResNet (2015) proved you could go very deep. EfficientNet (2019) proved you could be both deep and efficient.
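ResNet's skip connection is worth seeing in miniature. The sketch below uses plain matrix multiplies in place of convolutions and batch norm (an assumption for brevity, not ResNet's actual layers), but the structure — output = F(x) + x — is the real idea:

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_block(x, w1, w2):
    """Simplified residual block: the output is the transformed
    signal F(x) plus the unchanged input x (the skip connection).
    Real ResNet blocks use convolutions and batch norm; plain
    matrix multiplies stand in here to show the structure."""
    h = np.maximum(0, x @ w1)   # ReLU(x W1)
    fx = h @ w2                 # F(x)
    return fx + x               # skip connection

x = rng.standard_normal(8)
w1 = rng.standard_normal((8, 8)) * 0.01   # near-zero weights
w2 = rng.standard_normal((8, 8)) * 0.01

y = residual_block(x, w1, w2)
# With near-zero weights, F(x) is tiny, so y is approximately x: the block
# defaults to the identity function, and gradients flow through the "+x"
# path untouched. This is why 100+ layer stacks become trainable.
```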
Vision Transformers: The Alternative
Vision Transformers (ViT), introduced in 2020, apply the transformer architecture (from NLP) to images. Instead of convolutional filters, ViT splits the image into patches (typically 16×16 pixels), treats each patch as a “token,” and uses self-attention to capture relationships between patches.
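The patch-splitting step is a pure reshape. This sketch (illustrative, not the ViT reference code) shows how a 224×224 RGB image becomes a sequence of tokens:

```python
import numpy as np

# A 224x224 RGB image split into non-overlapping 16x16 patches.
image = np.zeros((224, 224, 3))
P = 16

h, w, c = image.shape
patches = (image.reshape(h // P, P, w // P, P, c)
                .swapaxes(1, 2)            # group the two patch-grid axes
                .reshape(-1, P * P * c))   # flatten each patch to a vector

# 224/16 = 14 patches per side -> 14 * 14 = 196 tokens,
# each flattened to a 16*16*3 = 768-dimensional vector.
```

Each 768-dimensional vector is then linearly projected and fed to a standard transformer, whose self-attention can relate any patch to any other from the very first layer.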
ViT vs CNN:
| Factor | CNN | Vision Transformer |
|---|---|---|
| Strength | Spatial structure, efficient | Global context from layer 1 |
| Small datasets | Better — built-in spatial bias | Struggles — needs 14M+ images |
| Large datasets | Good | Better — outperforms CNNs |
| Computation | More efficient | More expensive |
| Production use | Dominant (mature tooling) | Growing (hybrid approaches) |
The bottom line: CNNs remain the standard for most production CV tasks. ViTs win on large-scale benchmarks but require more data and compute. Hybrid architectures (CNN backbone + transformer head) are emerging as the best of both worlds.
✅ Quick Check: When would you choose a Vision Transformer over a CNN? When you have a large dataset (millions of images) and sufficient compute. ViTs outperform CNNs on benchmarks like COCO and ImageNet when trained on 14M+ images. But with smaller datasets (thousands to hundreds of thousands of images), CNNs with their built-in spatial biases (locality, translation equivariance) learn more efficiently. For most practical projects, start with a CNN (ResNet, EfficientNet) — switch to ViT only if data and compute justify it.
Key Takeaways
- CNNs use convolutional filters that slide across images, detecting patterns through parameter sharing — dramatically fewer parameters than an equivalent fully connected network
- Feature hierarchy: edges → textures → parts → objects — each layer builds on the last
- Pooling reduces spatial dimensions and adds positional invariance
- ResNet’s skip connections enabled training networks with 100+ layers by solving gradient vanishing
- EfficientNet scales width, depth, and resolution together for optimal efficiency
- Vision Transformers outperform CNNs on large datasets but need more data and compute — CNNs remain the production standard
Up Next
CNNs answer “what is in this image?” But many applications need more: “where exactly are the objects?” Lesson 4 covers object detection — how models like YOLO and Faster R-CNN find and locate every object in an image, in real time.