Convolutional Neural Networks
How CNNs detect visual patterns — convolutional filters, pooling layers, parameter sharing, and the architectures from LeNet to EfficientNet.
The Architecture That Sees Patterns
🔄 Lesson 2 showed that images are just grids of numbers and that CV models learn hierarchical features — edges combine into textures, textures into shapes, shapes into objects. But how? The answer is the convolutional neural network, the architecture that transformed computer vision from a research curiosity into an industry standard.
The Convolution Operation
A convolution slides a small filter (typically 3×3 or 5×5 pixels) across the image, computing a weighted sum at each position. The filter acts as a pattern detector.
Example: Edge Detection
A simple vertical edge detector might use this 3×3 filter:
[-1 0 1]
[-1 0 1]
[-1 0 1]
When this filter slides over the image, it produces high values where vertical edges exist (pixels go from dark on the left to bright on the right) and low values in uniform regions. The filter detects vertical edges regardless of where they appear.
This is the key insight: the same filter scans the entire image, reusing its weights at every position. One set of 9 numbers detects vertical edges everywhere in the image. This is parameter sharing — and it’s why CNNs need dramatically fewer parameters than feedforward networks.
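The sliding-filter operation is easy to sketch directly. This is a minimal NumPy implementation (valid padding, stride 1, not the text's own code) applied to the vertical edge filter above, using a toy image that is dark on the left and bright on the right:

```python
import numpy as np

# The 3x3 vertical edge filter from the text.
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]])

def convolve2d(image, kernel):
    """Slide the kernel over the image (valid padding, stride 1),
    computing a weighted sum at each position."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A 6x6 image: dark (0) in the left half, bright (1) in the right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

response = convolve2d(image, kernel)
# The response is high (3.0) in the columns where the window straddles
# the dark-to-bright boundary, and exactly 0 in uniform regions.
```

The same 9 weights produce the response at every position, which is the parameter sharing described above: production frameworks implement this far more efficiently, but the arithmetic is identical.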
CNN Building Blocks
A CNN stacks three types of layers:
Convolutional layers apply multiple filters to detect different patterns. Layer 1 might have 64 filters — one for vertical edges, one for horizontal edges, one for diagonal gradients, and so on. The output is a set of feature maps — one map per filter — showing where each pattern was detected.
Pooling layers reduce spatial dimensions. Max pooling takes the maximum value from each 2×2 region, cutting the height and width in half. This does two things: reduces computation for subsequent layers and makes the detection slightly invariant to exact position (a feature detected at pixel 10 or pixel 11 maps to the same pooled output).
Fully connected layers (at the end) take the flattened feature maps and produce the final classification output.
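Max pooling is also simple enough to sketch. This NumPy snippet (an illustration, not framework code) takes the maximum of each non-overlapping 2×2 region, halving height and width exactly as described:

```python
import numpy as np

def max_pool_2x2(fmap):
    """Take the max of each non-overlapping 2x2 region,
    halving the feature map's height and width."""
    h, w = fmap.shape
    return fmap[:h//2*2, :w//2*2].reshape(h//2, 2, w//2, 2).max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 5, 6, 2],
                 [1, 2, 3, 8]])

pooled = max_pool_2x2(fmap)
# 4x4 -> 2x2: each output cell keeps only the strongest
# activation in its 2x2 neighborhood: [[4, 2], [5, 8]]
```

Note how the top-left output is 4 whether that activation sat at position (0, 0) or (1, 0) of its region; this is the slight positional invariance mentioned above.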
✅ Quick Check: Why does a CNN with 64 filters in the first layer produce 64 feature maps? Each filter scans the entire image and produces one feature map — a 2D grid showing where that specific pattern was detected. 64 filters → 64 different patterns → 64 feature maps. A vertical edge filter produces a map highlighting vertical edges. A horizontal edge filter produces a different map highlighting horizontal edges. Together, these 64 maps capture 64 different aspects of the image’s visual structure.
Feature Hierarchies in Practice
As you stack convolutional layers, the features become progressively more complex:
| Depth | What Filters Detect | Example Patterns |
|---|---|---|
| Layer 1 | Simple edges and gradients | ─ │ ╲ ╱ |
| Layers 2-3 | Textures and corners | Grid patterns, circular edges, T-junctions |
| Layers 4-6 | Object parts | Eyes, wheels, fur texture, window frames |
| Layers 7+ | Full objects and scenes | Faces, cars, buildings, landscapes |
This hierarchical composition is automatic — the network discovers which features are useful through training, not through human engineering.
Landmark CNN Architectures
| Architecture | Year | Depth | Key Innovation | ImageNet Top-5 Error |
|---|---|---|---|---|
| LeNet-5 | 1998 | 5 | First practical CNN (digit recognition) | — |
| AlexNet | 2012 | 8 | GPU training, ReLU, dropout | 16.4% |
| VGG | 2014 | 16-19 | Uniform 3×3 filters throughout | 7.3% |
| GoogLeNet | 2014 | 22 | Inception modules (parallel filter sizes) | 6.7% |
| ResNet | 2015 | 50-152 | Skip connections (residual learning) | 3.6% |
| EfficientNet | 2019 | variable | Compound scaling (width × depth × resolution) | 2.9% |
The trend: deeper networks with architectural innovations that solve the training challenges of depth. AlexNet (2012) proved deep learning worked for vision. ResNet (2015) proved you could go very deep. EfficientNet (2019) proved you could be both deep and efficient.
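ResNet's skip connection is worth seeing in miniature. The sketch below uses plain matrix multiplies in place of convolutions and batch norm (an assumption for brevity, not ResNet's actual layers), but the structure — output = F(x) + x — is the real idea:

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_block(x, w1, w2):
    """Simplified residual block: the output is the transformed
    signal F(x) plus the unchanged input x (the skip connection).
    Real ResNet blocks use convolutions and batch norm; plain
    matrix multiplies stand in here to show the structure."""
    h = np.maximum(0, x @ w1)   # ReLU(x W1)
    fx = h @ w2                 # F(x)
    return fx + x               # skip connection

x = rng.standard_normal(8)
w1 = rng.standard_normal((8, 8)) * 0.01   # near-zero weights
w2 = rng.standard_normal((8, 8)) * 0.01

y = residual_block(x, w1, w2)
# With near-zero weights, F(x) is tiny, so y is approximately x: the block
# defaults to the identity function, and gradients flow through the "+x"
# path untouched. This is why 100+ layer stacks become trainable.
```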
Vision Transformers: The Alternative
Vision Transformers (ViT), introduced in 2020, apply the transformer architecture (from NLP) to images. Instead of convolutional filters, ViT splits the image into patches (typically 16×16 pixels), treats each patch as a “token,” and uses self-attention to capture relationships between patches.
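The patch-splitting step is a pure reshape. This sketch (illustrative, not the ViT reference code) shows how a 224×224 RGB image becomes a sequence of tokens:

```python
import numpy as np

# A 224x224 RGB image split into non-overlapping 16x16 patches.
image = np.zeros((224, 224, 3))
P = 16

h, w, c = image.shape
patches = (image.reshape(h // P, P, w // P, P, c)
                .swapaxes(1, 2)            # group the two patch-grid axes
                .reshape(-1, P * P * c))   # flatten each patch to a vector

# 224/16 = 14 patches per side -> 14 * 14 = 196 tokens,
# each flattened to a 16*16*3 = 768-dimensional vector.
```

Each 768-dimensional vector is then linearly projected and fed to a standard transformer, whose self-attention can relate any patch to any other from the very first layer.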
ViT vs CNN:
| Factor | CNN | Vision Transformer |
|---|---|---|
| Strength | Spatial structure, efficient | Global context from layer 1 |
| Small datasets | Better — built-in spatial bias | Struggles — needs 14M+ images |
| Large datasets | Good | Better — outperforms CNNs |
| Computation | More efficient | More expensive |
| Production use | Dominant (mature tooling) | Growing (hybrid approaches) |
The bottom line: CNNs remain the standard for most production CV tasks. ViTs win on large-scale benchmarks but require more data and compute. Hybrid architectures (CNN backbone + transformer head) are emerging as the best of both worlds.
✅ Quick Check: When would you choose a Vision Transformer over a CNN? When you have a large dataset (millions of images) and sufficient compute. ViTs outperform CNNs on benchmarks like COCO and ImageNet when trained on 14M+ images. But with smaller datasets (thousands to hundreds of thousands of images), CNNs with their built-in spatial biases (locality, translation equivariance) learn more efficiently. For most practical projects, start with a CNN (ResNet, EfficientNet) — switch to ViT only if data and compute justify it.
Key Takeaways
- CNNs use convolutional filters that slide across images, detecting patterns through parameter sharing — dramatically fewer parameters than an equivalent fully connected network
- Feature hierarchy: edges → textures → parts → objects — each layer builds on the last
- Pooling reduces spatial dimensions and adds positional invariance
- ResNet’s skip connections enabled training networks with 100+ layers by solving gradient vanishing
- EfficientNet scales width, depth, and resolution together for optimal efficiency
- Vision Transformers outperform CNNs on large datasets but need more data and compute — CNNs remain the production standard
Up Next
CNNs answer “what is in this image?” But many applications need more: “where exactly are the objects?” Lesson 4 covers object detection — how models like YOLO and Faster R-CNN find and locate every object in an image, in real time.