Architectures
The four main deep learning architectures — feedforward, CNN, RNN, and transformer. What each does and when to use it.
The Right Network for the Job
🔄 Lessons 2 and 3 covered how individual neurons work and how networks learn through backpropagation. But a basic feedforward network — where every neuron connects to every neuron in the next layer — isn’t ideal for every type of data.
Images have spatial structure. Text has sequential structure. Audio has temporal patterns. Specialized architectures exploit these structures to learn more effectively.
Feedforward Networks (Dense/MLP)
How they work: Every neuron in one layer connects to every neuron in the next. Data flows in one direction: input → hidden → output.
Strengths: Simple to understand, simple to implement, works for tabular data (spreadsheets, databases).
Weakness: No sense of structure. For images, a feedforward network treats pixel 1 and pixel 1000 the same — it doesn’t know that adjacent pixels are related. For text, it doesn’t know that word order matters.
Use when: Structured tabular data, simple classification tasks, as the final layers of more complex architectures.
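The "every neuron connects to every neuron" idea can be sketched in a few lines of plain Python. This is a minimal toy forward pass with illustrative, hand-picked weights (not a trained network): three inputs feed two hidden neurons, which feed one output, with a ReLU activation at each layer.

```python
def dense_forward(x, weights, biases):
    """One fully connected layer: every input feeds every neuron,
    followed by a ReLU activation (negative sums become zero)."""
    return [max(0.0, sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            for w, b in zip(weights, biases)]

# Toy network: 3 inputs -> 2 hidden neurons -> 1 output.
# All weights here are illustrative, not learned.
x = [0.5, -1.0, 2.0]
hidden = dense_forward(x,
                       weights=[[0.1, 0.2, 0.3], [-0.4, 0.5, 0.6]],
                       biases=[0.0, 0.1])
output = dense_forward(hidden, weights=[[1.0, 1.0]], biases=[0.0])
print(output)
```

Note that the input is just a flat list of numbers: the network has no idea whether those numbers came from a spreadsheet row or from neighboring pixels, which is exactly the structural blindness described above.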
CNNs (Convolutional Neural Networks)
The problem they solve: Images have spatial structure — nearby pixels relate to each other. A feedforward network ignores this, needing enormous numbers of parameters to process images.
How they work: Small filters (typically 3×3 or 5×5 pixels) slide across the image, detecting local patterns at each position:
- Convolutional layers: Filters detect features — edges, corners, textures
- Pooling layers: Reduce the spatial dimensions, keeping the important information
- Dense layers: Combine the detected features into a classification
Why convolution is clever: The same 3×3 filter scans the entire image. A “vertical edge” filter detects vertical edges whether they’re in the top-left, center, or bottom-right. This is called parameter sharing — instead of learning separate weights for each pixel position, the network learns one set of weights that applies everywhere.
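Parameter sharing can be demonstrated directly. The sketch below (plain Python, no padding, stride 1) slides one 3×3 vertical-edge filter across a tiny 5×5 image; the same nine weights fire wherever the left and right neighbors differ, so the edge is found no matter where it sits.

```python
def convolve2d(image, kernel):
    """Slide a small kernel over the image (no padding, stride 1),
    reusing the same weights at every position -- parameter sharing."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    out = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            row.append(sum(kernel[ki][kj] * image[i + ki][j + kj]
                           for ki in range(kh) for kj in range(kw)))
        out.append(row)
    return out

# A 3x3 vertical-edge filter: responds where left and right columns differ.
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

# 5x5 image: dark on the left, bright on the right -> vertical edge.
image = [[0, 0, 1, 1, 1]] * 5

print(convolve2d(image, vertical_edge))
```

The output is large exactly at the columns straddling the dark-to-bright boundary and zero where the image is flat, and those nine kernel weights are the only parameters involved regardless of image size.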
Hierarchical features:
- Layer 1 filters detect edges (vertical, horizontal, diagonal)
- Layer 2 combines edges into textures and corners
- Layer 3 combines textures into shapes (circles, rectangles)
- Deeper layers combine shapes into objects (faces, cars, tumors)
✅ Quick Check: Why does a CNN need far fewer parameters than a feedforward network for image classification? A feedforward network connects every input pixel to every hidden neuron. For a 224×224 image (50,176 pixels) with a 1,000-neuron hidden layer, that’s over 50 million connections in the first layer alone. A CNN uses small filters (3×3 = 9 weights) that slide across the entire image, reusing the same 9 weights at every position. With 64 filters, that’s just 576 parameters for the first layer — roughly 87,000× fewer than the feedforward approach.
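The arithmetic in this quick check is easy to verify directly (counting weights only; bias terms are omitted for simplicity):

```python
# Feedforward first layer: every pixel connects to every hidden neuron.
pixels = 224 * 224               # 50,176 input pixels
hidden_neurons = 1_000
dense_params = pixels * hidden_neurons
print(dense_params)              # 50,176,000 weights

# CNN first layer: 64 filters of 3x3 shared weights.
filters, kernel_weights = 64, 3 * 3
conv_params = filters * kernel_weights
print(conv_params)               # 576 weights

print(dense_params // conv_params)   # ~87,000x fewer parameters
```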
RNNs (Recurrent Neural Networks)
The problem they solve: Sequential data — text, time series, audio — where order matters. “The dog bit the man” means something very different from “The man bit the dog.”
How they work: RNNs maintain a “hidden state” — a memory of what they’ve seen so far. At each step, they take the current input AND the previous hidden state, producing a new hidden state and an output.
Step 1: Input "The" + initial state → hidden state₁
Step 2: Input "cat" + hidden state₁ → hidden state₂
Step 3: Input "sat" + hidden state₂ → hidden state₃
Each hidden state carries information about everything the network has seen up to that point.
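The three steps above can be sketched as a loop. This is a deliberately tiny scalar RNN (real RNNs use weight matrices and vector states) with illustrative weights, using number stand-ins for the word inputs:

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """One recurrent step: combine the current input with the previous
    hidden state, squash with tanh to produce the new hidden state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

# Toy scalar RNN over a 3-step sequence (illustrative weights).
inputs = [1.0, 0.5, -0.5]      # numeric stand-ins for "The", "cat", "sat"
h = 0.0                        # initial state: no memory yet
states = []
for x in inputs:
    h = rnn_step(x, h, w_x=0.8, w_h=0.5, b=0.0)
    states.append(h)

print(states)  # each state depends on every input seen so far
```

The key point is the `w_h * h_prev` term: information from step 1 can only reach step 50 by being multiplied through that recurrence 49 times, which is where the shrinking gradients described next come from.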
The vanishing gradient problem: As sequences get longer, gradients computed during backpropagation shrink exponentially as they propagate back through time. By step 50, the gradient from step 1 is effectively zero — the network can’t learn connections between distant elements.
LSTM (Long Short-Term Memory): Solves the vanishing gradient problem with three gates:
- Forget gate: Decides what to discard from memory
- Input gate: Decides what new information to store
- Output gate: Decides what to output from the current state
These gates let LSTMs maintain information over hundreds of steps — a significant improvement over basic RNNs.
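The gate mechanics can be sketched with a scalar toy LSTM cell (real LSTMs use weight matrices; all parameters here are illustrative). The demo biases the forget gate near 1 so the memory cell retains its contents across 50 steps of uninformative input:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step (scalar toy version). Three sigmoid gates decide
    what to forget, what to write, and what to expose."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])  # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])  # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])  # output gate
    c_cand = math.tanh(p["wc"] * x + p["uc"] * h_prev + p["bc"])
    c = f * c_prev + i * c_cand     # memory cell: keep old + add new
    h = o * math.tanh(c)            # hidden state: gated view of memory
    return h, c

# Illustrative parameters: bf=4 makes the forget gate ~ sigmoid(4) ~ 0.98,
# so the cell preserves most of its memory at every step.
params = dict(wf=0.0, uf=0.0, bf=4.0,
              wi=1.0, ui=0.0, bi=0.0,
              wo=0.0, uo=0.0, bo=4.0,
              wc=1.0, uc=0.0, bc=0.0)

h, c = 0.0, 1.0                  # start with something stored in memory
for x in [0.0] * 50:             # 50 steps of "nothing new"
    h, c = lstm_step(x, h, c, params)
print(c)                         # memory decays slowly: ~0.98^50 of original
```

Because the cell state is updated additively (`f * c_prev + i * c_cand`) rather than squashed through a tanh at every step, gradients flowing through `c` survive far longer than in a basic RNN.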
Transformers
The problem they solve: RNNs process sequentially (word by word), which is slow and loses information over long distances. Transformers process entire sequences simultaneously.
How they work: The self-attention mechanism computes relationships between every pair of elements in the input. For a 100-word sentence, it calculates 10,000 pairwise relationships — word 1 to word 1, word 1 to word 2, …, word 100 to word 100.
Why this matters: In “The cat that the dog chased ran up the tree,” attention connects “cat” directly to “ran” even though four words separate them. No information degradation. No vanishing gradients.
The trade-off: Attention is computationally expensive. Quadratic complexity means doubling the sequence length quadruples the computation. This is why context windows have limits (128K, 200K tokens) — there’s a practical ceiling on how much attention can process.
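The pairwise computation described above can be sketched in plain Python. This is a deliberately simplified self-attention (real transformers first project the input into separate query, key, and value vectors; here the input embeddings serve as all three, and the token values are illustrative):

```python
import math

def softmax(scores):
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Simplified self-attention: every position attends to every other.
    For n positions this computes n * n scores -- the quadratic cost."""
    d = len(X[0])
    out = []
    for q in X:                                    # one query per position
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in X]                      # n scores per query
        weights = softmax(scores)                  # weights sum to 1
        out.append([sum(w * v[j] for w, v in zip(weights, X))
                    for j in range(d)])            # mix of ALL positions
    return out

# 4 token embeddings of dimension 3 (illustrative values).
X = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [1.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
Y = self_attention(X)
print(Y)  # each output row blends information from all four tokens
```

The nested loop over queries and keys makes the quadratic cost visible: 100 tokens means 10,000 score computations, 200 tokens means 40,000, which is exactly the trade-off behind context-window limits.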
Dominance: Transformers power every major language model (GPT-4, Claude, Gemini), vision models (ViT), and audio models. Their parallel processing maps perfectly to GPU hardware.
Architecture Comparison
| | Feedforward | CNN | RNN/LSTM | Transformer |
|---|---|---|---|---|
| Best for | Tabular data | Images, spatial | Time series, short sequences | Text, long sequences |
| Structure awareness | None | Spatial (2D) | Sequential | Global (attention) |
| Parallelizable | Yes | Yes | No (sequential) | Yes |
| Long-range dependency | N/A | Limited | Weak (RNN) / Better (LSTM) | Strong |
| Example models | Simple classifiers | ResNet, VGG | LSTM networks | GPT, BERT, ViT |
✅ Quick Check: Voice assistants (Siri, Alexa) convert speech audio into text. What architecture best handles the audio-to-text task? Transformers — modern speech recognition (Whisper by OpenAI) uses transformer architectures. Audio is represented as spectrograms (frequency over time), and the transformer’s self-attention captures both local acoustic patterns and long-range dependencies (connecting a word’s beginning to its end, maintaining context across phrases). Earlier systems used RNNs/LSTMs, but transformers achieved higher accuracy with faster training.
Key Takeaways
- Feedforward networks are simple but ignore data structure — use for tabular data
- CNNs use convolutional filters that slide across images, detecting spatial patterns with far fewer parameters than feedforward
- RNNs process sequences step-by-step with a hidden state — limited by vanishing gradients on long sequences
- LSTMs add gates to RNNs, solving the vanishing gradient problem for medium-length sequences
- Transformers use self-attention to process entire sequences at once — dominant for text, increasingly used for images and audio
- Architecture choice depends on data type, sequence length, speed requirements, and available compute
Up Next
These architectures are powerful — but they all face the same enemy: overfitting. Lesson 5 covers why neural networks memorize instead of learn, and the techniques that fix it — dropout, batch normalization, regularization, and data augmentation.