Architectures
The four main deep learning architectures — feedforward, CNN, RNN, and transformer. What each does and when to use it.
The Right Network for the Job
🔄 Lessons 2 and 3 covered how individual neurons work and how networks learn through backpropagation. But a basic feedforward network — where every neuron connects to every neuron in the next layer — isn’t ideal for every type of data.
Images have spatial structure. Text has sequential structure. Audio has temporal patterns. Specialized architectures exploit these structures to learn more effectively.
Feedforward Networks (Dense/MLP)
How they work: Every neuron in one layer connects to every neuron in the next. Data flows in one direction: input → hidden → output.
Strengths: Simple to understand, simple to implement, works for tabular data (spreadsheets, databases).
Weakness: No sense of structure. For images, a feedforward network treats pixel 1 and pixel 1000 the same — it doesn’t know that adjacent pixels are related. For text, it doesn’t know that word order matters.
Use when: Structured tabular data, simple classification tasks, as the final layers of more complex architectures.
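The "every neuron connects to every neuron" idea can be sketched in a few lines of plain Python. This is a minimal toy forward pass with illustrative, hand-picked weights (not a trained network): three inputs feed two hidden neurons, which feed one output, with a ReLU activation at each layer.

```python
def dense_forward(x, weights, biases):
    """One fully connected layer: every input feeds every neuron,
    followed by a ReLU activation (negative sums become zero)."""
    return [max(0.0, sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            for w, b in zip(weights, biases)]

# Toy network: 3 inputs -> 2 hidden neurons -> 1 output.
# All weights here are illustrative, not learned.
x = [0.5, -1.0, 2.0]
hidden = dense_forward(x,
                       weights=[[0.1, 0.2, 0.3], [-0.4, 0.5, 0.6]],
                       biases=[0.0, 0.1])
output = dense_forward(hidden, weights=[[1.0, 1.0]], biases=[0.0])
print(output)
```

Note that the input is just a flat list of numbers: the network has no idea whether those numbers came from a spreadsheet row or from neighboring pixels, which is exactly the structural blindness described above.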
CNNs (Convolutional Neural Networks)
The problem they solve: Images have spatial structure — nearby pixels relate to each other. A feedforward network ignores this, needing enormous numbers of parameters to process images.
How they work: Small filters (typically 3×3 or 5×5 pixels) slide across the image, detecting local patterns at each position:
- Convolutional layers: Filters detect features — edges, corners, textures
- Pooling layers: Reduce the spatial dimensions, keeping the important information
- Dense layers: Combine the detected features into a classification
Why convolution is clever: The same 3×3 filter scans the entire image. A “vertical edge” filter detects vertical edges whether they’re in the top-left, center, or bottom-right. This is called parameter sharing — instead of learning separate weights for each pixel position, the network learns one set of weights that applies everywhere.
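Parameter sharing can be demonstrated directly. The sketch below (plain Python, no padding, stride 1) slides one 3×3 vertical-edge filter across a tiny 5×5 image; the same nine weights fire wherever the left and right neighbors differ, so the edge is found no matter where it sits.

```python
def convolve2d(image, kernel):
    """Slide a small kernel over the image (no padding, stride 1),
    reusing the same weights at every position -- parameter sharing."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    out = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            row.append(sum(kernel[ki][kj] * image[i + ki][j + kj]
                           for ki in range(kh) for kj in range(kw)))
        out.append(row)
    return out

# A 3x3 vertical-edge filter: responds where left and right columns differ.
vertical_edge = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

# 5x5 image: dark on the left, bright on the right -> vertical edge.
image = [[0, 0, 1, 1, 1]] * 5

print(convolve2d(image, vertical_edge))
```

The output is large exactly at the columns straddling the dark-to-bright boundary and zero where the image is flat, and those nine kernel weights are the only parameters involved regardless of image size.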
Hierarchical features:
- Layer 1 filters detect edges (vertical, horizontal, diagonal)
- Layer 2 combines edges into textures and corners
- Layer 3 combines textures into shapes (circles, rectangles)
- Deeper layers combine shapes into objects (faces, cars, tumors)
✅ Quick Check: Why does a CNN need far fewer parameters than a feedforward network for image classification? A feedforward network connects every input pixel to every hidden neuron. For a 224×224 image (50,176 pixels) with a 1,000-neuron hidden layer, that’s over 50 million connections in the first layer alone. A CNN uses small filters (3×3 = 9 weights) that slide across the entire image, reusing the same 9 weights at every position. With 64 filters, that’s just 576 parameters for the first layer — roughly 87,000× fewer than the feedforward approach.
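The arithmetic in this quick check is easy to verify directly (counting weights only; bias terms are omitted for simplicity):

```python
# Feedforward first layer: every pixel connects to every hidden neuron.
pixels = 224 * 224               # 50,176 input pixels
hidden_neurons = 1_000
dense_params = pixels * hidden_neurons
print(dense_params)              # 50,176,000 weights

# CNN first layer: 64 filters of 3x3 shared weights.
filters, kernel_weights = 64, 3 * 3
conv_params = filters * kernel_weights
print(conv_params)               # 576 weights

print(dense_params // conv_params)   # ~87,000x fewer parameters
```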
RNNs (Recurrent Neural Networks)
The problem they solve: Sequential data — text, time series, audio — where order matters. “The dog bit the man” means something very different from “The man bit the dog.”
How they work: RNNs maintain a “hidden state” — a memory of what they’ve seen so far. At each step, they take the current input AND the previous hidden state, producing a new hidden state and an output.
Step 1: Input "The" + initial state → hidden state₁
Step 2: Input "cat" + hidden state₁ → hidden state₂
Step 3: Input "sat" + hidden state₂ → hidden state₃
Each hidden state carries information about everything the network has seen up to that point.
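The three steps above can be sketched as a loop. This is a deliberately tiny scalar RNN (real RNNs use weight matrices and vector states) with illustrative weights, using number stand-ins for the word inputs:

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """One recurrent step: combine the current input with the previous
    hidden state, squash with tanh to produce the new hidden state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

# Toy scalar RNN over a 3-step sequence (illustrative weights).
inputs = [1.0, 0.5, -0.5]      # numeric stand-ins for "The", "cat", "sat"
h = 0.0                        # initial state: no memory yet
states = []
for x in inputs:
    h = rnn_step(x, h, w_x=0.8, w_h=0.5, b=0.0)
    states.append(h)

print(states)  # each state depends on every input seen so far
```

The key point is the `w_h * h_prev` term: information from step 1 can only reach step 50 by being multiplied through that recurrence 49 times, which is where the shrinking gradients described next come from.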
The vanishing gradient problem: As sequences get longer, gradients computed during backpropagation shrink exponentially as they propagate back through time. By step 50, the gradient from step 1 is effectively zero — the network can’t learn connections between distant elements.
LSTM (Long Short-Term Memory): Solves the vanishing gradient problem with three gates:
- Forget gate: Decides what to discard from memory
- Input gate: Decides what new information to store
- Output gate: Decides what to output from the current state
These gates let LSTMs maintain information over hundreds of steps — a significant improvement over basic RNNs.
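The gate mechanics can be sketched with a scalar toy LSTM cell (real LSTMs use weight matrices; all parameters here are illustrative). The demo biases the forget gate near 1 so the memory cell retains its contents across 50 steps of uninformative input:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step (scalar toy version). Three sigmoid gates decide
    what to forget, what to write, and what to expose."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])  # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])  # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])  # output gate
    c_cand = math.tanh(p["wc"] * x + p["uc"] * h_prev + p["bc"])
    c = f * c_prev + i * c_cand     # memory cell: keep old + add new
    h = o * math.tanh(c)            # hidden state: gated view of memory
    return h, c

# Illustrative parameters: bf=4 makes the forget gate ~ sigmoid(4) ~ 0.98,
# so the cell preserves most of its memory at every step.
params = dict(wf=0.0, uf=0.0, bf=4.0,
              wi=1.0, ui=0.0, bi=0.0,
              wo=0.0, uo=0.0, bo=4.0,
              wc=1.0, uc=0.0, bc=0.0)

h, c = 0.0, 1.0                  # start with something stored in memory
for x in [0.0] * 50:             # 50 steps of "nothing new"
    h, c = lstm_step(x, h, c, params)
print(c)                         # memory decays slowly: ~0.98^50 of original
```

Because the cell state is updated additively (`f * c_prev + i * c_cand`) rather than squashed through a tanh at every step, gradients flowing through `c` survive far longer than in a basic RNN.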
Transformers
The problem they solve: RNNs process sequentially (word by word), which is slow and loses information over long distances. Transformers process entire sequences simultaneously.
How they work: The self-attention mechanism computes relationships between every pair of elements in the input. For a 100-word sentence, it calculates 10,000 pairwise relationships — word 1 to word 1, word 1 to word 2, …, word 100 to word 100.
Why this matters: In “The cat that the dog chased ran up the tree,” attention connects “cat” directly to “ran” even though four words separate them. No information degradation. No vanishing gradients.
The trade-off: Attention is computationally expensive. Quadratic complexity means doubling the sequence length quadruples the computation. This is why context windows have limits (128K, 200K tokens) — there’s a practical ceiling on how much attention can process.
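The pairwise computation described above can be sketched in plain Python. This is a deliberately simplified self-attention (real transformers first project the input into separate query, key, and value vectors; here the input embeddings serve as all three, and the token values are illustrative):

```python
import math

def softmax(scores):
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Simplified self-attention: every position attends to every other.
    For n positions this computes n * n scores -- the quadratic cost."""
    d = len(X[0])
    out = []
    for q in X:                                    # one query per position
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in X]                      # n scores per query
        weights = softmax(scores)                  # weights sum to 1
        out.append([sum(w * v[j] for w, v in zip(weights, X))
                    for j in range(d)])            # mix of ALL positions
    return out

# 4 token embeddings of dimension 3 (illustrative values).
X = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [1.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
Y = self_attention(X)
print(Y)  # each output row blends information from all four tokens
```

The nested loop over queries and keys makes the quadratic cost visible: 100 tokens means 10,000 score computations, 200 tokens means 40,000, which is exactly the trade-off behind context-window limits.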
Dominance: Transformers power every major language model (GPT-4, Claude, Gemini), vision models (ViT), and audio models. Their parallel processing maps perfectly to GPU hardware.
Architecture Comparison
| | Feedforward | CNN | RNN/LSTM | Transformer |
|---|---|---|---|---|
| Best for | Tabular data | Images, spatial | Time series, short sequences | Text, long sequences |
| Structure awareness | None | Spatial (2D) | Sequential | Global (attention) |
| Parallelizable | Yes | Yes | No (sequential) | Yes |
| Long-range dependency | N/A | Limited | Weak (RNN) / Better (LSTM) | Strong |
| Example models | Simple classifiers | ResNet, VGG | LSTM networks | GPT, BERT, ViT |
✅ Quick Check: Voice assistants (Siri, Alexa) convert speech audio into text. What architecture best handles the audio-to-text task? Transformers — modern speech recognition (Whisper by OpenAI) uses transformer architectures. Audio is represented as spectrograms (frequency over time), and the transformer’s self-attention captures both local acoustic patterns and long-range dependencies (connecting a word’s beginning to its end, maintaining context across phrases). Earlier systems used RNNs/LSTMs, but transformers achieved higher accuracy with faster training.
Key Takeaways
- Feedforward networks are simple but ignore data structure — use for tabular data
- CNNs use convolutional filters that slide across images, detecting spatial patterns with far fewer parameters than feedforward
- RNNs process sequences step-by-step with a hidden state — limited by vanishing gradients on long sequences
- LSTMs add gates to RNNs, solving the vanishing gradient problem for medium-length sequences
- Transformers use self-attention to process entire sequences at once — dominant for text, increasingly used for images and audio
- Architecture choice depends on data type, sequence length, speed requirements, and available compute
Up Next
These architectures are powerful — but they all face the same enemy: overfitting. Lesson 5 covers why neural networks memorize instead of learn, and the techniques that fix it — dropout, batch normalization, regularization, and data augmentation.