How Training Works
How neural networks learn — loss functions, backpropagation, gradient descent, and the training loop that turns random weights into intelligence.
Learning From Mistakes
🔄 Lesson 2 covered the forward pass — how data flows through neurons and layers to produce a prediction. But a network with random weights makes terrible predictions. Training is the process that turns those random weights into useful ones.
The training loop is simple in concept: make a prediction, measure the error, figure out which weights caused the error, adjust them, and repeat. Do this millions of times, and the network learns.
Step 1: The Loss Function
After the forward pass produces a prediction, the loss function measures how wrong it is.
For classification (spam or not spam):
| Prediction | Truth | Loss |
|---|---|---|
| 0.95 (spam) | spam | Very low — almost right |
| 0.50 (uncertain) | spam | Medium — not confident enough |
| 0.10 (not spam) | spam | Very high — completely wrong |
Cross-entropy loss is the standard for classification. It penalizes confident wrong predictions more harshly than uncertain ones. Predicting 0.01 for a true positive is punished much more than predicting 0.40.
For regression (predicting a number):
Mean squared error (MSE) measures the average squared difference between predictions and actual values. Predicting $250K for a $300K house gives a higher loss than predicting $290K.
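Both losses can be computed in a few lines. This is a minimal sketch (the function names and the `eps` clamp are illustrative, not from a specific library), showing why confident wrong predictions hurt more under cross-entropy and why the house-price example behaves as described:

```python
import math

def cross_entropy(prediction: float, label: int) -> float:
    """Binary cross-entropy for a single example (label is 0 or 1)."""
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(prediction, eps), 1 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

def mse(predictions: list[float], targets: list[float]) -> float:
    """Mean squared error over a batch of regression predictions."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)

# Confident wrong answers are punished far more than uncertain ones:
print(cross_entropy(0.95, 1))  # low loss: almost right
print(cross_entropy(0.50, 1))  # medium loss: not confident enough
print(cross_entropy(0.01, 1))  # very high loss: confidently wrong

# A $290K guess beats a $250K guess for a $300K house:
print(mse([290_000.0], [300_000.0]))  # error of 10K, squared
print(mse([250_000.0], [300_000.0]))  # error of 50K, squared: 25x the loss
```

Note how squaring in MSE means a 5x larger error produces a 25x larger loss, so large mistakes dominate the signal.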
The loss function is fixed — you choose it before training and it doesn’t change. What changes is the predictions, which improve as weights get adjusted.
Step 2: Backpropagation
The forward pass runs input → output. Backpropagation runs backward — output → input — calculating how much each weight contributed to the error.
Analogy: Imagine a factory production line with 5 stations. The final product is defective. You need to trace back through each station to find which ones caused the defect and how much each one contributed. Station 4 might be 60% responsible, Station 2 might be 30%, and Station 5 might be 10%.
Backpropagation does exactly this, using calculus (the chain rule) to compute how much each weight in every layer influenced the final loss. These computed values are called gradients — each gradient tells you both the direction and magnitude of the change needed for that specific weight.
Key insight: Backpropagation doesn’t change the weights — it just calculates the gradients. The actual weight updates happen in the next step.
✅ Quick Check: Why does backpropagation work backward (output to input) instead of forward? Because the error is measured at the output. To find each weight’s contribution, you need to trace the error back through the layers that produced it — like following a river upstream to find the source. The chain rule of calculus makes this efficient: each layer’s gradient depends on the gradients of the layer after it, so computing backward is naturally sequential from output to input.
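The chain rule is easiest to see on a tiny network. Here is a sketch with one weight per layer (the names `w1`, `w2`, `h` are illustrative): the error signal starts at the loss and each gradient is built from the gradient of the layer after it, exactly the backward sequence described above.

```python
# A minimal two-layer "network": x -> h -> y_pred, one weight per layer.
x = 2.0          # input
y_true = 1.0     # target
w1, w2 = 0.5, 0.3

# Forward pass (input -> output):
h = w1 * x                       # first layer output: 1.0
y_pred = w2 * h                  # second layer output: 0.3
loss = (y_pred - y_true) ** 2    # squared error: 0.49

# Backward pass (output -> input), chaining derivatives layer by layer:
dloss_dypred = 2 * (y_pred - y_true)  # dL/dy_pred, measured at the output
dloss_dw2 = dloss_dypred * h          # dL/dw2 = dL/dy_pred * dy_pred/dw2
dloss_dh = dloss_dypred * w2          # propagate the error one layer back
dloss_dw1 = dloss_dh * x              # dL/dw1 = dL/dh * dh/dw1

print(dloss_dw2, dloss_dw1)  # gradients only; no weight has changed yet
```

Notice that `dloss_dw1` reuses `dloss_dh`, which reuses `dloss_dypred` — this reuse is why computing backward is efficient, and why the gradients must be computed in output-to-input order.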
Step 3: Gradient Descent
Gradient descent uses the gradients from backpropagation to actually update the weights.
Analogy: You’re standing on a mountain in thick fog. You can’t see the valley floor (the optimal weights), but you can feel which direction the ground slopes beneath your feet (the gradient). Take a step downhill. Feel the slope again. Step again. Eventually, you reach the bottom.
New weight = Old weight - (learning rate × gradient)
The learning rate controls the step size:
- Too large → you overshoot the valley and bounce around
- Too small → you take tiny steps and training takes forever
- Just right → you converge to a good solution in reasonable time
Common learning rates: 0.001 to 0.01 for most tasks. But the right value depends on the problem, the architecture, and the optimizer.
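The update rule and the three learning-rate regimes can be demonstrated on a one-dimensional "valley." This sketch uses the toy loss `(w - 3)²`, whose gradient is `2(w - 3)` and whose minimum sits at `w = 3` (the loss and values are illustrative):

```python
def grad(w: float) -> float:
    """Gradient of the toy loss (w - 3)^2."""
    return 2 * (w - 3.0)

def descend(lr: float, steps: int = 20, w: float = 0.0) -> float:
    """Run plain gradient descent and return the final weight."""
    for _ in range(steps):
        w = w - lr * grad(w)  # new weight = old weight - (learning rate * gradient)
    return w

print(descend(lr=0.1))    # just right: converges near 3.0
print(descend(lr=0.001))  # too small: barely moves from 0 in 20 steps
print(descend(lr=1.1))    # too large: overshoots and bounces further out each step
```

With `lr=1.1` each step multiplies the distance to the valley floor by 1.2, so the weight diverges — the "bouncing around" described above, taken to its extreme.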
The Training Loop
Put it all together and training follows this cycle:
1. Forward pass: Input → prediction
2. Loss: Compare prediction to truth
3. Backpropagation: Calculate gradients for every weight
4. Gradient descent: Update weights to reduce loss
5. Repeat with next batch of data
One complete pass through the entire training dataset is called an epoch. Training typically runs for 10 to 100+ epochs — meaning the network sees every training example dozens of times, refining its weights each time.
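The five steps above fit in a dozen lines for a toy problem. This sketch learns `y = 2x` with a single weight (everything here — the data, learning rate, and epoch count — is illustrative; real training uses mini-batches rather than one example at a time):

```python
import random

data = [(x, 2.0 * x) for x in range(1, 11)]  # (input, target) pairs for y = 2x
w = random.uniform(-1, 1)                    # start from a random weight
lr = 0.01

for epoch in range(50):                      # each epoch = one full pass over the data
    random.shuffle(data)
    for x, y_true in data:                   # batch size of 1, for clarity
        y_pred = w * x                       # 1. forward pass
        loss = (y_pred - y_true) ** 2        # 2. loss: compare prediction to truth
        grad_w = 2 * (y_pred - y_true) * x   # 3. backpropagation (chain rule)
        w -= lr * grad_w                     # 4. gradient descent: update the weight
                                             # 5. repeat: the loops handle it

print(round(w, 3))  # close to 2.0 — the network "discovered" y = 2x
```

The random starting weight produces terrible predictions at first; the loop is what turns it into the right one.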
Optimizers: Smarter Gradient Descent
Plain gradient descent has limitations. Modern training uses optimizers that improve on the basic approach:
| Optimizer | How It Improves | Use Case |
|---|---|---|
| SGD (Stochastic Gradient Descent) | Updates after each small batch, not the full dataset | Baseline, simple tasks |
| Adam | Adapts learning rate per parameter automatically | Default choice for most tasks |
| AdamW | Adam with weight decay for better regularization | Transformer training |
Adam is the default in practice. It maintains separate learning rates for each parameter and adapts them based on the gradient history — faster where it’s safe, slower where it’s sensitive. You still set an initial learning rate, but Adam adjusts from there.
✅ Quick Check: Two networks are being trained. Network A has loss [3.2, 2.1, 1.4, 0.9, 0.5, 0.3] over 6 epochs. Network B has loss [3.2, 0.1, 3.8, 0.2, 4.1, 0.15] over 6 epochs. Which is training better? Network A — the loss decreases steadily. Network B oscillates wildly, suggesting the learning rate is too high. The weight updates are too large, causing the network to overshoot the optimum repeatedly. Network B needs a lower learning rate.
Epochs, Batches, and Iterations
These terms describe how training data is organized:
| Term | Meaning | Example (10,000 images, batch size 32) |
|---|---|---|
| Batch | A subset of training data processed together | 32 images at once |
| Iteration | One batch through forward pass + backprop + update | 1 of 313 iterations |
| Epoch | One complete pass through all training data | All 10,000 images (313 iterations) |
Why batches? Processing the entire dataset at once (batch gradient descent) is memory-intensive and slow. Processing one example at a time (stochastic) is noisy and inefficient. Mini-batches (typically 32-256 examples) balance memory efficiency, training speed, and gradient quality.
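The arithmetic from the table is worth spelling out — the last batch is usually smaller than the rest, which is why the count rounds up:

```python
import math

dataset_size = 10_000
batch_size = 32

# 312 full batches of 32 cover 9,984 images; one final batch of 16 covers the rest.
iterations_per_epoch = math.ceil(dataset_size / batch_size)
print(iterations_per_epoch)  # 313

# Over, say, 50 epochs, the network performs this many weight updates:
print(50 * iterations_per_epoch)  # 15,650
```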
Key Takeaways
- The loss function measures how wrong the prediction is — cross-entropy for classification, MSE for regression
- Backpropagation traces errors backward through layers, calculating each weight’s contribution to the loss
- Gradient descent updates weights in the direction that reduces loss — the learning rate controls step size
- The training loop: forward pass → loss → backpropagation → gradient descent → repeat
- Adam optimizer adapts learning rates per parameter — it’s the default choice for most training
- Training runs for multiple epochs, processing data in mini-batches (typically 32-256 examples)
Up Next
You now understand how networks make predictions and how they learn. Lesson 4 introduces the specialized architectures — CNNs for images, RNNs for sequences, and transformers for everything else. Each solves a specific problem that basic feedforward networks can’t handle.