How Training Works
How neural networks learn — loss functions, backpropagation, gradient descent, and the training loop that turns random weights into intelligence.
Learning From Mistakes
🔄 Lesson 2 covered the forward pass — how data flows through neurons and layers to produce a prediction. But a network with random weights makes terrible predictions. Training is the process that turns those random weights into useful ones.
The training loop is simple in concept: make a prediction, measure the error, figure out which weights caused the error, adjust them, and repeat. Do this millions of times, and the network learns.
Step 1: The Loss Function
After the forward pass produces a prediction, the loss function measures how wrong it is.
For classification (spam or not spam):
| Prediction | Truth | Loss |
|---|---|---|
| 0.95 (spam) | spam | Very low — almost right |
| 0.50 (uncertain) | spam | Medium — not confident enough |
| 0.10 (not spam) | spam | Very high — completely wrong |
Cross-entropy loss is the standard for classification. It penalizes confident wrong predictions more harshly than uncertain ones. Predicting 0.01 for a true positive is punished much more than predicting 0.40.
For regression (predicting a number):
Mean squared error (MSE) measures the average squared difference between predictions and actual values. Predicting $250K for a $300K house gives a higher loss than predicting $290K.
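Both losses can be computed in a few lines. This is a minimal sketch (the function names and the `eps` clamp are illustrative, not from a specific library), showing why confident wrong predictions hurt more under cross-entropy and why the house-price example behaves as described:

```python
import math

def cross_entropy(prediction: float, label: int) -> float:
    """Binary cross-entropy for a single example (label is 0 or 1)."""
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(prediction, eps), 1 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

def mse(predictions: list[float], targets: list[float]) -> float:
    """Mean squared error over a batch of regression predictions."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(predictions)

# Confident wrong answers are punished far more than uncertain ones:
print(cross_entropy(0.95, 1))  # low loss: almost right
print(cross_entropy(0.50, 1))  # medium loss: not confident enough
print(cross_entropy(0.01, 1))  # very high loss: confidently wrong

# A $290K guess beats a $250K guess for a $300K house:
print(mse([290_000.0], [300_000.0]))  # error of 10K, squared
print(mse([250_000.0], [300_000.0]))  # error of 50K, squared: 25x the loss
```

Note how squaring in MSE means a 5x larger error produces a 25x larger loss, so large mistakes dominate the signal.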
The loss function is fixed — you choose it before training and it doesn’t change. What changes is the predictions, which improve as weights get adjusted.
Step 2: Backpropagation
The forward pass runs input → output. Backpropagation runs backward — output → input — calculating how much each weight contributed to the error.
Analogy: Imagine a factory production line with 5 stations. The final product is defective. You need to trace back through each station to find which ones caused the defect and how much each one contributed. Station 4 might be 60% responsible, Station 2 might be 30%, and Station 5 might be 10%.
Backpropagation does exactly this, using calculus (the chain rule) to compute how much each weight in every layer influenced the final loss. These computed values are called gradients — each gradient tells you both the direction and magnitude of the change needed for that specific weight.
Key insight: Backpropagation doesn’t change the weights — it just calculates the gradients. The actual weight updates happen in the next step.
✅ Quick Check: Why does backpropagation work backward (output to input) instead of forward? Because the error is measured at the output. To find each weight’s contribution, you need to trace the error back through the layers that produced it — like following a river upstream to find the source. The chain rule of calculus makes this efficient: each layer’s gradient depends on the gradients of the layer after it, so computing backward is naturally sequential from output to input.
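The chain rule is easiest to see on a tiny network. Here is a sketch with one weight per layer (the names `w1`, `w2`, `h` are illustrative): the error signal starts at the loss and each gradient is built from the gradient of the layer after it, exactly the backward sequence described above.

```python
# A minimal two-layer "network": x -> h -> y_pred, one weight per layer.
x = 2.0          # input
y_true = 1.0     # target
w1, w2 = 0.5, 0.3

# Forward pass (input -> output):
h = w1 * x                       # first layer output: 1.0
y_pred = w2 * h                  # second layer output: 0.3
loss = (y_pred - y_true) ** 2    # squared error: 0.49

# Backward pass (output -> input), chaining derivatives layer by layer:
dloss_dypred = 2 * (y_pred - y_true)  # dL/dy_pred, measured at the output
dloss_dw2 = dloss_dypred * h          # dL/dw2 = dL/dy_pred * dy_pred/dw2
dloss_dh = dloss_dypred * w2          # propagate the error one layer back
dloss_dw1 = dloss_dh * x              # dL/dw1 = dL/dh * dh/dw1

print(dloss_dw2, dloss_dw1)  # gradients only; no weight has changed yet
```

Notice that `dloss_dw1` reuses `dloss_dh`, which reuses `dloss_dypred` — this reuse is why computing backward is efficient, and why the gradients must be computed in output-to-input order.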
Step 3: Gradient Descent
Gradient descent uses the gradients from backpropagation to actually update the weights.
Analogy: You’re standing on a mountain in thick fog. You can’t see the valley floor (the optimal weights), but you can feel which direction the ground slopes beneath your feet (the gradient). Take a step downhill. Feel the slope again. Step again. Eventually, you reach the bottom.
New weight = Old weight - (learning rate × gradient)
The learning rate controls the step size:
- Too large → you overshoot the valley and bounce around
- Too small → you take tiny steps and training takes forever
- Just right → you converge to a good solution in reasonable time
Common learning rates: 0.001 to 0.01 for most tasks. But the right value depends on the problem, the architecture, and the optimizer.
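The update rule and the three learning-rate regimes can be demonstrated on a one-dimensional "valley." This sketch uses the toy loss `(w - 3)²`, whose gradient is `2(w - 3)` and whose minimum sits at `w = 3` (the loss and values are illustrative):

```python
def grad(w: float) -> float:
    """Gradient of the toy loss (w - 3)^2."""
    return 2 * (w - 3.0)

def descend(lr: float, steps: int = 20, w: float = 0.0) -> float:
    """Run plain gradient descent and return the final weight."""
    for _ in range(steps):
        w = w - lr * grad(w)  # new weight = old weight - (learning rate * gradient)
    return w

print(descend(lr=0.1))    # just right: converges near 3.0
print(descend(lr=0.001))  # too small: barely moves from 0 in 20 steps
print(descend(lr=1.1))    # too large: overshoots and bounces further out each step
```

With `lr=1.1` each step multiplies the distance to the valley floor by 1.2, so the weight diverges — the "bouncing around" described above, taken to its extreme.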
The Training Loop
Put it all together and training follows this cycle:
1. Forward pass: Input → prediction
2. Loss: Compare prediction to truth
3. Backpropagation: Calculate gradients for every weight
4. Gradient descent: Update weights to reduce loss
5. Repeat with next batch of data
One complete pass through the entire training dataset is called an epoch. Training typically runs for 10 to 100+ epochs — meaning the network sees every training example dozens of times, refining its weights each time.
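The five steps above fit in a dozen lines for a toy problem. This sketch learns `y = 2x` with a single weight (everything here — the data, learning rate, and epoch count — is illustrative; real training uses mini-batches rather than one example at a time):

```python
import random

data = [(x, 2.0 * x) for x in range(1, 11)]  # (input, target) pairs for y = 2x
w = random.uniform(-1, 1)                    # start from a random weight
lr = 0.01

for epoch in range(50):                      # each epoch = one full pass over the data
    random.shuffle(data)
    for x, y_true in data:                   # batch size of 1, for clarity
        y_pred = w * x                       # 1. forward pass
        loss = (y_pred - y_true) ** 2        # 2. loss: compare prediction to truth
        grad_w = 2 * (y_pred - y_true) * x   # 3. backpropagation (chain rule)
        w -= lr * grad_w                     # 4. gradient descent: update the weight
                                             # 5. repeat: the loops handle it

print(round(w, 3))  # close to 2.0 — the network "discovered" y = 2x
```

The random starting weight produces terrible predictions at first; the loop is what turns it into the right one.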
Optimizers: Smarter Gradient Descent
Plain gradient descent has limitations. Modern training uses optimizers that improve on the basic approach:
| Optimizer | How It Improves | Use Case |
|---|---|---|
| SGD (Stochastic Gradient Descent) | Updates after each small batch, not the full dataset | Baseline, simple tasks |
| Adam | Adapts learning rate per parameter automatically | Default choice for most tasks |
| AdamW | Adam with weight decay for better regularization | Transformer training |
Adam is the default in practice. It maintains separate learning rates for each parameter and adapts them based on the gradient history — faster where it’s safe, slower where it’s sensitive. You still set an initial learning rate, but Adam adjusts from there.
✅ Quick Check: Two networks are being trained. Network A has loss [3.2, 2.1, 1.4, 0.9, 0.5, 0.3] over 6 epochs. Network B has loss [3.2, 0.1, 3.8, 0.2, 4.1, 0.15] over 6 epochs. Which is training better? Network A — the loss decreases steadily. Network B oscillates wildly, suggesting the learning rate is too high. The weight updates are too large, causing the network to overshoot the optimum repeatedly. Network B needs a lower learning rate.
Epochs, Batches, and Iterations
These terms describe how training data is organized:
| Term | Meaning | Example (10,000 images, batch size 32) |
|---|---|---|
| Batch | A subset of training data processed together | 32 images at once |
| Iteration | One batch through forward pass + backprop + update | 1 of 313 iterations |
| Epoch | One complete pass through all training data | All 10,000 images (313 iterations) |
Why batches? Processing the entire dataset at once (batch gradient descent) is memory-intensive and slow. Processing one example at a time (stochastic) is noisy and inefficient. Mini-batches (typically 32-256 examples) balance memory efficiency, training speed, and gradient quality.
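The arithmetic from the table is worth spelling out — the last batch is usually smaller than the rest, which is why the count rounds up:

```python
import math

dataset_size = 10_000
batch_size = 32

# 312 full batches of 32 cover 9,984 images; one final batch of 16 covers the rest.
iterations_per_epoch = math.ceil(dataset_size / batch_size)
print(iterations_per_epoch)  # 313

# Over, say, 50 epochs, the network performs this many weight updates:
print(50 * iterations_per_epoch)  # 15,650
```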
Key Takeaways
- The loss function measures how wrong the prediction is — cross-entropy for classification, MSE for regression
- Backpropagation traces errors backward through layers, calculating each weight’s contribution to the loss
- Gradient descent updates weights in the direction that reduces loss — the learning rate controls step size
- The training loop: forward pass → loss → backpropagation → gradient descent → repeat
- Adam optimizer adapts learning rates per parameter — it’s the default choice for most training
- Training runs for multiple epochs, processing data in mini-batches (typically 32-256 examples)
Up Next
You now understand how networks make predictions and how they learn. Lesson 4 introduces the specialized architectures — CNNs for images, RNNs for sequences, and transformers for everything else. Each solves a specific problem that basic feedforward networks can’t handle.