How Neural Networks Actually Learn
A visual deep dive into the mechanics of neural network training — from forward passes to backpropagation, explained with animated visuals.
1B+ parameters — modern LLMs train billions of weights through this exact process.
The Forward Pass
Training a neural network starts with the forward pass. Input data flows through layers of neurons, each applying a weighted sum followed by an activation function. The output is a prediction — which is usually wrong at first.

Each neuron receives inputs, multiplies them by learned weights, adds a bias, and passes the result through a non-linear activation function like ReLU or sigmoid. This non-linearity is what gives neural networks their power to learn complex patterns.
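The per-neuron computation described above can be sketched in a few lines of plain Python. This is a toy illustration, not a real framework — the inputs, weights, and biases are made-up values:

```python
def relu(x):
    # non-linear activation: pass positives through, clamp negatives to zero
    return max(0.0, x)

def neuron(inputs, weights, bias, activation=relu):
    # weighted sum of inputs plus bias, then the non-linearity
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# toy forward pass: one hidden neuron feeding one output neuron
x = [0.5, -1.2]
h = neuron(x, weights=[0.8, 0.3], bias=0.1)   # hidden activation
y = neuron([h], weights=[1.5], bias=-0.2)     # the network's prediction
```

Without `relu` (or some other non-linearity), stacking these neurons would collapse into a single linear function — which is exactly why the activation matters.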
Loss Calculation
Once the network produces a prediction, we compare it to the actual target using a loss function. Common choices include Mean Squared Error for regression and Cross-Entropy for classification.
- MSE (Mean Squared Error) — regression loss; penalizes large errors quadratically
- CE (Cross-Entropy) — classification loss; measures the distance between the predicted and true probability distributions
- MAE (Mean Absolute Error) — robust loss; less sensitive to outliers
The loss quantifies how wrong the model is. A high loss means the prediction is far from the truth. The goal of training is to minimize this loss — and that's where backpropagation comes in.
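The two most common losses mentioned above are short enough to write by hand. A minimal sketch with made-up predictions and targets:

```python
import math

def mse(preds, targets):
    # mean of squared differences — penalizes large errors quadratically
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

def cross_entropy(probs, target_index):
    # negative log-probability the model assigned to the correct class:
    # confident-and-right -> near 0, confident-and-wrong -> large
    return -math.log(probs[target_index])

mse([2.0, 0.0], [1.0, 0.0])        # one prediction off by 1, one exact
cross_entropy([0.7, 0.2, 0.1], 0)  # model puts 70% on the true class
```

Note how cross-entropy only looks at the probability given to the true class; pushing that probability toward 1 is what drives the loss toward 0.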
Backpropagation — The Learning Engine
Backpropagation is the algorithm that actually teaches the network. It computes the gradient of the loss with respect to every weight in the network by applying the chain rule of calculus — working backwards from the output to the input.

Backpropagation is simply the chain rule applied recursively through a computational graph. Nothing more, nothing less.
These gradients tell each weight how much it contributed to the error and in which direction it should change. Weights that caused more error get larger updates.
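For a model small enough to fit in your head, you can apply the chain rule by hand and see exactly what backpropagation computes. A one-weight sketch (values are arbitrary):

```python
# one-weight model: y = w*x + b, loss L = (y - t)^2
x, t = 2.0, 2.0   # input and target
w, b = 0.5, 0.0   # current parameters

# forward pass
y = w * x + b           # prediction: 1.0
loss = (y - t) ** 2     # loss: 1.0

# backward pass — chain rule, working from the loss back to the weights
dL_dy = 2 * (y - t)     # derivative of the loss w.r.t. the prediction
dy_dw = x               # local derivative of the prediction w.r.t. w
dy_db = 1.0             # local derivative of the prediction w.r.t. b
dL_dw = dL_dy * dy_dw   # chain rule: how much w contributed to the error
dL_db = dL_dy * dy_db
```

The sign of `dL_dw` tells `w` which direction to move; its magnitude reflects how much this weight contributed to the error, matching the intuition above.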
Gradient Descent — Taking Steps
With gradients computed, the optimizer updates all weights by taking a small step in the direction that reduces the loss. The learning rate controls how big each step is — too large and you overshoot, too small and training takes forever.

- SGD — The classic optimizer, simple but effective with momentum
- Adam — Adaptive learning rates per parameter, the go-to default
- AdamW — Adam with decoupled weight decay, preferred for transformers
- Learning rate schedulers — Warm up then decay for stable training
The Training Loop
These steps — forward pass, loss calculation, backpropagation, and weight update — repeat thousands or millions of times. Each cycle through the entire dataset is called an epoch. Over time, the network's predictions improve dramatically.
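All four steps can be tied together in a toy training loop. This fits a one-weight linear model `y = w*x` to made-up data where the true relationship is `y = 2x`:

```python
# toy dataset: three (input, target) pairs following y = 2x
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w, lr = 0.0, 0.05

for epoch in range(200):          # each full pass over the dataset is an epoch
    for x, t in data:
        y = w * x                 # 1. forward pass
        loss = (y - t) ** 2       # 2. loss calculation
        grad = 2 * (y - t) * x    # 3. backpropagation (chain rule by hand)
        w -= lr * grad            # 4. weight update (gradient descent)

# after training, w has converged very close to the true value 2.0
```

Starting from `w = 0.0`, the predictions are badly wrong at first, and each cycle nudges `w` toward 2.0 — a miniature version of what happens across billions of weights in a large model.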

Modern deep learning frameworks like PyTorch and TensorFlow automate the gradient computation entirely. You define the forward pass, and the framework handles backpropagation automatically through automatic differentiation (autograd).
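To demystify what autograd is doing, here is a stripped-down sketch of reverse-mode automatic differentiation on scalars. This is not PyTorch's actual implementation — just the core idea: each operation records its inputs and a local chain-rule step, and `backward()` replays those steps from the output to the inputs:

```python
class Scalar:
    """A value that remembers how it was computed, so gradients can flow back."""

    def __init__(self, value):
        self.value = value
        self.grad = 0.0
        self._parents = ()
        self._backward = lambda: None

    def __add__(self, other):
        out = Scalar(self.value + other.value)
        out._parents = (self, other)
        def _backward():
            # d(a+b)/da = 1, d(a+b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Scalar(self.value * other.value)
        out._parents = (self, other)
        def _backward():
            # d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.value * out.grad
            other.grad += self.value * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # sort the computational graph topologically, then apply each
        # node's chain-rule step in reverse order — this IS backpropagation
        order, seen = [], set()
        def visit(node):
            if node not in seen:
                seen.add(node)
                for p in node._parents:
                    visit(p)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()

# define only the forward pass; gradients come for free
w, x = Scalar(3.0), Scalar(2.0)
loss = w * x + w        # d(loss)/dw = x + 1 = 3, d(loss)/dx = w = 3
loss.backward()
```

After `loss.backward()`, both `w.grad` and `x.grad` hold the exact derivatives — the same pattern PyTorch's `loss.backward()` follows, just over tensors instead of scalars.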
The most important skill in deep learning is not math — it is the ability to iterate quickly, experiment relentlessly, and debug fearlessly.