Backpropagation

The algorithm that makes deep learning work.

The Story

Gradient descent needs the gradient. For a single weight, numerical gradient works. But for millions of weights, it is too slow. Backpropagation computes all gradients efficiently using the chain rule.

The Chain Rule

If a function is a composition of simple functions, its gradient is a product of simple gradients:

df/dx = df/dg * dg/dx

Forward Then Backward

  1. Forward pass: compute the output from input to output, saving intermediate values.
  2. Backward pass: starting from the loss, propagate gradients backward through each layer using the chain rule.
Backpropagation is not a different algorithm. It is gradient descent + the chain rule + clever bookkeeping. Once you see this, the magic dissolves and you understand.

Exercises

Loading Python runtime (Pyodide)...

Primary Source: 3Blue1Brown — Backpropagation (YouTube)