Backprop & Gradient Checking
You trained a toy model by nudging weights downhill. But where do the gradients come from, and how do you know they are right? This lesson answers both. The second answer, gradient checking, is one of the most reliable habits for telling a working neural net from a silently broken one.
Backprop is just the chain rule. Take one example through a tiny model: predict z = w * x, then measure squared error L = (z - y)**2. To improve w you need dL/dw. Compute it in two hops and multiply:
dL/dz = 2 * (z - y) # how loss moves with the prediction
dz/dw = x # how the prediction moves with w
dL/dw = dL/dz * dz/dw = 2 * (z - y) * x
That product IS backpropagation, in miniature: each layer multiplies the gradient flowing back by its own local derivative. Stack a million of these and you have a deep network.
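To make the two hops concrete, here is a small sketch that pushes one example through the chain rule by hand. The values of w, x, and y are made up for illustration and do not come from the lesson.

```python
# Worked example of the two-hop chain rule for z = w * x, L = (z - y)**2.
# The numbers below are illustrative assumptions, not lesson data.
w, x, y = 3.0, 2.0, 5.0

z = w * x              # forward pass: the prediction, 6.0
dL_dz = 2 * (z - y)    # how loss moves with the prediction: 2.0
dz_dw = x              # how the prediction moves with w: 2.0
dL_dw = dL_dz * dz_dw  # chain rule: multiply the two hops

print(dL_dw)  # 4.0
```

A positive gradient here means that increasing w increases the loss, so a downhill step would decrease w.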
Now, is your formula correct? Verify it numerically. A derivative measures how the output responds to a tiny nudge of the input, so estimate it with a central finite difference and check that the analytic gradient agrees:
numeric = (L(w + h) - L(w - h)) / (2 * h) # h tiny, e.g. 1e-5
assert abs(analytic - numeric) < 1e-6 # they should agree to many digits
If they disagree, your backward pass has a bug. Serious DL codebases gradient-check for exactly this reason.
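As a sketch of how this check catches a real mistake, the snippet below compares a correct gradient and a deliberately broken one against the same central difference. The helper names numeric_grad and buggy_grad, the test point, and the specific bug (a dropped factor of 2) are assumptions made for illustration.

```python
def loss(w, x, y):
    # Squared error of the tiny model z = w * x
    return (w * x - y) ** 2

def grad(w, x, y):
    # Analytic gradient dL/dw from the chain rule
    return 2 * (w * x - y) * x

def buggy_grad(w, x, y):
    # Hypothetical mistake: the factor of 2 is missing
    return (w * x - y) * x

def numeric_grad(w, x, y, h=1e-5):
    # Central finite difference in w
    return (loss(w + h, x, y) - loss(w - h, x, y)) / (2 * h)

w, x, y = 3.0, 2.0, 5.0  # illustrative point, not from the lesson
numeric = numeric_grad(w, x, y)

correct_passes = abs(grad(w, x, y) - numeric) < 1e-6
bug_is_caught = abs(buggy_grad(w, x, y) - numeric) > 1e-6

print(correct_passes, bug_is_caught)  # True True
```

The buggy version still returns a plausible-looking number with the right sign, which is why a comparison against the numeric estimate is more trustworthy than eyeballing the output.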
Build two functions. loss(w, x, y) returns (w*x - y)**2. grad(w, x, y) returns the analytic gradient dL/dw = 2*(w*x - y)*x. The hidden tests will gradient-check your grad against a finite difference of your own loss across many random points, so a wrong formula cannot pass. Press Run to see analytic vs numeric line up.
Write loss(w, x, y) returning the squared error (w*x - y)**2, and grad(w, x, y) returning its analytic gradient with respect to w, which by the chain rule is 2*(w*x - y)*x. Your grad will be gradient-checked against a central finite difference of your own loss at many random points, so it must be the true derivative, not an approximation or a constant.
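The hidden tests are not shown in this lesson, so the following is only a guess at their general shape: a loop that samples random points and compares grad against a central difference of loss. The function name check_grad, the sampling range, the trial count, the seed, and the tolerance rule are all assumptions.

```python
import random

def loss(w, x, y):
    return (w * x - y) ** 2

def grad(w, x, y):
    return 2 * (w * x - y) * x

def check_grad(loss_fn, grad_fn, trials=100, h=1e-5, tol=1e-6, seed=0):
    """Return True if grad_fn matches a central difference of loss_fn
    at every sampled point. All defaults are illustrative choices."""
    rng = random.Random(seed)
    for _ in range(trials):
        w, x, y = (rng.uniform(-3, 3) for _ in range(3))
        analytic = grad_fn(w, x, y)
        numeric = (loss_fn(w + h, x, y) - loss_fn(w - h, x, y)) / (2 * h)
        # Scale the tolerance by the gradient size, with a floor of 1
        # so that near-zero gradients are compared on an absolute scale.
        scale = max(1.0, abs(analytic), abs(numeric))
        if abs(analytic - numeric) > tol * scale:
            return False
    return True

print(check_grad(loss, grad))                     # True
print(check_grad(loss, lambda w, x, y: 0.0))      # False
```

Sampling many points matters because a wrong formula can agree with the true derivative at a few special inputs, for example wherever w * x equals y and the gradient is zero.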