Neural-Net Intuition, LLMs & AI Capstone

How Models Learn

A fresh network has random weights, so its predictions are no better than guesses. Training is the process of nudging those weights until the predictions get good. Two ideas drive it.

1. A loss function measures wrongness

You need a single number that says how bad the predictions are. For regression, the go-to is mean squared error (MSE): average the squared gaps between truth and prediction.

import numpy as np

def mse(y_true, y_pred):
    # Element-wise gap between truth and prediction
    diffs = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.mean(diffs ** 2))

Squaring punishes big misses harder and keeps every term non-negative, so errors in opposite directions cannot cancel out. Lower loss is better; zero loss means a perfect fit.
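
To make "punishes big misses harder" concrete, here is a small sketch with made-up numbers (the values are illustrative, not from the lesson). Both imperfect predictions are off by a total of 3, but one spreads the error over three small misses and the other puts it all in a single miss.

```python
import numpy as np

def mse(y_true, y_pred):
    diffs = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.mean(diffs ** 2))

y_true = [3.0, 5.0, 7.0]  # made-up targets

perfect = mse(y_true, [3.0, 5.0, 7.0])        # no gaps at all
small_misses = mse(y_true, [4.0, 6.0, 8.0])   # three misses of 1 each
one_big_miss = mse(y_true, [3.0, 5.0, 10.0])  # a single miss of 3

print(perfect, small_misses, one_big_miss)  # 0.0 1.0 3.0
```

The single big miss costs three times as much as the three small ones, even though the total absolute error is the same.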

2. Gradient descent rolls downhill

Picture the loss as a hilly surface and the weights as your position on it. The gradient is the slope: it points uphill, toward more loss. So to reduce loss, you take a small step in the opposite direction.

w = w - learning_rate * gradient

The learning rate is the step size: too large and you overshoot the valley, too small and progress crawls. Repeat this update thousands of times and the weights roll down toward a low-loss valley. That loop is the core of training.
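
As a minimal sketch of that loop, assume a hypothetical one-dimensional loss surface, loss(w) = (w - 3)^2, whose lowest point sits at w = 3. The functions loss and gradient, the starting point, and the learning rate are all invented for illustration.

```python
# Hypothetical loss surface with its valley at w = 3 (not a real model).
def loss(w):
    return (w - 3.0) ** 2

def gradient(w):
    return 2.0 * (w - 3.0)  # slope of the loss at w; positive means uphill to the right

w = 0.0               # arbitrary starting position
learning_rate = 0.1   # step size
history = [loss(w)]   # loss recorded before any step

for step in range(50):
    w = w - learning_rate * gradient(w)  # step against the slope
    history.append(loss(w))

print(w, history[0], history[-1])
```

Every step shrinks the distance to the valley by the same factor here, so the recorded loss falls on each iteration and w ends up very close to 3.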

For a toy model y = w * x with MSE, the loss is mean((w*x - y)**2), so by the chain rule the gradient of the loss with respect to w works out to mean(2 * (w*x - y) * x). In the exercise below you will take exactly one step and watch the loss drop.
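
One way to convince yourself the gradient formula is right is to compare it with a numerical slope: nudge w slightly in both directions and measure how the loss changes. The sketch below does that on made-up data; x, y, the starting w, and the helper loss_at are all assumptions for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])  # made-up data generated by y = 2 * x
w = 0.5                        # deliberately wrong starting weight

def loss_at(w):
    # MSE of the toy model y = w * x at a given weight
    return float(np.mean((w * x - y) ** 2))

# Slope from the formula in the text
analytic = float(np.mean(2 * (w * x - y) * x))

# Slope measured numerically with a central difference
eps = 1e-6
numeric = (loss_at(w + eps) - loss_at(w - eps)) / (2 * eps)

print(analytic, numeric)
```

The two values should agree to several decimal places. The slope is negative because w is below the value that fits the data, so stepping against the gradient increases w.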

From one weight to a whole network

The two-layer network you built in the last lesson has many weights, not one. Training it means finding the gradient of the loss with respect to every weight, layer by layer. That is backpropagation: start at the loss and apply the chain rule backwards through each layer to get every weight's slope, then take the same downhill step on all of them at once. The arithmetic is heavier, but the idea is identical to the single step you take below. In real code you rarely hand-derive it: frameworks like PyTorch record the operations of the forward pass, and a single loss.backward() call computes every gradient for you. We do the one-weight version here so the mechanism stays concrete.
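
For the curious, here is a hand-written sketch of backpropagation in NumPy. It assumes a hypothetical tiny network (2 inputs, 3 tanh hidden units, 1 output), which is not necessarily the architecture from the last lesson, and uses random made-up data. The names forward, backward, W1, and W2 are illustrative.

```python
import numpy as np

# Hypothetical setup: random data and small random weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))         # 8 made-up samples, 2 features each
y = rng.normal(size=(8, 1))         # made-up targets
W1 = rng.normal(size=(2, 3)) * 0.5  # input -> hidden weights
W2 = rng.normal(size=(3, 1)) * 0.5  # hidden -> output weights

def forward(W1, W2):
    h = np.tanh(X @ W1)             # hidden activations
    pred = h @ W2                   # network output
    loss = float(np.mean((pred - y) ** 2))
    return h, pred, loss

def backward(W1, W2):
    h, pred, _ = forward(W1, W2)
    d_pred = 2 * (pred - y) / y.size  # slope of MSE w.r.t. each prediction
    grad_W2 = h.T @ d_pred            # chain rule through the output layer
    d_h = d_pred @ W2.T               # pass the slope back to the hidden layer
    d_pre = d_h * (1 - h ** 2)        # chain rule through tanh
    grad_W1 = X.T @ d_pre             # chain rule through the first layer
    return grad_W1, grad_W2

lr = 0.01
_, _, loss_before = forward(W1, W2)
grad_W1, grad_W2 = backward(W1, W2)

# The same downhill step as the one-weight case, applied to every weight.
W1_new = W1 - lr * grad_W1
W2_new = W2 - lr * grad_W2
_, _, loss_after = forward(W1_new, W2_new)

print(loss_before, loss_after)
```

With a small learning rate like this, the loss after the step should come out slightly lower than before. A framework's automatic differentiation would produce the same gradients without the hand-written backward function.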

Your turn

Implement mse(y_true, y_pred) (mean of squared differences). Then, for the toy model y = w * x, compute loss_before, take one gradient-descent step using grad = np.mean(2 * (w*x - y) * x) and w = w - lr * grad, then compute loss_after. The loss must go down.
