Neural-Net Intuition, LLMs & AI Capstone

Self-Attention from Scratch

You have built a neuron and a forward pass. Now you will build the one idea that turned plain neural nets into transformers and, from there, into the LLMs behind every chat assistant: self-attention. It is a handful of matrix multiplies, and you can do the whole thing in numpy.

The intuition

Imagine each word in a sentence wants to gather information from the other words. Self-attention gives every position three vectors:

  • A query (Q): what this position is looking for.
  • A key (K): what each position offers, used for matching.
  • A value (V): the actual information each position carries.

A position compares its query against every key (its own included) to decide how much to listen to each position, then takes a weighted mix of the values using those amounts. Strong match -> big weight -> that value contributes more. That is the entire mechanism: queries attend to keys, and the answer is a weighted blend of values.
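To make the matching concrete, here is a minimal sketch with one query and three keys. The vectors are made up purely for illustration, and the softmax step that turns scores into weights is explained in the next section.

```python
import numpy as np

# Hypothetical toy vectors, chosen only to illustrate the idea.
q = np.array([1.0, 0.0])                 # one query
K = np.array([[1.0, 0.0],                # key 0: points the same way as q
              [0.0, 1.0],                # key 1: orthogonal to q
              [-1.0, 0.0]])              # key 2: points the opposite way
V = np.array([[10.0], [20.0], [30.0]])   # one value row per key

scores = K @ q                                     # dot product of q with every key
weights = np.exp(scores) / np.exp(scores).sum()    # positive, sums to 1
blend = weights @ V                                # weighted mix of the values

print(scores)
print(weights.round(3))
print(blend.round(3))
```

Key 0 matches the query best, so it receives the largest weight (about 0.67) and its value dominates the blend.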

The math: scaled dot-product attention

The matching score between a query and a key is their dot product (bigger when they point the same way). Stack all queries into a matrix Q and all keys into K, and every score at once is Q @ K.T. We divide by sqrt(d_k) (the key dimension) so the scores do not blow up as vectors get longer:

scores = Q @ K.T / np.sqrt(d_k)

Why divide at all? A dot product sums d_k terms, so its typical size grows with the dimension. If d_k is 64 and the entries of the query and key are random with mean 0 and variance 1, the raw scores have a standard deviation of sqrt(64) = 8, and in the extreme case where every entry is near 1 and aligned, a score reaches 64. Scores that large push softmax to put almost all of its weight on a single key, where its gradient is close to zero. Dividing by sqrt(d_k), here 8, brings the typical score back to around 1 so attention stays smooth and trainable. In the tiny d_k = 1 example you run below the divisor is just 1, so you will not see it bite, but at real model sizes it is essential.
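One way to see the effect is to measure it. The sketch below assumes the entries of Q and K are random with mean 0 and variance 1 (an assumption made here for illustration, not a property of any particular model) and compares the spread of the scores before and after scaling.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed so the run is repeatable
d_k = 64
Q = rng.standard_normal((1000, d_k))     # assumed: entries with mean 0, variance 1
K = rng.standard_normal((1000, d_k))

raw = Q @ K.T                            # unscaled scores
scaled = raw / np.sqrt(d_k)              # scaled scores

raw_std = raw.std()
scaled_std = scaled.std()
print(f"raw std:    {raw_std:.2f}")      # close to sqrt(64) = 8
print(f"scaled std: {scaled_std:.2f}")   # close to 1
```

Changing d_k and rerunning should show the raw spread tracking sqrt(d_k) while the scaled spread stays near 1.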

Each row of scores holds one query's scores against every key. We turn each row into weights that are positive and sum to 1 with a softmax over the last axis:

weights = softmax(scores, axis=-1)   # each row sums to 1

Softmax exponentiates and normalizes. Done naively, np.exp of a large score overflows to inf and the weights come out as nan. The fix is the standard numerically stable trick: subtract each row's max before exponentiating. It does not change the result mathematically (the same constant cancels top and bottom), but it keeps every exponent <= 0:

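import numpy as np

def softmax(x, axis=-1):
    # Stability: after subtracting the max along the axis, the largest exponent is 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)   # entries along the axis now sum to 1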
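The overflow is easy to reproduce. In this sketch the names naive_softmax and stable_softmax are hypothetical, introduced only to put the two versions side by side on a row of deliberately large scores.

```python
import numpy as np

def naive_softmax(x):
    e = np.exp(x)                          # overflows to inf for large x
    return e / e.sum(axis=-1, keepdims=True)

def stable_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # shift so the largest entry is 0
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

big = np.array([[1000.0, 1001.0, 1002.0]])
small = big - 1000.0                       # the same row shifted down: [0, 1, 2]

# Silence the overflow warnings so the nan result is easy to see.
with np.errstate(over="ignore", invalid="ignore"):
    bad = naive_softmax(big)               # inf / inf gives nan

good = stable_softmax(big)
print(bad)
print(good.round(3))
```

The stable version gives the same weights for big as the naive version gives for small, which is the cancellation described above in action.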

Finally, each query's output is its weights applied to the values, which is one more matrix multiply:

output = weights @ V

Put together, the famous formula is just softmax(Q @ K.T / sqrt(d_k)) @ V. If Q is (n, d_k), K is (m, d_k), and V is (m, d_v), then weights is (n, m) and output is (n, d_v): one output vector per query, each a blend of the value rows.
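A quick way to check those shapes is to run the formula on random arrays. The function name attention_sketch and the sizes n, m, d_k, d_v below are arbitrary choices for illustration.

```python
import numpy as np

def attention_sketch(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # (n, m)
    scores = scores - scores.max(axis=-1, keepdims=True)   # stable softmax shift
    e = np.exp(scores)
    weights = e / e.sum(axis=-1, keepdims=True)            # (n, m), rows sum to 1
    return weights @ V, weights                            # (n, d_v), (n, m)

rng = np.random.default_rng(0)
n, m, d_k, d_v = 2, 3, 4, 5                # arbitrary sizes for illustration
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((m, d_k))
V = rng.standard_normal((m, d_v))

output, weights = attention_sketch(Q, K, V)
print(weights.shape, output.shape)         # (2, 3) (2, 5)
```

Note that n and m need not be equal: the number of queries sets the number of output rows, and the number of keys must only match the number of values.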

This is the whole transformer trick

A real transformer does not stop at one attention computation. Multi-head attention runs several of these in parallel (each head learns its own Q, K, V projections, so different heads attend to different things) and concatenates the results. A transformer block then combines attention with a small feed-forward network, wrapping each in a residual connection and layer normalization, and a model stacks dozens of those blocks. But the beating heart of every one of them is exactly the function you are about to write: scores, stable softmax, weighted sum of values. Build this and you understand the core operation an LLM repeats when it "pays attention" to the prompt.
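As a rough sketch of the multi-head idea, the code below runs the same attention computation once per head and concatenates the outputs. The sizes are hypothetical, random matrices stand in for each head's learned projections, the helper name attend is made up for this example, and the final output projection that real models apply after concatenation is left out.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend(Q, K, V):
    # Scaled dot-product attention, returning only the output.
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 4, 8, 2           # hypothetical sizes
d_head = d_model // n_heads
X = rng.standard_normal((n_tokens, d_model))   # stand-in token embeddings

head_outputs = []
for _ in range(n_heads):
    # Random stand-ins for this head's learned W_q, W_k, W_v.
    W_q = rng.standard_normal((d_model, d_head))
    W_k = rng.standard_normal((d_model, d_head))
    W_v = rng.standard_normal((d_model, d_head))
    head_outputs.append(attend(X @ W_q, X @ W_k, X @ W_v))

multi_head = np.concatenate(head_outputs, axis=-1)   # (n_tokens, d_model)
print(multi_head.shape)                              # (4, 8)
```

Each head works in a smaller d_head-dimensional space, so concatenating n_heads of them restores the original width d_model.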

One honest caveat: here Q, K and V are handed to you as fixed arrays. In a real model they are not given, they are produced by multiplying each token's embedding by three learned weight matrices, W_q, W_k and W_v. Those matrices, and the ones in every block, are exactly what training adjusts using the gradient descent from the last lesson. So what you build here is the fixed core of attention that is identical in every transformer; the learning lives in the projections that feed it.
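The projection step can be sketched in a few lines. Here random matrices stand in for the learned W_q, W_k and W_v, and the embedding matrix X and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_k, d_v = 3, 6, 4, 4       # hypothetical sizes
X = rng.standard_normal((n_tokens, d_model))   # one embedding row per token

# Random stand-ins for the learned matrices; training would adjust these.
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_v))

Q = X @ W_q    # (n_tokens, d_k)
K = X @ W_k    # (n_tokens, d_k)
V = X @ W_v    # (n_tokens, d_v)
print(Q.shape, K.shape, V.shape)               # (3, 4) (3, 4) (3, 4)
```

Because all three are computed from the same X, every token is both asking and answering, which is what makes this self-attention.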

Press Run to see attention turn a tiny Q/K/V into an output and print the weight matrix so you can read off who attended to whom.

Your turn

Implement scaled dot-product attention in numpy. Write attention(Q, K, V) that returns a tuple (output, weights):

  • Compute scores = Q @ K.T / np.sqrt(d_k), where d_k is the last dimension of Q.
  • Apply a numerically stable softmax over the last axis (subtract each row's max before np.exp) to get weights whose rows sum to 1.
  • Compute output = weights @ V.
  • Return output first, then the weights matrix.

Then set out, w = attention(Q, K, V) for the provided arrays.
