LLM Evals & Testing

Token-F1 & ROUGE-style Overlap

Exact-match is brutal on anything longer than a label. Ask a model "Who painted the Mona Lisa and when?" and the gold is "Leonardo da Vinci, around 1503". The model says "It was painted by Leonardo da Vinci in roughly 1503." Exact-match scores that a zero, even though it is obviously a good answer. For free-text answers you need partial-credit scoring, and the two standard tools are token-F1 and ROUGE.

Token-F1 ignores word order and asks how much the two answers' vocabularies overlap. Lowercase each answer, split on whitespace, and compare the two resulting token sets:

precision = (tokens in both) / (tokens in the prediction) -> penalizes padding the answer with junk.
recall = (tokens in both) / (tokens in the gold) -> penalizes leaving out the gold's words.
F1 = the harmonic mean, 2 * P * R / (P + R) -> one number that punishes being weak on either side.

Identical strings score 1.0; answers that share no words score 0.0. Work one by hand: prediction "the cat ran", gold "the dog sat". Each side has three distinct tokens and the only shared one is "the", so precision is 1/3, recall is 1/3, and F1 = 2 * (1/3) * (1/3) / (2/3) = 1/3.
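The hand calculation above can be checked in a few lines. This is a minimal sketch for that one example only, using the same lowercase, whitespace-split token sets; the variable names are illustrative, and it has no guard for the empty or zero-overlap cases that token_f1 must handle.

```python
# Reproduce the hand-worked example with set-based precision, recall, and F1.
pred_tokens = set("the cat ran".lower().split())
gold_tokens = set("the dog sat".lower().split())

overlap = pred_tokens & gold_tokens          # tokens present on both sides
precision = len(overlap) / len(pred_tokens)  # 1 shared / 3 predicted
recall = len(overlap) / len(gold_tokens)     # 1 shared / 3 gold
f1 = 2 * precision * recall / (precision + recall)

print(sorted(overlap), round(f1, 3))
```

When precision and recall are equal, their harmonic mean is that same value, which is why all three numbers come out to 1/3 here.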

ROUGE-N is the standard metric for grading summarization. Instead of single words it counts overlapping n-grams (contiguous runs of n tokens) and reports recall: of the n-grams in the gold, how many appear in the prediction?

def ngrams(tokens, n):
    # Slide a window of width n over the tokens; returns [] when len(tokens) < n.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

For gold "the cat sat on" and prediction "the cat sat": the gold's four unigrams are "the", "cat", "sat", "on"; three appear in the prediction, so ROUGE-1 is 3/4. The gold's three bigrams are (the, cat), (cat, sat), (sat, on); two appear in the prediction, so ROUGE-2 is 2/3. ROUGE-2 is usually lower than ROUGE-1 because matching runs of words in order is harder than matching individual words, which is exactly why you report both.
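The two ROUGE numbers above can be reproduced with the ngrams helper. This is a sketch for this single example, assuming set-based matching as in the rest of the lesson; the scores dictionary is just an illustrative place to keep the results, and the code assumes the gold has at least one n-gram.

```python
def ngrams(tokens, n):
    # Contiguous runs of n tokens, as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

gold = "the cat sat on".lower().split()
pred = "the cat sat".lower().split()

scores = {}
for n in (1, 2):
    gold_grams = set(ngrams(gold, n))
    pred_grams = set(ngrams(pred, n))
    # Recall: share of the gold's n-grams that also occur in the prediction.
    scores[n] = len(gold_grams & pred_grams) / len(gold_grams)

print(scores)  # ROUGE-1 is 3/4, ROUGE-2 is 2/3
```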

What to build.

token_f1(pred, gold) -> the F1 over lowercase token sets; return 0.0 if either side is empty or there is no overlap (never divide by zero).
rouge_n(pred, gold, n) -> the fraction of the gold's n-grams that appear in the prediction; 0.0 when the gold has no n-grams.

How this differs from the library. For teaching we score over token sets, so a repeated word counts once. Production token-F1 (the SQuAD style) uses multiset counts, so duplicates are weighted, and real ROUGE adds ROUGE-L, which scores the longest common subsequence rather than fixed n-grams. The idea is the same; the production versions just count duplicates and word order more carefully.
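To make the difference concrete, here is a rough sketch of the two production ideas: multiset overlap for token-F1 and a longest-common-subsequence length of the kind ROUGE-L is built on. The names multiset_f1 and lcs_length are made up for illustration and are not any library's API. Real implementations also normalize text first (the SQuAD evaluation script, for example, strips punctuation and articles), which this sketch skips.

```python
from collections import Counter

def multiset_f1(pred, gold):
    # Hypothetical helper: token-F1 where repeated tokens count repeatedly.
    pred_tokens = pred.lower().split()
    gold_tokens = gold.lower().split()
    # Counter & Counter keeps, per token, the smaller of the two counts.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0  # also covers an empty prediction or an empty gold
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def lcs_length(a, b):
    # Hypothetical helper: longest common subsequence length via
    # row-by-row dynamic programming.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

# Padding with duplicates: the set version would score this 2/3.
print(multiset_f1("the the the", "the cat"))

# In-order match that tolerates a gap: "the cat on" is the common subsequence.
gold = "the cat sat on".split()
print(lcs_length(gold, "the cat on".split()) / len(gold))
```

With token sets, "the the the" against "the cat" has precision 1 and recall 1/2, so F1 is 2/3. With multisets the precision drops to 1/3 and F1 falls to 0.4, so repeating a correct word no longer comes for free.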


Your turn

Write token_f1(pred, gold) computing the F1 of the two lowercase token sets (precision = overlap / prediction-tokens, recall = overlap / gold-tokens, F1 = the harmonic mean), returning 0.0 when either side is empty or there is no overlap. Write rouge_n(pred, gold, n) returning the fraction of the gold's n-grams (contiguous n-token tuples) that appear in the prediction, returning 0.0 when the gold has no n-grams.
