LLM Evals & Testing

Accuracy vs Precision/Recall/F1 for Classification

A huge share of LLM work is really classification: route this ticket, label this sentiment, flag this message as spam. For those, accuracy alone is a trap. Build a spam filter on a mailbox that is 95% ham and a model that labels everything ham scores 95% accuracy while catching zero spam. The number looks great; the product is useless. To see that, you need precision, recall, and F1 broken out for the class you actually care about.
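The mailbox example can be reproduced in a few lines of plain Python. This is a minimal sketch: the 100-message mailbox and the variable names are assumptions made for illustration, not part of the lesson's starter code.

```python
# Hypothetical mailbox: 95 ham and 5 spam, matching the 95% figure above.
y_true = ["ham"] * 95 + ["spam"] * 5
# A "model" that labels every message ham.
y_pred = ["ham"] * 100

# Accuracy counts every correct prediction, regardless of class.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# How many spam messages were actually caught?
spam_caught = sum(t == "spam" and p == "spam" for t, p in zip(y_true, y_pred))

print(accuracy)     # 0.95
print(spam_caught)  # 0
```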

Pick a positive class (say spam) and count four buckets:

TP true positives: predicted spam, was spam.
FP false positives: predicted spam, was ham (a good email sent to junk).
FN false negatives: predicted ham, was spam (spam that slipped through).
TN true negatives: predicted ham, was ham.

precision = TP / (TP + FP)   # of what I flagged, how much was right
recall    = TP / (TP + FN)   # of the real positives, how many I caught
f1        = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
accuracy  = correct / total  # all classes, all predictions
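To make the formulas concrete, here is a worked example on made-up counts. The values TP=3, FP=1, FN=2, TN=94 are hypothetical, chosen only so the arithmetic is easy to follow by hand.

```python
# Hypothetical confusion counts for the positive class "spam".
tp, fp, fn, tn = 3, 1, 2, 94

precision = tp / (tp + fp)                    # 3 / 4
recall = tp / (tp + fn)                       # 3 / 5
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)    # correct / total

print(round(precision, 4))  # 0.75
print(round(recall, 4))     # 0.6
print(round(f1, 4))         # 0.6667
print(round(accuracy, 4))   # 0.97
```

Accuracy comes out at 0.97 almost entirely because of the 94 true negatives, while F1 for the spam class is only 0.6667.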

These pull in different directions. Flag aggressively and recall rises while precision falls; flag conservatively and the reverse. F1 is the harmonic mean of the two, so it stays low whenever either one collapses. On an imbalanced set, accuracy and F1 can diverge sharply, which is the whole point of computing them separately.
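The trade-off is easiest to see by sweeping a decision threshold over model scores. The sketch below assumes a classifier that outputs a spam score per message; the scores, labels, and the helper name precision_recall are all invented for illustration.

```python
# Hypothetical spam scores from some model, with the true labels alongside.
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10]
is_spam = [True, True, False, True, False, True, False, False]

def precision_recall(threshold):
    """Flag a message as spam when its score is >= threshold."""
    flagged = [s >= threshold for s in scores]
    tp = sum(f and y for f, y in zip(flagged, is_spam))
    fp = sum(f and not y for f, y in zip(flagged, is_spam))
    fn = sum((not f) and y for f, y in zip(flagged, is_spam))
    # Guard the zero-denominator cases by reporting 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return round(precision, 4), round(recall, 4)

for threshold in (0.85, 0.50, 0.25):
    print(threshold, precision_recall(threshold))
# 0.85 (1.0, 0.5)      conservative: every flag is right, half the spam is missed
# 0.5 (0.75, 0.75)
# 0.25 (0.6667, 1.0)   aggressive: all spam caught, a third of the flags are wrong
```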

You will implement these by hand because the formulas are worth knowing cold for interviews. But you should never ship hand-rolled metrics when a tested library exists, so the tests also check your numbers against scikit-learn's precision_score, recall_score, and f1_score. If you match sklearn, you implemented it right.

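from sklearn.metrics import precision_score, recall_score, f1_score
y_true = ["spam", "ham", "spam", "ham"]   # toy labels so the call runs as written
y_pred = ["spam", "spam", "ham", "ham"]
precision_score(y_true, y_pred, pos_label="spam")  # recall_score and f1_score take the same arguments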

What to build. classification_metrics(y_true, y_pred, positive_label) returning a dict {"accuracy", "precision", "recall", "f1"}, each rounded to 4 places. Guard the zero-denominator cases (no predicted positives, no actual positives) by returning 0.0 for that metric instead of dividing by zero. Press Run to print a report and watch accuracy and F1 disagree on an imbalanced set.
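One way to handle the zero-denominator guard is to route every ratio through a small helper. This is a sketch of one possible structure, not the lesson's reference solution; safe_ratio is a hypothetical name. Returning 0.0 also lines up with scikit-learn's default behaviour (zero_division="warn"), which reports 0.0 for an undefined metric and emits a warning.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0

# Counts for a model that never predicts the positive class (5 real positives).
tp, fp, fn = 0, 0, 5

precision = safe_ratio(tp, tp + fp)   # no predicted positives -> 0.0
recall = safe_ratio(tp, tp + fn)      # 0 / 5 -> 0.0
# F1 needs its own guard: precision + recall is zero here.
f1 = safe_ratio(2 * precision * recall, precision + recall)

print(precision, recall, f1)  # 0.0 0.0 0.0
```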

Your turn

Write classification_metrics(y_true, y_pred, positive_label) returning {'accuracy', 'precision', 'recall', 'f1'} for the chosen positive class, each rounded to 4 places. Count TP/FP/FN against positive_label; precision = TP/(TP+FP), recall = TP/(TP+FN), f1 = the harmonic mean, accuracy = correct/total. Return 0.0 for any metric whose denominator is zero. Your numbers must match scikit-learn's precision_score/recall_score/f1_score.
