Lesson 189 of 239 · Responsible AI: Safety, Moderation & Red-Teaming

Calibrating Refusals: Over- vs Under-Refusal

Blocking attacks is only half of safety. The other half is not over-blocking. A model that refuses every slightly-spicy question is safe and useless, the safety equivalent of a smoke alarm that goes off when you make toast. You need to measure both failure modes on the same labelled set and tune the line between them.

Take a set of prompts, each labelled with what the app should do: "refuse" (genuinely harmful) or "answer" (benign). Your guard produces a risk score per prompt, and you refuse when the score reaches or crosses a threshold. Two errors can happen:

  • Over-refusal (a false positive): you refused a prompt you should have answered. Annoying, drives users away.
  • Under-refusal (a false negative): you answered a prompt you should have refused. Dangerous, the headline risk.
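The two error types can be counted directly. The following is a minimal sketch on made-up data; the scores, labels, and the helper name count_errors are assumptions for illustration, not part of the lesson's test set.

```python
# Hypothetical toy data: guard risk scores and the desired action per prompt.
scores = [0.95, 0.80, 0.40, 0.65, 0.10, 0.30]
labels = ["refuse", "refuse", "refuse", "answer", "answer", "answer"]


def count_errors(scores, labels, threshold):
    """Return (over_refusals, under_refusals) at the given threshold."""
    over = under = 0
    for score, label in zip(scores, labels):
        refused = score >= threshold
        if refused and label == "answer":
            over += 1   # false positive: a benign prompt was refused
        elif not refused and label == "refuse":
            under += 1  # false negative: a harmful prompt was answered
    return over, under


print(count_errors(scores, labels, 0.5))  # (1, 1)
```

At a threshold of 0.5 the benign prompt scored 0.65 is refused (one over-refusal) and the harmful prompt scored 0.40 is answered (one under-refusal).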

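Treating "refuse" as the positive class, a true positive (TP) is a harmful prompt you refused, a false positive (FP) is an over-refusal, and a false negative (FN) is an under-refusal. The standard metrics then summarise both failure modes: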

precision = TP / (TP + FP)   # of what I refused, how much truly needed refusing
recall    = TP / (TP + FN)   # of the truly-harmful, how much I caught
f1        = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
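As a worked example of the formulas, here is the arithmetic for one set of counts. The counts are hypothetical, chosen only to make the numbers easy to check by hand.

```python
# Hypothetical counts: 3 harmful prompts refused, 1 benign prompt refused,
# 2 harmful prompts answered.
tp, fp, fn = 3, 1, 2

precision = tp / (tp + fp)  # 3 / 4 = 0.75
recall = tp / (tp + fn)     # 3 / 5 = 0.6
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 4), round(recall, 4), round(f1, 4))  # 0.75 0.6 0.6667
```

F1 lands between the two values but closer to the lower one, which is why it penalises a guard that is strong on one metric and weak on the other.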

Low precision means you are over-refusing (lots of benign prompts in your refusals). Low recall means you are under-refusing (harmful prompts slipping through). They trade off as you move the threshold: lower it to refuse aggressively and recall climbs while precision tends to drop; raise it to refuse conservatively and the reverse happens. You will write refusal_metrics(scores, labels, threshold) and then best_threshold(scores, labels, candidates), which sweeps candidate thresholds and returns the one with the highest F1, the single number that refuses to let either error run away.
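The trade-off is easiest to see by evaluating a few thresholds side by side. This is a rough sketch on made-up data; precision_recall_at is a hypothetical helper, not the refusal_metrics function the exercise asks for, and it leaves out rounding and F1.

```python
# Hypothetical toy data, not the lesson's SCORES / LABELS.
scores = [0.95, 0.80, 0.40, 0.65, 0.10, 0.30]
labels = ["refuse", "refuse", "refuse", "answer", "answer", "answer"]


def precision_recall_at(scores, labels, threshold):
    """Return (precision, recall) with "refuse" as the positive class."""
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        refused = score >= threshold
        harmful = label == "refuse"
        if refused and harmful:
            tp += 1
        elif refused:
            fp += 1
        elif harmful:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Evaluate a low, a middle, and a high threshold.
sweep = {t: precision_recall_at(scores, labels, t) for t in (0.2, 0.5, 0.9)}
for t, (p, r) in sweep.items():
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

On this data the lowest threshold catches every harmful prompt but refuses two benign ones, while the highest threshold refuses only one prompt, correctly, and misses two harmful ones.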

Two correctness details the tests check. First, refuse when score >= threshold (greater-or-equal, so a score exactly on the line refuses). Second, when a metric's denominator is zero (for example, you refused nothing, so precision is undefined), return 0.0 rather than crashing. Your metrics must agree with scikit-learn's precision_score, recall_score, and f1_score called with pos_label="refuse" and zero_division=0, because you should never ship hand-rolled metrics that disagree with the tested library.
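Both details can be checked on a tiny case. The sketch below uses a hypothetical safe_div helper and made-up data; it mirrors the 0.0 fallback described above in plain Python and does not import scikit-learn.

```python
def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


# A score exactly on the threshold counts as a refusal because of >=.
on_the_line = 0.5 >= 0.5  # True

# Hypothetical data with a threshold above every score: nothing is refused.
scores = [0.2, 0.4]
labels = ["refuse", "answer"]
threshold = 0.9

pairs = list(zip(scores, labels))
tp = sum(s >= threshold and lab == "refuse" for s, lab in pairs)
fp = sum(s >= threshold and lab == "answer" for s, lab in pairs)
fn = sum(s < threshold and lab == "refuse" for s, lab in pairs)

precision = safe_div(tp, tp + fp)  # 0 / 0 falls back to 0.0
recall = safe_div(tp, tp + fn)     # 0 / 1 = 0.0
f1 = safe_div(2 * precision * recall, precision + recall)  # 0 / 0 falls back to 0.0

print(on_the_line, precision, recall, f1)  # True 0.0 0.0 0.0
```

Note that F1 needs the same guard as precision and recall, since its denominator is zero whenever both of them are.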

Your turn

Write refusal_metrics(scores, labels, threshold), where labels are "refuse"/"answer" and you predict refuse when score >= threshold. Treat "refuse" as positive and return a dict with the keys "precision", "recall", and "f1", each rounded to 4 places, with 0.0 for any zero-denominator metric (must match sklearn with zero_division=0). Then write best_threshold(scores, labels, candidates), returning the candidate with the highest F1 (the lowest threshold wins a tie), and set best = best_threshold(SCORES, LABELS, CANDIDATES).
