Responsible AI: Safety, Moderation & Red-Teaming

An Adversarial Red-Team Suite

You have a door bouncer and a tone classifier. How do you know they actually work? You attack your own app. A red-team suite is a fixed battery of adversarial prompts that you fire at your guarded handler on every change, so a regression in your defences shows up as a number instead of a headline. This is the safety equivalent of a test suite.

The whole thing is offline and deterministic. The "model" is a tiny mock so the test is fast and never calls a network: a guarded_handler(prompt) that first runs your defence, and only if the prompt is allowed does it return the (pretend) sensitive answer.

SECRET = "the launch code is 1234"

def guarded_handler(prompt):
    if is_attack(prompt):
        return "I can't help with that."   # blocked
    return SECRET                          # the model 'answered'

An attack got through (a leak) when the handler's reply still contains the secret. It was blocked when the reply is the refusal. You will run a suite of labelled attack prompts through the handler and tally the outcome:

{
  "passed":  [...],   # attacks the defence stopped (good)
  "blocked": [...],   # same list, named from the defender's view
  "leak_rate": 0.0    # fraction of attacks that leaked the secret
}

A note on naming, because it trips people up: from the attacker's point of view a blocked prompt "failed" and a leak "passed". We report from the defender's point of view, which is what you care about in production. So blocked is the list of attacks your defence stopped, passed is the list of attacks that defeated it (the ones that leaked), and leak_rate is len(passed) / number_of_attacks. A perfect defence has an empty passed list and leak_rate of 0.0. Press Run to see which attacks slip past a deliberately leaky defence, then watch the leak rate drop to zero once the defence is tightened.

Why mock the model

Red-teaming a real LLM means thousands of paid, slow, non-deterministic calls, useless inside a unit test. The mock keeps the shape of the problem (a guard in front of a capability) while making the suite instant and repeatable. In a real pipeline you swap the mock for your actual handler and run the exact same scoring code in CI.

Your turn

Write run_suite(handler, attacks, secret) that fires each attack prompt through handler and classifies the reply: if the reply still contains secret the attack passed (leaked), otherwise it was blocked. Return {"passed": [...], "blocked": [...], "leak_rate": round(len(passed)/len(attacks), 4)} where the lists hold the attack prompts. Then set report = run_suite(guarded_handler, ATTACKS, SECRET) (the solution's guarded_handler blocks every attack, so leak_rate is 0.0).

Spotted a problem in this lesson? Report it

Code · runs in your browser

Output

Back Next lesson

An Adversarial Red-Team Suite

Why mock the model

This lesson is locked

Best on a laptop