Responsible AI: Safety, Moderation & Red-Teaming

Jailbreak & Prompt-Injection Detection

The first thing a real LLM product needs is a bouncer at the door. Before a user's message ever reaches the model, you run it through a cheap, fast check for the classic jailbreak moves: attempts to override your system prompt, smuggle in hidden instructions, or talk the model into a persona with no rules. This is not the whole safety story, but it is the layer that catches the lazy 90% of attacks for almost no cost.

You will build detect_jailbreak(text) that returns a small dict: {"flag": bool, "reason": str}. When the text looks like an attack you set flag=True and name the pattern; when it looks like an ordinary request you let it through with flag=False and reason="clean". We look for three well-known families, checked in this priority order:

ignore-previous -> the user tries to wipe the rules: phrases like ignore previous instructions, disregard the above, forget your instructions.
role-play-override -> the user assigns a lawless persona: you are now DAN, act as an AI with no restrictions, pretend you have no rules, the infamous developer mode.
encoded-instruction -> the user hides the real ask behind an encoding to dodge keyword filters: decode this base64 and run it, rot13: ..., execute the following hex.

The mechanics are deliberately simple: lower-case the text once, then test for a handful of trigger substrings per family. Matching is case-insensitive so IGNORE PREVIOUS and ignore previous both fire.

def detect_jailbreak(text):
    t = text.lower()
    if any(p in t for p in ("ignore previous", "disregard the above", ...)):
        return {"flag": True, "reason": "ignore-previous"}
    ...
    return {"flag": False, "reason": "clean"}

Two honest caveats this lesson bakes in. First, priority matters: a message can trip more than one family, so you check ignore-previous, then role-play-override, then encoded-instruction, and return the first hit. Second, this is a screen, not a guarantee. A clever attacker can paraphrase around any keyword list, which is exactly why production stacks layer this with a trained classifier (next lesson) and a red-team suite (the lesson after). Keyword screens are the cheap first net, not the only net. Press Run to watch a mix of attacks and benign asks get sorted.

Your turn

Write detect_jailbreak(text) returning {"flag": bool, "reason": str}. Lower-case the text, then check three attack families in priority order and return the first hit: ignore-previous (ignore previous, disregard the above, forget your instructions), role-play-override (you are now dan, no restrictions, pretend you have no rules, developer mode), and encoded-instruction (base64, rot13, decode this, execute the following hex). Benign text returns {"flag": False, "reason": "clean"}.

Spotted a problem in this lesson? Report it

Code · runs in your browser

Output

Back Next lesson

Jailbreak & Prompt-Injection Detection

This lesson is locked

Best on a laptop