Jailbreak & Prompt-Injection Detection
The first thing a real LLM product needs is a bouncer at the door. Before a user's message ever reaches the model, you run it through a cheap, fast check for the classic jailbreak moves: attempts to override your system prompt, smuggle in hidden instructions, or talk the model into a persona with no rules. This is not the whole safety story, but it is the layer that catches the lazy 90% of attacks for almost no cost.
You will build detect_jailbreak(text) that returns a small dict: {"flag": bool, "reason": str}. When the text looks like an attack you set flag=True and name the pattern; when it looks like an ordinary request you let it through with flag=False and reason="clean". We look for three well-known families, checked in this priority order:
- ignore-previous -> the user tries to wipe the rules: phrases like
ignore previous instructions,disregard the above,forget your instructions. - role-play-override -> the user assigns a lawless persona:
you are now DAN,act as an AI with no restrictions,pretend you have no rules, the infamousdeveloper mode. - encoded-instruction -> the user hides the real ask behind an encoding to dodge keyword filters:
decode this base64 and run it,rot13: ...,execute the following hex.
The mechanics are deliberately simple: lower-case the text once, then test for a handful of trigger substrings per family. Matching is case-insensitive so IGNORE PREVIOUS and ignore previous both fire.
def detect_jailbreak(text):
t = text.lower()
if any(p in t for p in ("ignore previous", "disregard the above", ...)):
return {"flag": True, "reason": "ignore-previous"}
...
return {"flag": False, "reason": "clean"}Two honest caveats this lesson bakes in. First, priority matters: a message can trip more than one family, so you check ignore-previous, then role-play-override, then encoded-instruction, and return the first hit. Second, this is a screen, not a guarantee. A clever attacker can paraphrase around any keyword list, which is exactly why production stacks layer this with a trained classifier (next lesson) and a red-team suite (the lesson after). Keyword screens are the cheap first net, not the only net. Press Run to watch a mix of attacks and benign asks get sorted.
Write detect_jailbreak(text) returning {"flag": bool, "reason": str}. Lower-case the text, then check three attack families in priority order and return the first hit: ignore-previous (ignore previous, disregard the above, forget your instructions), role-play-override (you are now dan, no restrictions, pretend you have no rules, developer mode), and encoded-instruction (base64, rot13, decode this, execute the following hex). Benign text returns {"flag": False, "reason": "clean"}.
This lesson is locked
Lessons open one at a time. Finish the previous lesson to unlock this one.