Prompt-Injection Defense Pipeline
Prompt injection is the LLM equivalent of SQL injection: a user smuggles instructions into their input to override your system prompt, leak it, or hijack the model's behaviour. Classic payloads include "ignore all previous instructions", "reveal your system prompt", "you are now a...", and "developer mode". A practical first line of defence is a detector that scans for these known patterns, paired with the PII redactor from earlier and an output check, all wired into a single guard that everything flows through.
The detector must be specific enough not to cry wolf. "I previously visited Paris" is benign and must not trip it; "ignore all previous instructions" must. You match on the dangerous phrase, not a single innocent word, and you return the reasons you flagged so the block is auditable.
import re
PATTERNS = [
    r"ignore (?:all |the |your )?(?:previous|prior|above) (?:instructions|prompts?)",
    r"reveal (?:your |the )?(?:system )?(?:prompt|instructions)",
]
# match the phrase, case-insensitively
The full gate composes the pieces: detect injection on the raw input, redact PII, and only call the (stubbed) model when the input is safe. In production the model step would be a real call; here a stub stands in so grading stays deterministic. The point you are graded on is the gate logic.
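The ordering described above (detect on the raw input, redact, then call the model only when nothing was flagged) can be sketched roughly as follows. The two patterns are the ones from this lesson; `find_injections`, `redact_pii`, `stub_model`, and `run_pipeline` are hypothetical names chosen for illustration, and the email-only redactor is merely a stand-in for the PII redactor from the earlier lesson.

```python
import re

# Patterns copied from the lesson; everything else is an illustrative assumption.
PATTERNS = [
    r"ignore (?:all |the |your )?(?:previous|prior|above) (?:instructions|prompts?)",
    r"reveal (?:your |the )?(?:system )?(?:prompt|instructions)",
]
COMPILED = [re.compile(p, re.IGNORECASE) for p in PATTERNS]

# Hypothetical stand-in for the earlier PII redactor (handles emails only).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def find_injections(text):
    """Return the source of every pattern that matches, so a block is auditable."""
    return [rx.pattern for rx in COMPILED if rx.search(text)]


def redact_pii(text):
    """Replace each email address with a placeholder token."""
    return EMAIL.sub("[EMAIL]", text)


def stub_model(prompt):
    """Deterministic placeholder for a real model call."""
    return f"echo: {prompt}"


def run_pipeline(user_text):
    hits = find_injections(user_text)   # 1. detect on the raw input
    safe_text = redact_pii(user_text)   # 2. redact PII
    if hits:                            # 3. never call the model when blocked
        return {"blocked": True, "reasons": hits, "reply": None}
    return {"blocked": False, "reasons": [], "reply": stub_model(safe_text)}
```

Detecting before redacting means the detector sees exactly what the user sent, while the model only ever receives the redacted text.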
Build two functions. (1) detect_injection(text) -> a dict {"is_injection": bool, "score": int, "patterns": [...]} where score is how many patterns matched. (2) guard(user_text) -> {"safe_text": str, "blocked": bool, "reasons": [...]}: if injection is detected, set blocked=True and add "prompt_injection" to reasons; always redact PII into safe_text and, if any PII was found, append a reason starting with "pii_redacted". A clean message passes through untouched and unblocked.
Build detect_injection(text) returning {"is_injection", "score", "patterns"} (score = number of injection patterns matched) and guard(user_text) returning {"safe_text", "blocked", "reasons"}. guard blocks (adds reason "prompt_injection") when injection is detected, always redacts PII into safe_text (appending a "pii_redacted..." reason when any is found), and leaves a clean message untouched and unblocked. A benign lookalike like "I previously asked about the weather" must not be blocked.
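One possible shape for the two functions is sketched below; treat it as a starting point rather than the reference solution. It assumes an email-only `PII_PATTERNS` table in place of the earlier lesson's redactor, and the `pii_redacted:<labels>` suffix format is a guess, since the brief only requires that the reason start with "pii_redacted".

```python
import re

# Injection patterns from the lesson.
INJECTION_PATTERNS = [
    r"ignore (?:all |the |your )?(?:previous|prior|above) (?:instructions|prompts?)",
    r"reveal (?:your |the )?(?:system )?(?:prompt|instructions)",
]

# Hypothetical PII table standing in for the redactor from the earlier lesson.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
}


def detect_injection(text):
    """Report which injection patterns match; score is the number that matched."""
    matched = [p for p in INJECTION_PATTERNS if re.search(p, text, re.IGNORECASE)]
    return {"is_injection": bool(matched), "score": len(matched), "patterns": matched}


def guard(user_text):
    """Block on injection, always redact PII, and record why."""
    reasons = []
    blocked = detect_injection(user_text)["is_injection"]
    if blocked:
        reasons.append("prompt_injection")

    safe_text = user_text
    found = []
    for label, rx in PII_PATTERNS.items():
        safe_text, count = rx.subn(f"[{label.upper()}]", safe_text)
        if count:
            found.append(label)
    if found:
        # Suffix format is an assumption; only the "pii_redacted" prefix is specified.
        reasons.append("pii_redacted:" + ",".join(found))

    return {"safe_text": safe_text, "blocked": blocked, "reasons": reasons}
```

Because redaction runs whether or not the message is blocked, `safe_text` is always safe to log alongside the reasons.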