Syllabus Lesson 179 of 239 · Productionizing LLMs: Cost, Caching & Guardrails
Productionizing LLMs: Cost, Caching & Guardrails

Prompt-Injection Defense Pipeline

Prompt injection is the LLM equivalent of SQL injection: a user smuggles instructions into their input trying to override your system prompt, leak it, or hijack the model's behaviour. Classic payloads include "ignore all previous instructions", "reveal your system prompt", "you are now a...", and "developer mode". A practical first line of defence is a detector that scans for these known patterns, paired with the PII redactor from earlier and an output check, all wired into a single guard that everything flows through.

The detector must be specific enough not to cry wolf. "I previously visited Paris" is benign and must not trip it; "ignore all previous instructions" must. You match on the dangerous phrase, not a single innocent word, and you return the reasons you flagged so the block is auditable.

import re
PATTERNS = [r"ignore (?:all |the |your )?(?:previous|prior|above) (?:instructions|prompts?)",
            r"reveal (?:your |the )?(?:system )?(?:prompt|instructions)"]
# match the phrase, case-insensitively

The full gate composes the pieces: detect injection on the raw input, redact PII, and only call the (stubbed) model when the input is safe. In production the model step would be a real call; here a stub stands in so grading stays deterministic. The point you are graded on is the gate logic.

Build two functions. (1) detect_injection(text) -> a dict {"is_injection": bool, "score": int, "patterns": [...]} where score is how many patterns matched. (2) guard(user_text) -> {"safe_text": str, "blocked": bool, "reasons": [...]}: if injection is detected, set blocked=True and add "prompt_injection" to reasons; always redact PII into safe_text and, if any PII was found, append a reason starting with "pii_redacted". A clean message passes through untouched and unblocked.

Your turn

Build detect_injection(text) returning {"is_injection", "score", "patterns"} (score = number of injection patterns matched) and guard(user_text) returning {"safe_text", "blocked", "reasons"}. guard blocks (adds reason "prompt_injection") when injection is detected, always redacts PII into safe_text (appending a "pii_redacted..." reason when any is found), and leaves a clean message untouched and unblocked. A benign lookalike like "I previously asked about the weather" must not be blocked.

Spotted a problem in this lesson? Report it

Code · runs in your browser
Output