Curating an Instruction-Tuning Dataset
Let us be honest about what this module can and cannot do. Fine-tuning a model is a GPU job. You take a base model, show it thousands of example pairs, and run gradient descent to nudge its weights. That needs a training loop, a GPU, and hours of compute. None of that runs in this browser sandbox (there is no torch, no GPU, no network). So we will not pretend otherwise.
What we WILL teach is the part an AI engineer actually spends most of their time on, and the part that decides whether the expensive GPU run produces anything good: the dataset. A well-known result in the field (LIMA, "Less Is More for Alignment") showed that a model fine-tuned on only about 1,000 carefully curated examples could hold its own against, and in some comparisons beat, models tuned on far larger datasets. Garbage in, garbage out is not a slogan here; it is the whole game. You build the dataset that feeds the GPU.
An instruction-tuning record is just a small dict. The common shape is an instruction (what the user asks), an optional input (extra context), and the gold output (what a good model should reply):
{"instruction": "Summarize this review in one sentence.",
 "input": "The battery lasts forever but the screen is dim.",
 "output": "Great battery life, but the screen is too dim."}
Before any of this reaches a trainer you have to clean it. Real exports are full of rows with a missing output, a blank instruction that is just whitespace, a wrong type where a string was expected, and exact duplicates that waste compute and skew the model toward repeated examples. Your job is a deterministic filter.
Build two functions.
validate_pair(example) -> True only when example is a dict whose instruction AND output are both strings that are non-empty after .strip(). Anything else (missing key, non-string, blank or whitespace-only) returns False. A missing input is fine, it is optional.
clean_dataset(rows) -> drop every invalid pair, then drop exact duplicates (same instruction, same input, same output after stripping), keeping the FIRST occurrence. Return {"kept": [...], "stats": {"input", "kept", "dropped_invalid", "dropped_duplicate"}}.
An invalid row counts as dropped_invalid; a valid-but-repeated row counts as dropped_duplicate. The counts must reconcile: input == kept + dropped_invalid + dropped_duplicate.
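One way the two functions could look is sketched below. It is an illustration under stated assumptions, not the lesson's reference solution: the helper _dedup_key is a hypothetical name, a missing or non-string input is treated as an empty string when comparing duplicates, and kept rows are returned unmodified rather than stripped. The sample rows are made up for the demonstration.

```python
def validate_pair(example):
    """Return True only for a dict whose instruction and output are non-blank strings."""
    if not isinstance(example, dict):
        return False
    for key in ("instruction", "output"):
        value = example.get(key)
        # Reject missing keys, non-strings, and empty or whitespace-only strings.
        if not isinstance(value, str) or not value.strip():
            return False
    return True


def _dedup_key(example):
    # Hypothetical helper. Assumption: a missing or non-string input counts as "".
    raw_input = example.get("input")
    stripped_input = raw_input.strip() if isinstance(raw_input, str) else ""
    return (example["instruction"].strip(), stripped_input, example["output"].strip())


def clean_dataset(rows):
    """Drop invalid rows, then exact duplicates, keeping the first occurrence."""
    kept, seen = [], set()
    dropped_invalid = dropped_duplicate = 0
    for row in rows:
        if not validate_pair(row):
            dropped_invalid += 1
            continue
        key = _dedup_key(row)
        if key in seen:
            dropped_duplicate += 1
            continue
        seen.add(key)
        kept.append(row)  # Assumption: rows are kept as-is, not rewritten.
    stats = {
        "input": len(rows),
        "kept": len(kept),
        "dropped_invalid": dropped_invalid,
        "dropped_duplicate": dropped_duplicate,
    }
    return {"kept": kept, "stats": stats}


# Made-up sample rows covering each case.
rows = [
    {"instruction": "Say hi.", "output": "Hi!"},                  # valid
    {"instruction": "  Say hi. ", "output": "Hi!  "},             # duplicate after stripping
    {"instruction": "Say bye."},                                  # invalid: missing output
    {"instruction": "\t\n", "output": "x"},                       # invalid: blank instruction
    {"instruction": "Say bye.", "input": "", "output": "Bye!"},   # valid
    "not a dict",                                                 # invalid: wrong type
]
result = clean_dataset(rows)
stats = result["stats"]
print(stats)  # {'input': 6, 'kept': 2, 'dropped_invalid': 3, 'dropped_duplicate': 1}
# The counts reconcile.
assert stats["input"] == stats["kept"] + stats["dropped_invalid"] + stats["dropped_duplicate"]
```

Using a set of stripped tuples keeps the filter deterministic and single-pass: the first row with a given key wins, and every later match is counted as a duplicate.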
In a real pipeline the cleaned rows would then be tokenized and streamed to a trainer. That step is concept-only here, but it is worth knowing the landscape your dataset feeds into.
Full fine-tuning updates every weight in the model: typically the most capable option, the most expensive, and it produces a whole new copy of the model.
LoRA, the best-known PEFT (parameter-efficient fine-tuning) method, freezes the base model and trains only small low-rank adapters that ride on top, so the run often fits on a single GPU and you can swap adapters in and out of one shared base.
Beyond plain supervised fine-tuning there is preference tuning, which aligns tone and safety from human preferences rather than gold answers. DPO (Direct Preference Optimization) trains directly on chosen-vs-rejected answer pairs and is a simpler, often more stable alternative to the older RLHF/PPO reward-model loop.
Press Run to clean a small messy dataset and print the stats.
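Why low-rank adapters are so much cheaper than full fine-tuning comes down to simple arithmetic, which runs fine without a GPU. For one weight matrix of shape d x k, full fine-tuning updates all d * k entries, while a rank-r adapter trains two small matrices of shapes d x r and r x k, for r * (d + k) entries in total. The sizes below (4096 and rank 8) are illustrative assumptions, not tied to any particular model.

```python
def full_params(d, k):
    """Trainable entries when the whole d x k weight matrix is updated."""
    return d * k


def lora_params(d, k, r):
    """Trainable entries for a rank-r adapter: a d x r matrix plus an r x k matrix."""
    return r * (d + k)


# Illustrative sizes only (assumption).
d = k = 4096
r = 8

full = full_params(d, k)
lora = lora_params(d, k, r)

print(full)                  # 16777216
print(lora)                  # 65536
print(f"{lora / full:.4%}")  # 0.3906%
```

For this one matrix the adapter trains well under one percent of the entries, which is why the frozen base can be shared and only the small adapters need to be stored per task.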
Write validate_pair(example) returning True only when example is a dict whose instruction and output are both non-empty (after .strip()) strings; everything else returns False (a missing input is allowed). Then write clean_dataset(rows) that drops invalid pairs and exact duplicates (same stripped instruction/input/output, keeping the first), returning {"kept": [...], "stats": {"input", "kept", "dropped_invalid", "dropped_duplicate"}}.