Deterministic Train/Val Split
Once your dataset is clean you have to split it: most rows go to train (what the model learns from) and a held-out slice goes to validation (what you score it on). The held-out set is sacred. If an example leaks into both halves, your validation number is a lie, because the model was tested on something it trained on. So the split must be a clean partition: no overlap, nothing lost.
It must also be reproducible. "It scored 0.81" is meaningless if rerunning the script reshuffles into a different split and gives 0.79. The fix is the same one you use everywhere randomness shows up in ML: seed the generator. The same seed always produces the same split, so your experiment is repeatable and a teammate can reproduce your exact numbers.
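A minimal sketch of what seeding buys you, using only the standard library. The helper name shuffled_indices and the seed values 42 and 7 are arbitrary choices for illustration, not part of the lesson's API:

```python
import random

def shuffled_indices(n, seed):
    """Return the indices 0..n-1, shuffled by a generator seeded with `seed`."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return indices

first = shuffled_indices(10, seed=42)
second = shuffled_indices(10, seed=42)
other = shuffled_indices(10, seed=7)   # hypothetical second seed

print(first == second)                   # same seed, same order
print(sorted(other) == list(range(10)))  # any seed still yields a permutation of 0..9
```

Rerunning the script gives the same `first` every time, which is what makes a reported score traceable to one specific split.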
Note that we shuffle a list of indices rather than the rows themselves, so the caller's list is never reordered in place, and we pass an explicit seed to a private generator instead of touching the global one:
import random

def split_dataset(rows, val_frac, seed):
    indices = list(range(len(rows)))
    rng = random.Random(seed)  # a private RNG, does not disturb global state
    rng.shuffle(indices)
    n_val = int(round(len(rows) * val_frac))
    ...

Use random.Random(seed) (a fresh, isolated generator), not random.seed(...) on the global module, so calling your function does not silently change randomness elsewhere in a program.
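To make the difference concrete, here is a small sketch comparing the two approaches. The function names private_shuffle, global_shuffle and next_global_draws are hypothetical helpers written for this demonstration, and the seeds 123 and 0 are arbitrary:

```python
import random

def private_shuffle(items, seed):
    """Shuffle a copy of `items` with an isolated generator."""
    rng = random.Random(seed)
    out = list(items)
    rng.shuffle(out)
    return out

def global_shuffle(items, seed):
    """Shuffle a copy of `items` by reseeding the module-level generator."""
    random.seed(seed)  # side effect: changes randomness for the whole program
    out = list(items)
    random.shuffle(out)
    return out

def next_global_draws(shuffle_fn):
    """Seed the global RNG, call `shuffle_fn`, then see what the global RNG yields."""
    random.seed(123)
    shuffle_fn(range(20), 0)
    return [random.random() for _ in range(3)]

# What the global stream looks like when nothing interferes with it.
random.seed(123)
baseline = [random.random() for _ in range(3)]

print(next_global_draws(private_shuffle) == baseline)  # True
print(next_global_draws(global_shuffle) == baseline)   # the global stream was reseeded, so the draws differ
```

The private version leaves the caller's random stream exactly where it was, which is the property the lesson is asking for.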
Build split_dataset(rows, val_frac, seed) that returns {"train": [...], "val": [...]} where:
- the validation size is int(round(len(rows) * val_frac)), and train is the rest;
- train and val are disjoint and together contain every original row exactly once;
- the SAME seed always yields the SAME split, and a different seed generally yields a different one.
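The size and partition requirements above can be checked mechanically. The following is a sketch of one possible checker, not the grader this lesson uses: check_split is a hypothetical name, and it assumes rows are hashable values such as strings or tuples (dict rows would need to be compared by index instead).

```python
from collections import Counter

def check_split(rows, split, val_frac):
    """Raise AssertionError unless `split` is a clean partition of `rows`."""
    train, val = split["train"], split["val"]

    # Size rule from the requirements.
    assert len(val) == int(round(len(rows) * val_frac)), "wrong validation size"
    assert len(train) == len(rows) - len(val), "train is not the remainder"

    # Multiset equality: every original row is accounted for exactly once.
    # When the rows are all distinct, this also implies train and val are disjoint.
    assert Counter(train) + Counter(val) == Counter(rows), "rows lost or duplicated"
    return True

rows = ["a", "b", "c", "d", "e"]
good = {"train": ["a", "c", "e"], "val": ["d", "b"]}
print(check_split(rows, good, val_frac=0.5))  # True
```

One detail worth knowing: Python's round rounds exact halves to the nearest even integer, so int(round(5 * 0.5)) is 2, not 3. That is why the valid split above has two validation rows.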
This deterministic prep is everything the GPU run depends on. The actual training (LoRA adapters, an optimizer, a loss curve) happens on hardware this sandbox does not have. Press Run to split a list and confirm the two halves reconstruct the whole.
Write split_dataset(rows, val_frac, seed) that returns {"train": [...], "val": [...]}. Shuffle a list of indices with random.Random(seed) (not the global RNG), take the first int(round(len(rows) * val_frac)) as validation and the rest as train. The two halves must be disjoint, must together contain every original row exactly once, and must be reproducible for a given seed.
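If you want something to compare your attempt against, the steps described above can be sketched as follows. This is one possible reading of the task, not the official solution; the sample rows and the values val_frac=0.2 and seed=42 are made up for the demonstration.

```python
import random

def split_dataset(rows, val_frac, seed):
    """Split `rows` into train and val, reproducibly for a given seed."""
    indices = list(range(len(rows)))
    rng = random.Random(seed)          # private RNG, global state untouched
    rng.shuffle(indices)

    n_val = int(round(len(rows) * val_frac))
    val_idx = indices[:n_val]          # first n_val shuffled indices
    train_idx = indices[n_val:]        # everything else

    return {
        "train": [rows[i] for i in train_idx],
        "val": [rows[i] for i in val_idx],
    }

rows = [f"row-{i}" for i in range(10)]  # hypothetical sample data
split = split_dataset(rows, val_frac=0.2, seed=42)
again = split_dataset(rows, val_frac=0.2, seed=42)

print(len(split["train"]), len(split["val"]))                  # 8 2
print(split == again)                                          # True
print(sorted(split["train"] + split["val"]) == sorted(rows))   # True
```

Because the two halves are slices of one shuffled index list, they cannot share an index and cannot drop one, so the partition property holds by construction.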