Fine-Tuning, Conceptually

Deterministic Train/Val Split

Once your dataset is clean, you have to split it: most rows go to train (what the model learns from) and a held-out slice goes to validation (what you score it on). The held-out set is sacred. If an example leaks into both halves, your validation number is a lie, because the model was tested on something it trained on. So the split must be a clean partition: no overlap, nothing lost.

It must also be reproducible. "It scored 0.81" is meaningless if rerunning the script reshuffles into a different split and gives 0.79. The fix is the same one you use everywhere randomness shows up in ML: seed the generator. The same seed always produces the same split, so your experiment is repeatable and a teammate can reproduce your exact numbers.
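To make the reproducibility claim concrete, here is a minimal sketch. The helper name shuffled_indices is an assumption made for illustration and is not part of the lesson's API; it only shows that two generators built from the same seed produce the same ordering.

```python
import random

def shuffled_indices(n, seed):
    """Hypothetical helper: return 0..n-1 in an order fixed by `seed`."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # fresh generator, seeded explicitly
    return indices

first = shuffled_indices(10, seed=42)
second = shuffled_indices(10, seed=42)

print(first == second)                    # True: same seed, same order
print(sorted(first) == list(range(10)))   # True: a permutation, nothing lost
```

A different seed will usually give a different order, but that is not guaranteed, so a test should compare same-seed runs for equality and not rely on different seeds always disagreeing.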

Note that we shuffle a list of indices, not the rows themselves, because shuffle works in place and would otherwise reorder the caller's list. We also pass an explicit seed to a private generator instead of touching the global one:

import random

def split_dataset(rows, val_frac, seed):
    indices = list(range(len(rows)))
    rng = random.Random(seed)   # a private RNG, does not disturb global state
    rng.shuffle(indices)
    n_val = int(round(len(rows) * val_frac))
    ...

Use random.Random(seed) (a fresh, isolated generator), not random.seed(...) on the global module, so calling your function does not silently change randomness elsewhere in a program.
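The isolation can be checked directly. The sketch below is illustrative only: it seeds the global generator on purpose, purely so there is a known global sequence to compare against, then uses a private random.Random in between and confirms the global sequence is unchanged.

```python
import random

# Baseline: what the global generator produces right after seeding it.
random.seed(123)
expected = [random.random() for _ in range(3)]

# Same global seed, but this time a private generator does some work first.
random.seed(123)
private = random.Random(999)
private.shuffle(list(range(100)))  # advances only the private generator
observed = [random.random() for _ in range(3)]

print(observed == expected)  # True: the global stream was not disturbed
```

Had the shuffle been done with the module-level random.shuffle, it would have consumed values from the global stream and the two lists would no longer match.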

Build split_dataset(rows, val_frac, seed) that returns {"train": [...], "val": [...]} where:

  • the validation size is int(round(len(rows) * val_frac)), and train is the rest;
  • train and val are disjoint and together contain every original row exactly once;
  • the SAME seed always yields the SAME split, and a different seed generally yields a different one.
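One detail of the size formula is worth seeing in numbers: Python 3's round sends exact halves to the nearest even integer, so small datasets can end up with a smaller validation set than you might expect. The helper name val_size below is an assumption for illustration; the expression inside it is the one from the spec above.

```python
def val_size(n_rows, val_frac):
    """Hypothetical helper wrapping the lesson's size formula."""
    return int(round(n_rows * val_frac))

print(val_size(100, 0.2))   # 20
print(val_size(10, 0.25))   # 2: 2.5 rounds to the even neighbour, not up to 3
print(val_size(5, 0.1))     # 0: 0.5 rounds to 0, leaving validation empty
```

If an empty validation set is not acceptable for your use, that is a check to add around the split, not a reason to change the formula the exercise asks for.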

This deterministic prep is everything the GPU run depends on. The actual training (LoRA adapters, an optimizer, a loss curve) happens on hardware this sandbox does not have. Press Run to split a list and confirm the two halves reconstruct the whole.

Your turn

Write split_dataset(rows, val_frac, seed) that returns {"train": [...], "val": [...]}. Shuffle a list of indices with random.Random(seed) (not the global RNG), take the first int(round(len(rows) * val_frac)) as validation and the rest as train. The two halves must be disjoint, must together contain every original row exactly once, and must be reproducible for a given seed.
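Once you have a candidate answer, a small checker can confirm the partition properties. This is a sketch under stated assumptions: check_split is a hypothetical helper, not part of the lesson, and it assumes rows are hashable (strings, tuples, and so on) so they can be counted. It is demonstrated on hand-built splits so it does not depend on any particular solution.

```python
from collections import Counter

def check_split(rows, split, val_frac):
    """Hypothetical checker: return a list of problems; empty means valid."""
    problems = []
    train, val = split["train"], split["val"]

    expected_val = int(round(len(rows) * val_frac))
    if len(val) != expected_val:
        problems.append(f"val has {len(val)} rows, expected {expected_val}")

    # Multiset comparison: every original row must appear exactly once
    # across train and val, which rules out both leaks and lost rows.
    if Counter(train) + Counter(val) != Counter(rows):
        problems.append("train + val does not reconstruct the original rows")

    return problems

rows = ["a", "b", "c", "d", "e"]
good = {"train": ["c", "a", "e", "b"], "val": ["d"]}
leaky = {"train": ["a", "b", "c", "d"], "val": ["d"]}  # "d" leaks, "e" is lost

print(check_split(rows, good, 0.2))    # []
print(check_split(rows, leaky, 0.2))   # one problem reported
```

To test reproducibility as well, call your split_dataset twice with the same seed and compare the two results for equality.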

