Building DPO Preference Pairs
Supervised fine-tuning needs one gold answer per prompt. Preference tuning (DPO) needs something different and often easier to collect: for the same prompt, a chosen answer and a rejected one. Training then teaches the model to prefer the chosen answer. You rarely have those pairs directly, but you very often have rankings: a human (or a judge model) orders several candidate answers from best to worst. This lesson turns rankings into the pairwise data DPO actually trains on.
The rule is simple: if candidates are listed best-first, then for every pair where one ranks above another, the higher one is chosen and the lower one is rejected. A ranked list of n candidates yields n*(n-1)/2 preference pairs, every earlier-vs-later combination:
ranked = ["great", "ok", "bad"] # best to worst
# -> (great > ok), (great > bad), (ok > bad) = 3 pairs
Build to_preference_pairs(prompt, ranked) returning a list of {"prompt", "chosen", "rejected"} dicts, one for every index pair i < j in ranked (so ranked[i] is chosen over ranked[j]), in that order. Fewer than two candidates yields an empty list (no preference to express). The same prompt rides on every pair so the trainer knows the context. Press Run to expand a ranking into training pairs.
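The n*(n-1)/2 count can be sanity-checked by enumerating index pairs directly. The sketch below is only an illustration: count_pairs is a hypothetical helper name, not part of the lesson, and itertools.combinations is used here as one convenient way to list every i < j pair.

```python
from itertools import combinations


def count_pairs(n: int) -> int:
    """Number of earlier-vs-later pairs in a ranked list of n candidates.

    count_pairs is a hypothetical helper for this illustration only.
    """
    return n * (n - 1) // 2


# combinations(range(n), 2) yields every (i, j) with i < j,
# so its length should match the closed-form count.
for n in range(6):
    index_pairs = list(combinations(range(n), 2))
    assert len(index_pairs) == count_pairs(n)

# For three candidates the index pairs come out in increasing (i, j) order.
print(list(combinations(range(3), 2)))  # [(0, 1), (0, 2), (1, 2)]
```

Note that n = 0 and n = 1 both give zero pairs, which matches the rule that fewer than two candidates yields an empty list.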
Write to_preference_pairs(prompt, ranked) where ranked is a list of candidate answers ordered best-first. Return a list of {"prompt": prompt, "chosen": ranked[i], "rejected": ranked[j]} for every i < j (chosen ranks above rejected), in increasing (i, j) order. A list with fewer than two candidates returns [].
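One possible solution, given as a minimal sketch rather than the reference answer: two nested loops walk every index pair i < j, which produces the pairs in increasing (i, j) order. The type hints assume the prompt and candidates are plain strings, which the lesson does not state explicitly, and the sample prompt is made up for the demonstration.

```python
from typing import Dict, List


def to_preference_pairs(prompt: str, ranked: List[str]) -> List[Dict[str, str]]:
    """Expand a best-first ranking into chosen/rejected pairs."""
    pairs = []
    for i in range(len(ranked)):
        # Every later candidate j ranks below candidate i.
        for j in range(i + 1, len(ranked)):
            pairs.append(
                {"prompt": prompt, "chosen": ranked[i], "rejected": ranked[j]}
            )
    # With fewer than two candidates the inner loop never runs, so this is [].
    return pairs


# Hypothetical prompt, used only to show the output shape.
for pair in to_preference_pairs("Summarize the article.", ["great", "ok", "bad"]):
    print(pair["chosen"], ">", pair["rejected"])
```

The empty-list case needs no special branch: with zero or one candidate there is no j greater than i, so nothing is appended.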