Fine-Tuning, Conceptually

Building DPO Preference Pairs

Supervised fine-tuning needs one gold answer per prompt. Preference tuning with Direct Preference Optimization (DPO) needs something different and often easier to collect: for the same prompt, a chosen answer and a rejected one. The model then learns to prefer the chosen answer over the rejected one. You rarely have those pairs directly, but you very often have rankings: a human (or a judge model) ordering several candidate answers best to worst. This lesson turns rankings into the pairwise data DPO actually trains on.

The rule is simple: if candidates are listed best-first, then for every pair where one ranks above another, the higher one is chosen and the lower one is rejected. A ranked list of n candidates yields n*(n-1)/2 preference pairs, every earlier-vs-later combination:

ranked = ["great", "ok", "bad"]   # best to worst
# -> (great > ok), (great > bad), (ok > bad)   = 3 pairs
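
As a quick check of the counting rule, the sketch below enumerates the earlier-vs-later combinations with Python's standard itertools.combinations, which emits tuples in positional order, so the first element of each tuple always ranks above the second. It only illustrates the rule and reuses the three-candidate list from above.

```python
from itertools import combinations

ranked = ["great", "ok", "bad"]  # best to worst

# combinations() preserves input order, so each tuple is (higher, lower).
pairs = list(combinations(ranked, 2))
print(pairs)
# [('great', 'ok'), ('great', 'bad'), ('ok', 'bad')]

# The count matches n*(n-1)/2.
n = len(ranked)
print(len(pairs), n * (n - 1) // 2)
# 3 3
```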

Build to_preference_pairs(prompt, ranked) returning a list of {"prompt", "chosen", "rejected"} dicts, one for every index pair i < j in ranked (so ranked[i] is chosen over ranked[j]), in increasing (i, j) order. Fewer than two candidates yields an empty list, since there is no preference to express. The same prompt rides on every pair so the trainer knows the context.

Your turn

Write to_preference_pairs(prompt, ranked) where ranked is a list of candidate answers ordered best-first. Return a list of {"prompt": prompt, "chosen": ranked[i], "rejected": ranked[j]} for every i < j (chosen ranks above rejected), in increasing (i, j) order. A list with fewer than two candidates returns [].
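
One possible solution, offered as a minimal sketch rather than the lesson's reference answer: two nested loops visit every i < j in increasing order. The prompt string in the usage lines is a made-up placeholder for illustration.

```python
def to_preference_pairs(prompt, ranked):
    """Expand a best-first ranking into DPO-style preference pairs."""
    pairs = []
    for i in range(len(ranked)):
        # j starts after i, so ranked[i] always outranks ranked[j].
        for j in range(i + 1, len(ranked)):
            pairs.append(
                {"prompt": prompt, "chosen": ranked[i], "rejected": ranked[j]}
            )
    # With fewer than two candidates the inner loop never runs, giving [].
    return pairs


# Hypothetical prompt, used only to show the output shape.
example = to_preference_pairs("Summarize the report.", ["great", "ok", "bad"])
for pair in example:
    print(pair["chosen"], ">", pair["rejected"])
# great > ok
# great > bad
# ok > bad
```

No explicit length check is needed here: an empty or single-item list simply produces no (i, j) pairs.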

