
fine-tuning · alignment · training

Is DPO actually replacing RLHF for preference alignment, or are they solving different problems?

Research Scientist · AI lab, series B · Asked Mar 13, 2026 · 174 views

We need to align a fine-tuned model to our specific quality preferences: citation accuracy, hedging on uncertain claims, and preferred response length. RLHF is expensive to run and requires training a separate reward model. DPO is simpler, but we've heard it can be unstable on small preference datasets. In practice, what are teams using for domain-specific alignment with a relatively small preference dataset (500–2000 pairs)?
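For reference, by DPO we mean the standard objective from Rafailov et al. (2023): a logistic loss on the margin of policy-vs-reference log-probability ratios between the chosen and rejected responses. A minimal PyTorch sketch of that loss (not our actual training code; the beta value is just illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss (Rafailov et al., 2023).

    Each input is a (batch,) tensor of summed per-token log-probabilities
    for the chosen/rejected responses under the policy and the frozen
    reference model. beta scales the implicit reward and controls how far
    the policy may drift from the reference.
    """
    # Implicit rewards: policy-to-reference log-ratios for each response.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss on the reward margin between chosen and rejected.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Dummy log-probs for a batch of 4 preference pairs, just to show the call:
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```

Our question is less about the mechanics and more about what actually holds up at the 500–2000-pair scale.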

7 Answers