We're excited to announce that the Together Fine-Tuning Platform now supports Direct Preference Optimization (DPO)! This technique allows developers to align language models with human preferences, creating more helpful, accurate, and tailored AI assistants. In this deep-dive blog post, we cover what DPO is, how it works, and when to use it, with code examples along the way. If you'd like to jump straight into code, have a look at our code notebook.

## Tuning LLMs on Preference Data

Modern language model development typically follows a three-stage process:

1. **Pre-training** on internet-scale data to build a foundation model with broad knowledge
2. **Supervised fine-tuning (SFT)** on specific high-quality examples to adapt the model to a particular knowledge domain or task
3. **Preference-based learning** to refine the model based on human preferences

*Source: Andrej Karpathy's "State of GPT" talk.*

This final stage, preference learning, is where DPO comes in as an alternative to Reinforcement Learning from Human Feedback (RLHF). It ensures that models not only perform tasks correctly, but do so in ways that users prefer. It also lets you teach the model the nuances of a particular use case by showing examples of what is expected and what the model should avoid. In business settings, you might employ DPO to improve:

- Helpfulness
- Tone
- Truthfulness
- Harmlessness
- Instruction-following

Preference tuning shapes the model's generation quality and its alignment with human and business values.

## What is Direct Preference Optimization?

DPO is a method for aligning language models with human preferences without using reinforcement learning (RL). Unlike traditional approaches, DPO allows you to train language models directly on preference data, where each example consists of:

- A prompt or instruction
- A preferred (chosen) response
- An unpreferred (rejected) response

For example, you might have a dataset entry like this:

```json
{
  "input": {
    "messages": [
      { "role": "assistant", "content": "Hello, how can I assist you today?" },
      { "role": "u...
```
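Under the hood, DPO turns these preference pairs into a simple classification-style objective: it increases the gap between the policy's implicit reward for the chosen response and for the rejected one, measured relative to a frozen reference model. As a rough illustration, here is a minimal PyTorch sketch of that loss; the tensor names and the `beta` value are ours, for illustration only, and are not part of any platform API.

```python
# Minimal sketch of the DPO loss, assuming you already have the summed
# log-probability of each full response under the policy and under a frozen
# reference model. Tensor names here are illustrative, not an API.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of (chosen, rejected) preference pairs."""
    # Implicit rewards: how much more the policy favors each response
    # than the reference model does.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Maximize the log-sigmoid of the margin between chosen and rejected.
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards)
    return loss.mean()

# Dummy log-probabilities for a batch of two preference pairs.
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.3, -8.1]),
    policy_rejected_logps=torch.tensor([-10.9, -9.4]),
    ref_chosen_logps=torch.tensor([-13.0, -8.5]),
    ref_rejected_logps=torch.tensor([-10.5, -9.0]),
)
print(loss.item())
```

The `beta` term controls how far the policy is allowed to drift from the reference model: small values keep the fine-tuned model close to its starting point, larger values let the preference signal dominate.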
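To try preference tuning on the platform itself, the flow is the usual upload-then-create-job pattern of the Together Python SDK. The sketch below is hedged: the file-upload and job-creation calls are standard SDK methods, but the DPO-specific arguments (`training_method`, `dpo_beta`) and the model name are assumptions for illustration, so check the linked notebook and the Fine-Tuning API reference for the exact parameters.

```python
# Hedged sketch of launching a DPO fine-tuning job with the Together Python SDK.
# `training_method` and `dpo_beta` are assumed names for the DPO-specific
# options; consult the API reference for the exact arguments.
from together import Together

client = Together()  # reads TOGETHER_API_KEY from the environment

# Upload a JSONL file of preference pairs (prompt, chosen, rejected).
train_file = client.files.upload(file="preference_pairs.jsonl")

# Create a fine-tuning job that uses preference-based (DPO) training.
job = client.fine_tuning.create(
    training_file=train_file.id,
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Reference",  # example model
    training_method="dpo",  # assumption: selects DPO instead of SFT
    dpo_beta=0.1,           # assumption: strength of the reference constraint
    n_epochs=1,
)
print(job.id)
```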