Understanding Reference Policies in Direct Preference Optimization
- URL: http://arxiv.org/abs/2407.13709v2
- Date: Thu, 22 Aug 2024 17:56:15 GMT
- Title: Understanding Reference Policies in Direct Preference Optimization
- Authors: Yixin Liu, Pengfei Liu, Arman Cohan
- Abstract summary: Direct Preference Optimization (DPO) has become a widely used training method for the instruction fine-tuning of large language models (LLMs).
This work explores an under-investigated aspect of DPO - its dependency on the reference model or policy.
- Score: 50.67309013764383
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Direct Preference Optimization (DPO) has become a widely used training method for the instruction fine-tuning of large language models (LLMs). In this work, we explore an under-investigated aspect of DPO - its dependency on the reference model or policy. Such reference policies, typically instantiated as the model to be further fine-tuned, are important since they can impose an upper limit on DPO's effectiveness. Therefore, we address three related research questions in this work. First, we explore the optimal strength of the KL divergence constraint in DPO, which penalizes deviations from the reference policy, and find that DPO is sensitive to this strength. Next, we examine the necessity of the KL-constraint from the reference policies in DPO by providing both theoretical and empirical comparisons between DPO and related learning objectives, demonstrating DPO's superiority in this controlled setting. Additionally, we investigate whether DPO benefits from stronger reference policies, finding that a stronger reference policy can lead to improved performance, but only when it is similar to the model being fine-tuned. Our findings highlight the confounding role of reference policies in DPO and offer insights for best practices, while also identifying open research questions for future studies.
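For concreteness, the reference policy enters DPO through its training objective. The sketch below shows the standard DPO loss from the original DPO formulation, not code from this paper; the tensor names and default beta value are illustrative. Here beta is the KL-constraint strength examined in the first research question, and the ref_* log-probabilities come from the frozen reference policy.

```python
# Minimal sketch of the standard DPO loss (Rafailov et al., 2023), included
# only to make the roles of the reference policy and the KL strength beta
# concrete; not this paper's code.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Each input is a 1-D tensor of sequence-level log-probabilities for a
    batch of (prompt, chosen, rejected) triples; the ref_* tensors come from
    the frozen reference policy."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Larger beta enforces a tighter implicit KL constraint toward the
    # reference policy; smaller beta lets the fine-tuned model deviate more.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```

In this form it is visible why the reference policy can cap DPO's effectiveness: every update is scored relative to the reference model's log-probabilities, and beta controls how strongly deviations from it are penalized.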
Related papers
- A Comprehensive Survey of Datasets, Theories, Variants, and Applications in Direct Preference Optimization [52.42860559005861]
Direct Preference Optimization (DPO) has emerged as a promising approach for alignment.
DPO has seen various advancements and has inherent limitations, yet an in-depth review of these aspects is currently lacking in the literature.
arXiv Detail & Related papers (2024-10-21T02:27:24Z)
- Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC.
We increase the consistency and informativeness of the pairwise preference signals through targeted modifications.
We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z)
- Reflective Policy Optimization [20.228281670899204]
Reflective Policy Optimization (RPO) amalgamates past and future state-action information for policy optimization.
RPO enables the agent to introspect and modify its actions within the current state.
Empirical results demonstrate RPO's feasibility and efficacy in two reinforcement learning benchmarks.
arXiv Detail & Related papers (2024-06-06T01:46:49Z)
- D2PO: Discriminator-Guided DPO with Response Evaluation Models [63.71853401569461]
We propose D2PO, discriminator-guided DPO, for the online setting where preferences are being collected throughout learning.
As we collect gold preferences, we use these not only to train our policy, but to train a discriminative response evaluation model to silver-label even more synthetic data for policy training.
We show conditions under which silver labeling is most helpful: it is most effective when training the policy with DPO, outperforming traditional PPO, and benefits from maintaining a separate discriminator from the policy model.
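The following is a hypothetical outline of the online loop described above; every callable and name is an illustrative placeholder rather than D2PO's actual API, and the real method differs in its details.

```python
# Hypothetical outline of a discriminator-guided online preference loop in the
# spirit of D2PO; all callables below are illustrative placeholders.
from typing import Callable, List

def online_discriminator_guided_dpo(sample_response_pairs: Callable[[], List],
                                    gold_label: Callable[[List], List],
                                    fit_discriminator: Callable[[List], Callable],
                                    silver_label: Callable[[Callable, List], List],
                                    dpo_update: Callable[[List], None],
                                    rounds: int = 10) -> None:
    gold: List = []
    for _ in range(rounds):
        gold += gold_label(sample_response_pairs())              # costly gold preferences
        scorer = fit_discriminator(gold)                         # response evaluation model
        silver = silver_label(scorer, sample_response_pairs())   # cheap silver labels
        dpo_update(gold + silver)                                # DPO step on gold + silver data
```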
arXiv Detail & Related papers (2024-05-02T17:44:41Z)
- DPO Meets PPO: Reinforced Token Optimization for RLHF [36.97894955691627]
We introduce a framework that models RLHF problems as a Markov decision process (MDP).
Under this framework, we introduce an algorithm, dubbed Reinforced Token Optimization (RTO), which learns the token-wise reward function from preference data.
For its practical implementation, RTO integrates Direct Preference Optimization (DPO) and Proximal Policy Optimization (PPO).
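As a rough sketch of what such a token-wise reward can look like, assuming the beta log-ratio parameterization common in the DPO literature (the paper's exact construction may differ):

```python
# Sketch: per-token reward as beta times the log-ratio between the learned
# policy and the reference policy; a PPO-style optimizer can then consume
# these rewards token by token.
import torch

def token_wise_rewards(policy_token_logps: torch.Tensor,
                       ref_token_logps: torch.Tensor,
                       beta: float = 0.1) -> torch.Tensor:
    """Both inputs have shape (batch, seq_len) and hold log-probabilities of
    the generated tokens; the output is a reward r_t for each token."""
    return beta * (policy_token_logps - ref_token_logps)
```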
arXiv Detail & Related papers (2024-04-29T17:58:30Z)
- From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function [50.812404038684505]
We show that we can derive DPO in the token-level MDP as a general inverse Q-learning algorithm, which satisfies the Bellman equation.
We discuss applications of our work, including information elicitation in multi-turn dialogue, reasoning, agentic applications and end-to-end training of multi-model systems.
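For reference, the token-level correspondence this reading rests on can be sketched with the standard KL-regularized control identity (a generic statement of the setup, not the paper's full derivation): the optimal policy satisfies
$$
\beta \log \frac{\pi^{*}(a_t \mid s_t)}{\pi_{\mathrm{ref}}(a_t \mid s_t)} = Q^{*}(s_t, a_t) - V^{*}(s_t),
$$
so the scaled log-ratio to the reference policy behaves as an optimal advantage (a Q-value up to the value baseline), which is what allows DPO to be read as inverse Q-learning at the token level.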
arXiv Detail & Related papers (2024-04-18T17:37:02Z)
- Towards Analyzing and Understanding the Limitations of DPO: A Theoretical Perspective [25.34250859820326]
We provide an analytical framework based on field theory to analyze the optimization process of DPO.
We find that the DPO loss function decreases the probability of producing human dispreferred data at a faster rate than it increases the probability of producing preferred data.
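For context, the dynamics being analyzed are those of the standard DPO gradient (quoted here from the original DPO derivation, not the paper's field-theoretic result):
$$
\nabla_\theta \mathcal{L}_{\mathrm{DPO}} = -\beta\, \mathbb{E}_{(x, y_w, y_l)}\!\left[ \sigma\big(\hat r_\theta(x, y_l) - \hat r_\theta(x, y_w)\big)\big(\nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x)\big) \right],
\qquad
\hat r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)},
$$
where $y_w$ and $y_l$ are the preferred and dispreferred responses; the paper's claim concerns how quickly each of the two probability terms moves under this update.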
arXiv Detail & Related papers (2024-04-06T13:24:37Z)
- Fine-Tuning Language Models with Advantage-Induced Policy Alignment [80.96507425217472]
We propose a novel algorithm for aligning large language models to human preferences.
We show that it consistently outperforms PPO in language tasks by a large margin.
We also provide a theoretical justification supporting the design of our loss function.
arXiv Detail & Related papers (2023-06-04T01:59:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences of its use.