BPO: Revisiting Preference Modeling in Direct Preference Optimization
- URL: http://arxiv.org/abs/2506.03557v1
- Date: Wed, 04 Jun 2025 04:21:01 GMT
- Title: BPO: Revisiting Preference Modeling in Direct Preference Optimization
- Authors: Lin Sun, Chuang Liu, Peng Liu, Bingyang Li, Weijia Lu, Ning Wu,
- Abstract summary: Direct Preference Optimization (DPO) has emerged as a popular method for aligning Large Language Models (LLMs) with human preferences. DPO effectively preserves the relative ordering between chosen and rejected responses through pairwise ranking losses. It often neglects absolute reward magnitudes, leading to poor performance. We propose Balanced Preference Optimization (BPO), a novel framework that balances the optimization of chosen and rejected responses.
- Score: 13.243174453617064
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Direct Preference Optimization (DPO) has emerged as a popular method for aligning Large Language Models (LLMs) with human preferences. While DPO effectively preserves the relative ordering between chosen and rejected responses through pairwise ranking losses, it often neglects absolute reward magnitudes. This oversight can decrease the likelihood of chosen responses and increase the risk of generating out-of-distribution responses, leading to poor performance. We term this issue Degraded Chosen Responses (DCR). To address this issue, we propose Balanced Preference Optimization (BPO), a novel framework that dynamically balances the optimization of chosen and rejected responses through two key components: balanced reward margin and gap adaptor. Unlike previous methods, BPO can fundamentally resolve DPO's DCR issue without introducing additional constraints to the loss function. Experimental results on multiple mathematical reasoning tasks show that BPO significantly outperforms DPO, improving accuracy by +10.1% with Llama-3.1-8B-Instruct (18.8% to 28.9%) and +11.7% with Qwen2.5-Math-7B (35.0% to 46.7%). It also surpasses DPO variants by +3.6% over IPO (43.1%), +5.0% over SLiC (41.7%), and +3.1% over Cal-DPO (43.6%) on the same model. Remarkably, our algorithm requires only a single line of code modification, making it simple to implement and fully compatible with existing DPO-based frameworks.
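The DCR issue described in the abstract can be illustrated with the standard DPO loss, which depends only on the difference between the chosen and rejected implicit rewards. The sketch below is not the BPO method (the abstract does not give its formula); it is a minimal example of the plain DPO objective, and the function name `dpo_loss`, the `beta` value, and all log-probabilities are hypothetical values chosen for illustration.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the policy (pi_*) and
    the frozen reference model (ref_*). Returns the loss together with
    the implicit rewards of the chosen and rejected responses.
    """
    reward_chosen = beta * (pi_chosen - ref_chosen)
    reward_rejected = beta * (pi_rejected - ref_rejected)
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    loss = math.log1p(math.exp(-margin))
    return loss, reward_chosen, reward_rejected

# Hypothetical numbers: at the start the policy equals the reference.
before = dpo_loss(pi_chosen=-10.0, pi_rejected=-12.0,
                  ref_chosen=-10.0, ref_rejected=-12.0)

# Later, both log-probabilities have dropped, the rejected one by more.
after = dpo_loss(pi_chosen=-14.0, pi_rejected=-30.0,
                 ref_chosen=-10.0, ref_rejected=-12.0)

print("loss before/after:", before[0], after[0])
print("chosen reward before/after:", before[1], after[1])
```

In this made-up example the loss goes down even though the chosen response has become less likely than under the reference model, because only the margin enters the loss. That is the behavior the abstract attributes to neglecting absolute reward magnitudes.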
Related papers
- Reverse Preference Optimization for Complex Instruction Following [61.39734201711077]
We propose a simple yet effective method called Reverse Preference Optimization (RPO). It mitigates noise in preference pairs by dynamically reversing the constraints within the instruction to ensure the chosen response is perfect. RPO scales effectively across model sizes, with the 70B RPO model surpassing GPT-4o.
arXiv Detail & Related papers (2025-05-28T09:44:27Z)
- Preference Optimization by Estimating the Ratio of the Data Distribution [12.378291609381677]
We propose Bregman preference optimization (BPO) for ratio matching. BPO subsumes DPO as a special case and offers tractable forms for all instances. In experiments, unlike other probabilistic loss extensions such as $f$-DPO or $f$-PO, instances of BPO improve both win rate and entropy compared with DPO.
arXiv Detail & Related papers (2025-05-26T07:10:53Z)
- Uncertainty-Penalized Direct Preference Optimization [52.387088396044206]
We develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes.
The penalization serves as a correction to the loss which attenuates the loss gradient for uncertain samples.
We show improved overall performance compared to vanilla DPO, as well as better completions on prompts from high-uncertainty chosen/rejected responses.
arXiv Detail & Related papers (2024-10-26T14:24:37Z)
- $α$-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs [45.46582930202524]
$\alpha$-DPO is an adaptive preference optimization algorithm for large language models.
It balances the policy model and the reference model to achieve personalized reward margins.
It consistently outperforms DPO and SimPO across various model settings.
arXiv Detail & Related papers (2024-10-14T04:29:57Z)
- Minor DPO reject penalty to increase training robustness [8.971332948872185]
Learning from human preference is a paradigm used in the fine-tuning step of large language models (LLMs) to better align a pretrained LLM with human preferences for downstream tasks.
Recently, Direct Preference Optimization (DPO) has been proposed to solve the alignment problem with a simplified RL-free method.
In this article, we analyze the working mechanism of $\beta$ in DPO, disclose its syntax difference between RL algorithm and DPO, and understand the potential shortage brought by the DPO simplification.
arXiv Detail & Related papers (2024-08-19T09:29:31Z)
- Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC. We increase the consistency and informativeness of the pairwise preference signals through targeted modifications. We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z)
- Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs [54.05511925104712]
We propose a simple, effective, and data-efficient method called Step-DPO.
Step-DPO treats individual reasoning steps as units for preference optimization rather than evaluating answers holistically.
Our findings demonstrate that as few as 10K preference data pairs and fewer than 500 Step-DPO training steps can yield a nearly 3% gain in accuracy on MATH for models with over 70B parameters.
arXiv Detail & Related papers (2024-06-26T17:43:06Z)
- WPO: Enhancing RLHF with Weighted Preference Optimization [40.07940023654452]
Reinforcement learning from human feedback (RLHF) is a promising solution to align large language models (LLMs) more closely with human values.
Off-policy preference optimization often suffers from a distributional gap between the policy used for data collection and the target policy, leading to suboptimal optimization.
We propose a novel strategy to mitigate this problem by simulating on-policy learning with off-policy preference data.
arXiv Detail & Related papers (2024-06-17T17:59:13Z)
- Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level [50.897438358317686]
We show that iLR-DPO can enhance a 7B model to perform on par with GPT-4 without increasing verbosity.
Specifically, our 7B model achieves a 50.5% length-controlled win rate against GPT-4 Preview on AlpacaEval 2.0.
arXiv Detail & Related papers (2024-06-17T17:55:38Z)
- Triple Preference Optimization: Achieving Better Alignment using a Single Step Optimization [34.29965046863887]
Triple Preference Optimization (TPO) is a new preference learning method designed to enhance both reasoning and instruction-following abilities. TPO achieves significant improvements over existing methods without substantially increasing response length across different dataset sizes.
arXiv Detail & Related papers (2024-05-26T20:18:11Z)
- Self-Play Preference Optimization for Language Model Alignment [75.83359213697854]
Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences.
We propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game.
Our approach, dubbed Self-Play Preference Optimization (SPPO), utilizes iterative policy updates to provably approximate the Nash equilibrium.
arXiv Detail & Related papers (2024-05-01T17:59:20Z)
- Direct Preference Optimization with an Offset [58.7977683502207]
Direct preference optimization (DPO) is a successful strategy for aligning large language models with human preferences.
We propose a generalization of DPO, termed DPO with an offset (ODPO), that does not treat every preference pair equally during fine-tuning.
arXiv Detail & Related papers (2024-02-16T10:55:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.