Mitigating Mismatch within Reference-based Preference Optimization
- URL: http://arxiv.org/abs/2602.11902v1
- Date: Thu, 12 Feb 2026 12:55:51 GMT
- Title: Mitigating Mismatch within Reference-based Preference Optimization
- Authors: Suqin Yuan, Xingrui Yu, Jiyang Zheng, Lei Feng, Dadong Wang, Ivor Tsang, Tongliang Liu
- Abstract summary: Direct Preference Optimization (DPO) has become the de facto standard for offline preference alignment of large language models. DPO weighs each update relative to a reference, which stabilizes the training by regularizing the updates within a trusted region. This reliance becomes problematic for pessimistic pairs, where the reference model prefers the rejected response. We modify DPO to treat the reference as neutral when it is pessimistic by replacing $Δ_θ-Δ_{\mathrm{ref}}$ with $Δ_θ-\max\{0,Δ_{\mathrm{ref}}\}$.
- Score: 55.07698254211876
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Direct Preference Optimization (DPO) has become the de facto standard for offline preference alignment of large language models, but its reliance on a reference policy introduces a critical tension. DPO weighs each update relative to a reference, which stabilizes the training by regularizing the updates within a trusted region. This reliance becomes problematic for pessimistic pairs, where the reference model prefers the rejected response. For these pairs, DPO prematurely attenuates the gradient as soon as the policy margin ($Δ_θ$) merely beats the reference margin ($Δ_{\mathrm{ref}}$) even if the policy is still wrong ($Δ_θ<0$). We name this failure premature satisfaction, which is a concrete form of the training-inference mismatch. Reference-free objectives remove this mismatch by optimizing the absolute margin, but at the cost of discarding the stabilizing signal of the reference. We mitigate this tension with Hybrid-DPO (HyPO), a drop-in modification to DPO that applies reference conditionally: HyPO behaves exactly like DPO when the reference is optimistic or neutral, and it treats the reference as neutral when it is pessimistic by replacing $Δ_θ-Δ_{\mathrm{ref}}$ with $Δ_θ-\max\{0,Δ_{\mathrm{ref}}\}$. This one-line change strictly strengthens per-example learning signals on pessimistic pairs while preserving DPO's objective form and computational cost. By conditionally debiasing the pessimistic reference signal, HyPO mitigates premature satisfaction; empirically, across preference alignment, HyPO improves inference-aligned metrics and achieves higher pairwise win rates. Our results provide evidence that direct preference alignment could be enhanced by conditionally debiasing the reference signal, rather than discarding it.
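The difference between the two objectives described in the abstract can be illustrated with a minimal sketch. It assumes the standard DPO per-example loss $-\log\sigma(\beta(Δ_θ-Δ_{\mathrm{ref}}))$; the temperature `beta`, its default value, and all function names are illustrative assumptions, not taken from the paper. Only the replacement of $Δ_{\mathrm{ref}}$ with $\max\{0,Δ_{\mathrm{ref}}\}$ comes from the abstract.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(delta_theta: float, delta_ref: float, beta: float = 0.1) -> float:
    """Standard DPO per-example loss (assumed form): -log sigmoid(beta * (d_theta - d_ref))."""
    return -log_sigmoid(beta * (delta_theta - delta_ref))


def hypo_loss(delta_theta: float, delta_ref: float, beta: float = 0.1) -> float:
    """HyPO-style loss: a pessimistic reference (d_ref < 0) is clamped to 0 (neutral)."""
    anchor = max(0.0, delta_ref)
    return -log_sigmoid(beta * (delta_theta - anchor))


def grad_weight(delta_theta: float, anchor: float, beta: float = 0.1) -> float:
    """Magnitude of d(loss)/d(delta_theta) = beta * sigmoid(-beta * (d_theta - anchor))."""
    return beta * math.exp(log_sigmoid(-beta * (delta_theta - anchor)))


# Hypothetical margins: (policy margin, reference margin).
pairs = [(-1.0, -4.0),  # pessimistic reference, policy still wrong
         (1.0, 2.0)]    # optimistic reference
for d_theta, d_ref in pairs:
    print(d_theta, d_ref,
          dpo_loss(d_theta, d_ref),
          hypo_loss(d_theta, d_ref),
          grad_weight(d_theta, d_ref),
          grad_weight(d_theta, max(0.0, d_ref)))
```

On the pessimistic pair the policy margin already exceeds the reference margin, so the DPO gradient weight has started to shrink even though $Δ_θ<0$; the clamped anchor keeps the weight larger. On the optimistic pair the two losses coincide.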
Related papers
- Anchored Policy Optimization: Mitigating Exploration Collapse Via Support-Constrained Rectification [14.911955979675772]
We propose Anchored Policy Optimization (APO) to shift the paradigm from global Shape Matching to Support Coverage. APO breaks the accuracy-diversity trade-off, significantly improving Pass@1 while restoring the Pass@K diversity typically lost by standard policy gradient methods.
arXiv Detail & Related papers (2026-02-05T14:41:57Z)
- Rethinking the Trust Region in LLM Reinforcement Learning [72.25890308541334]
Proximal Policy Optimization (PPO) serves as the de facto standard algorithm for Large Language Models (LLMs). We propose Divergence Proximal Policy Optimization (DPPO), which substitutes clipping with a more principled constraint. DPPO achieves superior training and efficiency compared to existing methods, offering a more robust foundation for RL-based fine-tuning.
arXiv Detail & Related papers (2026-02-04T18:59:04Z)
- Learning Where It Matters: Geometric Anchoring for Robust Preference Alignment [6.428964221372943]
We propose Geometric Anchor Preference Optimization (GAPO), which replaces the fixed reference with a dynamic, geometry-aware anchor. GAPO consistently improves robustness while matching or improving performance on standard LLM alignment and reasoning benchmarks.
arXiv Detail & Related papers (2026-02-04T00:40:21Z)
- Coverage Improvement and Fast Convergence of On-policy Preference Learning [67.36750525893514]
Online on-policy preference learning algorithms for language model alignment can significantly outperform their offline counterparts. We analyze how the sampling policy's coverage evolves throughout on-policy training. We develop principled on-policy schemes for reward distillation in the general function class setting.
arXiv Detail & Related papers (2026-01-13T10:46:06Z)
- Margin Adaptive DPO: Leveraging Reward Model for Granular Control in Preference Optimization [0.0]
Margin-Adaptive Direct Preference Optimization (MADPO) provides a stable, data-preserving, and instance-level solution. We provide a comprehensive theoretical analysis, proving that MADPO has a well-behaved optimization landscape. It achieves performance gains of up to +33.3% on High Quality data and +10.5% on Low Quality data over the next-best method.
arXiv Detail & Related papers (2025-10-06T20:09:37Z)
- Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance [52.65461207786633]
Policy-based Reinforcement Learning from Human Feedback is essential for aligning large language models with human preferences. It requires joint training of an actor and critic with a pretrained, fixed reward model for guidance. We propose Decoupled Value Policy Optimization (DVPO), a lean framework that replaces traditional reward modeling with a pretrained global value model (GVM).
arXiv Detail & Related papers (2025-02-24T08:11:33Z)
- C2-DPO: Constrained Controlled Direct Preference Optimization [22.730518243326394]
Direct preference optimization (DPO) has emerged as a promising approach for solving the alignment problem in AI. We show that the DPO loss could be derived by starting from an alternative optimization problem that only defines the KL guardrail on in-sample responses.
arXiv Detail & Related papers (2025-02-22T00:38:44Z)
- Uncertainty-Penalized Direct Preference Optimization [52.387088396044206]
We develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes.
The penalization serves as a correction to the loss which attenuates the loss gradient for uncertain samples.
We show improved overall performance compared to vanilla DPO, as well as better completions on prompts from high-uncertainty chosen/rejected responses.
arXiv Detail & Related papers (2024-10-26T14:24:37Z)
- AlphaDPO: Adaptive Reward Margin for Direct Preference Optimization [45.46582930202524]
$\alpha$-DPO is an adaptive preference optimization algorithm for large language models. It balances the policy model and the reference model to achieve personalized reward margins. It consistently outperforms DPO and SimPO across various model settings.
arXiv Detail & Related papers (2024-10-14T04:29:57Z)
- Modulated Intervention Preference Optimization (MIPO): Keep the Easy, Refine the Difficult [0.48951183832371004]
We propose Modulated Intervention Preference Optimization (MIPO) to address this issue.
MIPO modulates the degree of intervention from the reference model based on how well the given data is aligned with it.
We compare the performance of MIPO and DPO using Mistral-7B and Llama3-8B in Alpaca Eval 2.0 and MT-Bench.
arXiv Detail & Related papers (2024-09-26T05:24:14Z)
- Correcting the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared Preference Optimization [78.82586283794886]
$\chi^2$-Preference Optimization ($\chi$PO) is an efficient offline alignment algorithm provably robust to overoptimization. $\chi$PO implements the principle of pessimism in the face of uncertainty via regularization. $\chi$PO's simplicity and strong guarantees make it the first practical and general-purpose offline alignment algorithm provably robust to overoptimization.
arXiv Detail & Related papers (2024-07-18T11:08:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.