Displacement-Resistant Extensions of DPO with Nonconvex $f$-Divergences
- URL: http://arxiv.org/abs/2602.06788v1
- Date: Fri, 06 Feb 2026 15:45:37 GMT
- Title: Displacement-Resistant Extensions of DPO with Nonconvex $f$-Divergences
- Authors: Idan Pipano, Shoham Sabach, Kavosh Asadi, Mohammad Ghavamzadeh,
- Abstract summary: DPO and related algorithms align language models by directly optimizing the RLHF objective. We show that a condition on $f$, referred to as DPO-inducing, characterizes when the RLHF problem remains tractable. We then focus on a specific DPO-inducing and displacement-resistant $f$, leading to our novel SquaredPO loss.
- Score: 23.894803166231792
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: DPO and related algorithms align language models by directly optimizing the RLHF objective: find a policy that maximizes the Bradley-Terry reward while staying close to a reference policy through a KL divergence penalty. Previous work showed that this approach could be further generalized: the original problem remains tractable even if the KL divergence is replaced by any member of the family of $f$-divergences with a convex generating function $f$. Our first contribution is to show that convexity of $f$ is not essential. Instead, we identify a more general condition, referred to as DPO-inducing, that precisely characterizes when the RLHF problem remains tractable. Our next contribution is to establish a second condition on $f$ that is necessary to prevent probability displacement, a known empirical phenomenon in which the probabilities of the winner and the loser responses approach zero. We refer to any $f$ that satisfies this condition as displacement-resistant. We finally focus on a specific DPO-inducing and displacement-resistant $f$, leading to our novel SquaredPO loss. Compared to DPO, this new loss offers stronger theoretical guarantees while performing competitively in practice.
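As context for the abstract, here is a minimal sketch of the standard KL-regularized DPO loss that the paper generalizes. The underlying RLHF objective is $\max_\pi \mathbb{E}[r(x,y)] - \beta\, D_f(\pi \,\|\, \pi_{\mathrm{ref}})$, with $f$ the KL divergence in the original DPO; the exact form of the SquaredPO loss is not given in this abstract, so only the familiar KL-based case is shown.

```python
# Minimal sketch (not the paper's code): the standard KL-regularized DPO loss.
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss from sequence log-probs of the winner (w) and loser (l)."""
    # Implicit rewards are beta * (policy log-prob minus reference log-prob).
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Negative log-sigmoid of the reward margin (Bradley-Terry likelihood).
    return -F.logsigmoid(margin).mean()
```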
Related papers
- Mitigating Mismatch within Reference-based Preference Optimization [55.07698254211876]
Direct Preference Optimization (DPO) has become the de facto standard for offline preference alignment of large language models. DPO weighs each update relative to a reference, which stabilizes training by regularizing updates within a trusted region. This reliance becomes problematic for pessimistic pairs, where the reference model prefers the rejected response. We modify DPO to treat the reference as neutral when it is pessimistic, replacing the margin term $\Delta - \Delta_{\mathrm{ref}}$ with $\Delta - \max\{0, \Delta_{\mathrm{ref}}\}$ (where $\Delta$ and $\Delta_{\mathrm{ref}}$ denote the policy and reference log-probability margins).
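A minimal sketch of the clipping described above, assuming the clipped quantity is the reference log-probability margin between the chosen and rejected responses; the exact notation is not recoverable from this summary, so the function and symbol names below are assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss_neutral_reference(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss where a 'pessimistic' reference (one preferring the rejected
    response) is treated as neutral by clipping its margin at zero.
    Illustrative only; not the paper's notation or code."""
    policy_margin = logp_w - logp_l
    ref_margin = ref_logp_w - ref_logp_l
    # max{0, ref_margin}: if the reference prefers the rejected response
    # (ref_margin < 0), drop its influence instead of fighting it.
    clipped_ref = torch.clamp(ref_margin, min=0.0)
    return -F.logsigmoid(beta * (policy_margin - clipped_ref)).mean()
```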
arXiv Detail & Related papers (2026-02-12T12:55:51Z)
- GIFT: Group-relative Implicit Fine Tuning Integrates GRPO with DPO and UNA [6.07907277934348]
GIFT is a novel reinforcement learning framework for alignment. It minimizes the discrepancy between implicit and explicit reward models. It achieves superior reasoning and alignment performance on mathematical benchmarks.
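The summary only says that GIFT minimizes the discrepancy between implicit and explicit reward models; one common way to instantiate such a discrepancy, used here purely as an illustration and not as GIFT's actual objective, is to align DPO-style implicit rewards with explicit reward-model scores.

```python
import torch

def implicit_explicit_discrepancy(logp, ref_logp, explicit_reward, beta=0.1):
    """Illustrative discrepancy between a DPO-style implicit reward
    (beta * log-ratio) and an explicit reward score for the same responses.
    A guess at what 'minimizing the discrepancy' could look like, not GIFT's loss."""
    implicit_reward = beta * (logp - ref_logp)
    return ((implicit_reward - explicit_reward) ** 2).mean()
```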
arXiv Detail & Related papers (2025-10-27T21:18:19Z)
- Rethinking KL Regularization in RLHF: From Value Estimation to Gradient Optimization [6.136585583991053]
Reinforcement Learning from Human Feedback (RLHF) leverages a Kullback-Leibler (KL) divergence loss to stabilize training and prevent overfitting. In methods such as GRPO, its implementation may be guided by principles from numerical value estimation.
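For reference, the per-token KL penalty in RLHF pipelines is usually a Monte Carlo estimate. The estimators below (often called k1 and k3) are the ones commonly discussed in this context, shown as a sketch rather than as the specific estimators the paper analyzes.

```python
import torch

def kl_estimators(logp, ref_logp):
    """Common per-token estimators of KL(pi || pi_ref) from tokens sampled from pi.
    logp, ref_logp: log-probabilities of the sampled tokens under pi and pi_ref."""
    log_ratio = logp - ref_logp
    k1 = log_ratio                                 # unbiased but high variance
    k3 = torch.exp(-log_ratio) - 1.0 + log_ratio   # lower variance, always >= 0 (GRPO-style)
    return k1, k3
```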
arXiv Detail & Related papers (2025-10-02T01:00:02Z)
- RePO: Understanding Preference Learning Through ReLU-Based Optimization [66.098833436503]
We propose ReLU-based Preference Optimization (RePO), a streamlined algorithm that eliminates $\beta$ via two advances. RePO is characterized as SimPO's limiting case ($\beta \to \infty$), where the logistic weighting collapses to binary thresholding. Empirical results on AlpacaEval 2 and Arena-Hard show that RePO outperforms DPO and SimPO across multiple base models.
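A sketch of the contrast drawn above, assuming SimPO-style margins: with a log-sigmoid loss, the per-pair gradient weight is $\sigma(-(\beta \cdot \text{margin} - \gamma))$, and letting $\beta \to \infty$ collapses that weight to a 0/1 threshold, which a ReLU-style loss implements directly. The exact RePO loss may differ from this sketch.

```python
import torch
import torch.nn.functional as F

def simpo_loss(margin, beta=2.0, gamma=1.0):
    """SimPO-style loss on a (length-normalized) log-prob margin."""
    return -F.logsigmoid(beta * margin - gamma).mean()

def relu_preference_loss(margin, gamma=1.0):
    """ReLU-style loss: only pairs whose margin falls below the target gamma
    contribute, mirroring the binary gradient weighting the logistic loss
    approaches as beta grows."""
    return torch.relu(gamma - margin).mean()
```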
arXiv Detail & Related papers (2025-03-10T15:11:07Z)
- Reinforcement Learning with Verifiable Rewards: GRPO's Effective Loss, Dynamics, and Success Amplification [10.617854230082896]
Group Relative Policy Optimization (GRPO) was recently introduced and used for promoting reasoning in LLMs under verifiable (binary) rewards. We analyze variants that differ in reward normalization (mean-only vs. mean + variance) and in how they regularize updates using the KL divergence.
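A sketch of the two reward-normalization variants contrasted above, for a group of verifiable (0/1) rewards attached to responses sampled for the same prompt; the KL regularizer would be added separately.

```python
import torch

def group_relative_advantages(rewards, normalize_variance=True, eps=1e-6):
    """Group-relative advantages for GRPO-style updates.
    rewards: tensor of shape (group_size,) with verifiable (e.g. 0/1) rewards
    for responses sampled from the same prompt."""
    centered = rewards - rewards.mean()              # mean-only normalization
    if normalize_variance:
        centered = centered / (rewards.std() + eps)  # mean + variance normalization
    return centered
```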
arXiv Detail & Related papers (2025-03-09T14:36:45Z)
- C2-DPO: Constrained Controlled Direct Preference Optimization [22.730518243326394]
Direct preference optimization (DPO) has emerged as a promising approach for solving the alignment problem in AI. We show that the DPO loss can be derived by starting from an alternative optimization problem that defines the KL guardrail only on in-sample responses.
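One way to make a "KL guardrail defined only on in-sample responses" concrete, offered purely as an illustration of the idea and not as the paper's derivation, is to renormalize both policies over the two observed responses and penalize the KL between the resulting two-point distributions.

```python
import torch

def in_sample_kl(logp_w, logp_l, ref_logp_w, ref_logp_l):
    """KL between the policy and the reference after restricting both to the
    two in-sample responses {winner, loser}. Illustrative sketch only; not
    necessarily the paper's formulation."""
    p = torch.softmax(torch.stack([logp_w, logp_l]), dim=0)
    q = torch.softmax(torch.stack([ref_logp_w, ref_logp_l]), dim=0)
    return (p * (p.log() - q.log())).sum()
```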
arXiv Detail & Related papers (2025-02-22T00:38:44Z)
- Towards a Sharp Analysis of Offline Policy Learning for $f$-Divergence-Regularized Contextual Bandits [49.96531901205305]
We analyze $f$-divergence-regularized offline policy learning. For the reverse Kullback-Leibler (KL) divergence, we give the first $\tilde{O}(\epsilon^{-1})$ sample complexity under single-policy concentrability. We extend our analysis to dueling bandits, and we believe these results take a significant step toward a comprehensive understanding of $f$-divergence-regularized policy learning.
arXiv Detail & Related papers (2025-02-09T22:14:45Z)
- From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function [50.812404038684505]
We show that we can derive DPO in the token-level MDP as a general inverse Q-learning algorithm, which satisfies the Bellman equation.
We discuss applications of our work, including information elicitation in multi-turn dialogue, reasoning, agentic applications and end-to-end training of multi-model systems.
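A sketch of the token-level view behind this result: the DPO implicit reward decomposes into per-token log-ratios, which can be read as per-token credit assignment. This is an illustration of the token-level MDP interpretation, not the paper's code.

```python
import torch

def per_token_implicit_rewards(token_logps, ref_token_logps, beta=0.1):
    """Per-token implicit rewards beta * (log pi - log pi_ref) for one response.
    Their sum equals the usual sequence-level DPO implicit reward, the quantity
    the token-level / Q-function interpretation decomposes."""
    per_token = beta * (token_logps - ref_token_logps)
    return per_token, per_token.sum()
```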
arXiv Detail & Related papers (2024-04-18T17:37:02Z)
- Theoretical guarantees on the best-of-n alignment policy [110.21094183592358]
We show that a commonly used analytical expression for the KL divergence between the best-of-$n$ policy and the reference policy is in fact an upper bound on the actual KL divergence. We propose a new estimator for the KL divergence and empirically show that it provides a tight approximation. We conclude by analyzing the tradeoffs between win rate and KL divergence of the best-of-$n$ alignment policy.
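For context, the analytical expression usually quoted for this KL divergence is $\log n - (n-1)/n$; per the summary above, the paper shows it is an upper bound rather than the exact value. A small sketch of best-of-$n$ selection and that bound, assuming a generic `reward_fn` scoring function:

```python
import math

def best_of_n(candidates, reward_fn):
    """Pick the highest-reward response among n samples from the reference policy."""
    return max(candidates, key=reward_fn)

def kl_upper_bound(n):
    """Analytical expression log(n) - (n-1)/n: an upper bound on
    KL(best-of-n policy || reference policy), not its exact value."""
    return math.log(n) - (n - 1) / n
```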
arXiv Detail & Related papers (2024-01-03T18:39:13Z)
- Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDPs with safe exploration in the function approximation setting.
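As background for the primal-dual approach named above, a generic Lagrangian CMDP template: the policy is trained on reward minus a multiplier times cost, and the multiplier rises when the constraint is violated. This is only the basic pattern, not the paper's specific algorithm, which adds function approximation and exploration guarantees on top.

```python
def primal_dual_step(lmbda, avg_cost, cost_limit, dual_lr=0.01):
    """One dual update of a Lagrangian CMDP scheme. The policy (primal) step
    would optimize the combined signal reward - lmbda * cost; here only the
    projected gradient-ascent update on the multiplier lmbda is sketched."""
    lmbda = max(0.0, lmbda + dual_lr * (avg_cost - cost_limit))
    return lmbda
```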
arXiv Detail & Related papers (2020-03-01T17:47:03Z)