KL Penalty Control via Perturbation for Direct Preference Optimization
- URL: http://arxiv.org/abs/2502.13177v1
- Date: Tue, 18 Feb 2025 06:44:10 GMT
- Title: KL Penalty Control via Perturbation for Direct Preference Optimization
- Authors: Sangkyu Lee, Janghoon Han, Hosung Song, Stanley Jungkyu Choi, Honglak Lee, Youngjae Yu
- Abstract summary: We propose $\varepsilon$-Direct Preference Optimization ($\varepsilon$-DPO), which allows adaptive control of the KL penalty strength $\beta$ for each preference pair. Experimental results show that $\varepsilon$-DPO outperforms existing direct alignment algorithms.
- Score: 53.67494512877768
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Direct Preference Optimization (DPO) demonstrates the advantage of aligning a large language model with human preference using only an offline dataset. However, DPO has the limitation that the KL penalty, which prevents excessive deviation from the reference model, is static throughout the training process. Several methods try to turn this static KL penalty into a dynamic one, but no approach can adaptively assign different KL penalties for each preference pair. In this paper, we propose $\varepsilon$-Direct Preference Optimization ($\varepsilon$-DPO), which allows adaptive control of the KL penalty strength $\beta$ for each preference pair. Specifically, $\varepsilon$-DPO adaptively controls $\beta$ for each preference pair based on the monotonicity of logits as a preference model under the perturbation of $\beta$ during training by simply reusing the logit of the current policy and the reference policy. Experimental results show that $\varepsilon$-DPO outperforms existing direct alignment algorithms and KL penalty relaxation methods on general chatbot benchmarks, highlighting the significance of adaptive KL penalty relaxation at the instance-level in DPO.
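To make the interface concrete, here is a minimal sketch (not the authors' code) of a DPO-style loss in PyTorch that accepts a per-preference-pair $\beta$ tensor instead of a single scalar, which is what an instance-level scheme like $\varepsilon$-DPO requires; the `choose_beta` helper is a hypothetical placeholder for the paper's perturbation-based criterion, which the abstract says reuses the logits of the current and reference policies.

```python
import torch
import torch.nn.functional as F

def dpo_loss_per_pair_beta(policy_chosen_logps, policy_rejected_logps,
                           ref_chosen_logps, ref_rejected_logps, beta):
    """Standard DPO loss, except `beta` is a (batch,) tensor: one KL-penalty
    strength per preference pair rather than a single global scalar."""
    # Log-ratios of the current policy against the frozen reference policy.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # DPO logit: beta-scaled margin between chosen and rejected log-ratios.
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()

def choose_beta(base_beta, chosen_ratio, rejected_ratio, eps=0.1):
    """Hypothetical placeholder: an adaptive rule would scale `base_beta` up or
    down per pair (e.g. by a factor of 1 +/- eps) using the cached log-ratios;
    here it simply returns the constant value, i.e. it reduces to vanilla DPO."""
    return torch.full_like(chosen_ratio, base_beta)
```

With `beta = choose_beta(0.1, chosen_ratio, rejected_ratio)` the loss reduces to ordinary DPO with $\beta = 0.1$; swapping in the paper's criterion only changes how that per-pair tensor is filled.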
Related papers
- RePO: ReLU-based Preference Optimization [47.87283407390014]
We propose ReLU-based Preference Optimization (RePO), a streamlined algorithm that eliminates $\beta$ via two advances.
RePO is characterized as SimPO's limiting case ($\beta \to \infty$), where the logistic weighting collapses to binary thresholding.
Empirical results on AlpacaEval 2 and Arena-Hard show that RePO outperforms DPO and SimPO across multiple base models.
arXiv Detail & Related papers (2025-03-10T15:11:07Z) - $Q\sharp$: Provably Optimal Distributional RL for LLM Post-Training [60.01594991938747]
$Q\sharp$ is a value-based algorithm for KL-regularized RL that guides the reference policy using the optimal regularized $Q$ function.
Our results highlight $Q\sharp$ as an effective approach for post-training LLMs, offering both improved performance and theoretical guarantees.
arXiv Detail & Related papers (2025-02-27T21:43:00Z) - $α$-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs [45.46582930202524]
$\alpha$-DPO is an adaptive preference optimization algorithm for large language models.
It balances the policy model and the reference model to achieve personalized reward margins.
It consistently outperforms DPO and SimPO across various model settings.
arXiv Detail & Related papers (2024-10-14T04:29:57Z) - Correcting the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared Preference Optimization [78.82586283794886]
$\chi^2$-Preference Optimization ($\chi$PO) is an efficient offline alignment algorithm provably robust to overoptimization. $\chi$PO implements the principle of pessimism in the face of uncertainty via regularization. $\chi$PO's simplicity and strong guarantees make it the first practical and general-purpose offline alignment algorithm provably robust to overoptimization.
arXiv Detail & Related papers (2024-07-18T11:08:40Z) - Self-Play Preference Optimization for Language Model Alignment [75.83359213697854]
Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences.
We propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game.
Our approach, dubbed Self-Play Preference Optimization (SPPO), utilizes iterative policy updates to provably approximate the Nash equilibrium.
arXiv Detail & Related papers (2024-05-01T17:59:20Z) - Token-level Direct Preference Optimization [8.249403373337024]
Fine-tuning pre-trained Large Language Models is essential to align them with human values and intentions.
We introduce Token-level Direct Preference Optimization (TDPO), a novel approach to align LLMs with human preferences by optimizing policy at the token level.
arXiv Detail & Related papers (2024-04-18T08:49:38Z) - Provably Robust DPO: Aligning Language Models with Noisy Feedback [10.523790076060171]
We introduce a general framework for policy optimization in the presence of random preference flips.
We design a novel loss function, which de-biases the effect of noise on average, making a policy trained by minimizing that loss robust to the noise.
Our experiments on IMDb sentiment generation and Anthropic's helpful-harmless dataset show that rDPO is robust to noise in preference labels compared to vanilla DPO.
arXiv Detail & Related papers (2024-03-01T09:55:18Z) - Direct Preference Optimization with an Offset [58.7977683502207]
Direct preference optimization (DPO) is a successful strategy for aligning large language models with human preferences.
We propose a generalization of DPO, termed DPO with an offset (ODPO), that does not treat every preference pair equally during fine-tuning.
arXiv Detail & Related papers (2024-02-16T10:55:38Z) - Theoretical guarantees on the best-of-n alignment policy [110.21094183592358]
We show that the commonly used analytical expression for the KL divergence between the best-of-$n$ policy and the reference policy, $\log n - \frac{n-1}{n}$, is an upper bound on the actual KL divergence (the bound is restated after this list). We also propose a new estimator for the KL divergence and empirically show that it provides a tight approximation. We conclude by analyzing the tradeoffs between win rate and KL divergence of the best-of-$n$ alignment policy.
arXiv Detail & Related papers (2024-01-03T18:39:13Z) - Statistical Rejection Sampling Improves Preference Optimization [42.57245965632205]
We introduce a novel approach to source preference data from the target optimal policy using rejection sampling.
We also propose a unified framework that enhances the loss functions used in both Sequence Likelihood Calibration (SLiC) and Direct Preference Optimization (DPO) from a preference modeling standpoint.
arXiv Detail & Related papers (2023-09-13T01:07:25Z)
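As a quick reference for the best-of-$n$ entry above (2024-01-03), the widely cited closed-form expression can be written out as below; per that paper it upper-bounds, rather than equals, the true KL divergence.

```latex
% Widely cited closed form for best-of-n sampling; the paper above argues it
% is an upper bound on the true KL divergence, not an exact value in general.
\mathrm{KL}\!\left(\pi_{\text{best-of-}n} \,\middle\|\, \pi_{\mathrm{ref}}\right) \;\le\; \log n - \frac{n-1}{n}
```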
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.