Related papers: Entropy Controllable Direct Preference Optimization

Entropy Controllable Direct Preference Optimization

URL: http://arxiv.org/abs/2411.07595v1
Date: Tue, 12 Nov 2024 07:09:44 GMT
Title: Entropy Controllable Direct Preference Optimization
Authors: Motoki Omura, Yasuhiro Fujita, Toshiki Kataoka,
Abstract summary: We propose a simple modification to DPO, H-DPO, which allows for control over the entropy of the resulting policy. In our experiments, we show that H-DPO outperformed DPO across various tasks, demonstrating superior results in pass@$k$ evaluations for mathematical tasks.
Score: 3.536605202672355
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In the post-training of large language models (LLMs), Reinforcement Learning from Human Feedback (RLHF) is an effective approach to achieve generation aligned with human preferences. Direct Preference Optimization (DPO) allows for policy training with a simple binary cross-entropy loss without a reward model. The objective of DPO is regularized by reverse KL divergence that encourages mode-seeking fitting to the reference policy. Nonetheless, we indicate that minimizing reverse KL divergence could fail to capture a mode of the reference distribution, which may hurt the policy's performance. Based on this observation, we propose a simple modification to DPO, H-DPO, which allows for control over the entropy of the resulting policy, enhancing the distribution's sharpness and thereby enabling mode-seeking fitting more effectively. In our experiments, we show that H-DPO outperformed DPO across various tasks, demonstrating superior results in pass@$k$ evaluations for mathematical tasks. Moreover, H-DPO is simple to implement, requiring only minor modifications to the loss calculation of DPO, which makes it highly practical and promising for wide-ranging applications in the training of LLMs.

Related papers

Explicit Preference Optimization: No Need for an Implicit Reward Model [18.225409932618657]
Direct preference optimization (DPO) and its offshoots circumvent the need for a separate reward training step.<n>We show that DPO-based objectives are nonetheless subject to sub-optimal regularization and counter-intuitive artifacts.
arXiv Detail & Related papers (2025-06-09T07:11:01Z)
Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO [51.22869332661607]
We decompose the performance gap between reinforcement learning from human feedback and direct preference optimization under a representation gap.<n>We show that RLHF, DPO, or online DPO can outperform one another depending on the type of model mis-specifications.
arXiv Detail & Related papers (2025-05-26T09:54:02Z)
On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization [52.76330545825083]
Reinforcement learning (RL) has become popular in enhancing the reasoning capabilities of large language models (LLMs)<n>We identify a previously unrecognized phenomenon we term Lazy Likelihood Displacement (LLD), wherein the likelihood of correct responses marginally increases or even decreases during training.<n>We develop a method called NTHR, which downweights penalties on tokens contributing to the LLD. Unlike prior DPO-based approaches, NTHR takes advantage of GRPO's group-based structure, using correct responses as anchors to identify influential tokens.
arXiv Detail & Related papers (2025-05-24T18:58:51Z)
A Simple and Effective Reinforcement Learning Method for Text-to-Image Diffusion Fine-tuning [61.403275660120606]
Reinforcement learning (RL)-based fine-tuning has emerged as a powerful approach for aligning diffusion models with black-box objectives. We propose leave-one-out PPO (LOOP), a novel RL for diffusion fine-tuning method. Our results demonstrate that LOOP effectively improves diffusion models on various black-box objectives, and achieves a better balance between computational efficiency and performance.
arXiv Detail & Related papers (2025-03-02T13:43:53Z)
Gradient Imbalance in Direct Preference Optimization [26.964127989679596]
We propose Balanced-DPO, a simple yet effective modification to the DPO objective that introduces a computationally efficient gradient reweighting mechanism. Our experiments demonstrate the effectiveness of Balanced-DPO, validating the theoretical findings and confirming that addressing gradient imbalance is key to improving DPO's performance.
arXiv Detail & Related papers (2025-02-28T08:47:03Z)
Hierarchical Preference Optimization: Learning to achieve goals via feasible subgoals prediction [71.81851971324187]
This work introduces Hierarchical Preference Optimization (HPO), a novel approach to hierarchical reinforcement learning (HRL) HPO addresses non-stationarity and infeasible subgoal generation issues when solving complex robotic control tasks. Experiments on challenging robotic navigation and manipulation tasks demonstrate impressive performance of HPO, where it shows an improvement of up to 35% over the baselines.
arXiv Detail & Related papers (2024-11-01T04:58:40Z)
Uncertainty-Penalized Direct Preference Optimization [52.387088396044206]
We develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes. The penalization serves as a correction to the loss which attenuates the loss gradient for uncertain samples. We show improved overall performance compared to vanilla DPO, as well as better completions on prompts from high-uncertainty chosen/rejected responses.
arXiv Detail & Related papers (2024-10-26T14:24:37Z)
Simultaneous Reward Distillation and Preference Learning: Get You a Language Model Who Can Do Both [6.102274021710727]
This paper introduces DRDO (Direct Reward Distillation and policy-Optimization), which simultaneously models rewards and preferences to avoid degeneracy. Results on the Ultrafeedback and TL;DR datasets demonstrate that DRDO-trained policies surpass methods such as DPO and e-DPO in terms of expected rewards.
arXiv Detail & Related papers (2024-10-11T02:19:11Z)
Minor DPO reject penalty to increase training robustness [8.971332948872185]
Learning from human preference is a paradigm used in large-scale language model (LLM) fine-tuning step to better align pretrained LLM to human preference for downstream task. Recently, Direct Preference Optimization (DPO) has been proposed to solve the alignment problem with a simplified RL-free method. In this article, we analyze the working mechanism of $beta$ in DPO, disclose its syntax difference between RL algorithm and DPO, and understand the potential shortage brought by the DPO simplification.
arXiv Detail & Related papers (2024-08-19T09:29:31Z)
Direct Alignment of Language Models via Quality-Aware Self-Refinement [31.845241241178982]
We investigate the use of intrinsic knowledge within the on-the-fly fine-tuning LLM to obtain relative qualities and help to refine the loss function. We show that the constructed refinement function can help self-refine the loss function under mild assumptions. Experiments indicate that they can improve the performance of the fine-tuned models over DPO and IPO.
arXiv Detail & Related papers (2024-05-31T17:31:18Z)
Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences. To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model. Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization [55.97310586039358]
Diffusion models have garnered widespread attention in Reinforcement Learning (RL) for their powerful expressiveness and multimodality. We propose a novel model-free diffusion-based online RL algorithm, Q-weighted Variational Policy Optimization (QVPO) Specifically, we introduce the Q-weighted variational loss, which can be proved to be a tight lower bound of the policy objective in online RL under certain conditions. We also develop an efficient behavior policy to enhance sample efficiency by reducing the variance of the diffusion policy during online interactions.
arXiv Detail & Related papers (2024-05-25T10:45:46Z)
D2PO: Discriminator-Guided DPO with Response Evaluation Models [63.71853401569461]
We propose D2PO, discriminator-guided DPO, for the online setting where preferences are being collected throughout learning. As we collect gold preferences, we use these not only to train our policy, but to train a discriminative response evaluation model to silver-label even more synthetic data for policy training. We show conditions under which silver labeling is most helpful: it is most effective when training the policy with DPO, outperforming traditional PPO, and benefits from maintaining a separate discriminator from the policy model.
arXiv Detail & Related papers (2024-05-02T17:44:41Z)
RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models [7.676477609461592]
Reinforcement learning from human feedback (RLHF) has been extensively employed to align large language models with user intent. DPO relies on contrastive responses generated from human annotator and alternative LLM, instead of the policy model. In this paper, we address both challenges by systematically combining sampling rejection (RS) and DPO. Our proposed method effectively fine-tunes LLMs with limited resource environments, leading to improved alignment with user intent.
arXiv Detail & Related papers (2024-02-15T16:00:58Z)
Statistical Rejection Sampling Improves Preference Optimization [42.57245965632205]
We introduce a novel approach to source preference data from the target optimal policy using rejection sampling. We also propose a unified framework that enhances the loss functions used in both Sequence Likelihood (SLiC) and Direct Preference Optimization (DPO) from a preference modeling standpoint.
arXiv Detail & Related papers (2023-09-13T01:07:25Z)
Fine-Tuning Language Models with Advantage-Induced Policy Alignment [80.96507425217472]
We propose a novel algorithm for aligning large language models to human preferences. We show that it consistently outperforms PPO in language tasks by a large margin. We also provide a theoretical justification supporting the design of our loss function.
arXiv Detail & Related papers (2023-06-04T01:59:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.