Beyond Reverse KL: Generalizing Direct Preference Optimization with
Diverse Divergence Constraints
- URL: http://arxiv.org/abs/2309.16240v1
- Date: Thu, 28 Sep 2023 08:29:44 GMT
- Title: Beyond Reverse KL: Generalizing Direct Preference Optimization with
Diverse Divergence Constraints
- Authors: Chaoqi Wang, Yibo Jiang, Chenghao Yang, Han Liu, Yuxin Chen
- Abstract summary: The increasing capabilities of large language models (LLMs) raise opportunities for artificial general intelligence but amplify safety concerns.
RLHF has emerged as a promising pathway towards AI alignment but brings forth challenges due to its complexity and dependence on a separate reward model.
DPO has been proposed as an alternative, and it remains equivalent to RLHF under the reverse KL regularization constraint.
We show that under certain $f$-divergences, including Jensen-Shannon divergence, forward KL divergences and $alpha$-divergences, the complex relationship between the reward and optimal policy can also be simplified
- Score: 26.274786600234876
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The increasing capabilities of large language models (LLMs) raise
opportunities for artificial general intelligence but concurrently amplify
safety concerns, such as potential misuse of AI systems, necessitating
effective AI alignment. Reinforcement Learning from Human Feedback (RLHF) has
emerged as a promising pathway towards AI alignment but brings forth challenges
due to its complexity and dependence on a separate reward model. Direct
Preference Optimization (DPO) has been proposed as an alternative, and it
remains equivalent to RLHF under the reverse KL regularization constraint. This
paper presents $f$-DPO, a generalized approach to DPO by incorporating diverse
divergence constraints. We show that under certain $f$-divergences, including
Jensen-Shannon divergence, forward KL divergences and $\alpha$-divergences, the
complex relationship between the reward and optimal policy can also be
simplified by addressing the Karush-Kuhn-Tucker conditions. This eliminates the
need for estimating the normalizing constant in the Bradley-Terry model and
enables a tractable mapping between the reward function and the optimal policy.
Our approach optimizes LLMs to align with human preferences in a more efficient
and supervised manner under a broad set of divergence constraints. Empirically,
adopting these divergences ensures a balance between alignment performance and
generation diversity. Importantly, $f$-DPO outperforms PPO-based methods in
divergence efficiency, and divergence constraints directly influence expected
calibration error (ECE).
Related papers
- Entropy Controllable Direct Preference Optimization [3.536605202672355]
We propose a simple modification to DPO, H-DPO, which allows for control over the entropy of the resulting policy.
In our experiments, we show that H-DPO outperformed DPO across various tasks, demonstrating superior results in pass@$k$ evaluations for mathematical tasks.
arXiv Detail & Related papers (2024-11-12T07:09:44Z) - SALSA: Soup-based Alignment Learning for Stronger Adaptation in RLHF [22.88031166401938]
This paper presents SALSA, a novel approach designed to overcome limitations by creating a more flexible and better located reference model.
We show that SALSA fosters better exploration, achieving higher rewards and improving model robustness, out-of-distribution, and performance.
arXiv Detail & Related papers (2024-11-04T04:53:43Z) - Hierarchical Preference Optimization: Learning to achieve goals via feasible subgoals prediction [71.81851971324187]
This work introduces Hierarchical Preference Optimization (HPO), a novel approach to hierarchical reinforcement learning (HRL)
HPO addresses non-stationarity and infeasible subgoal generation issues when solving complex robotic control tasks.
Experiments on challenging robotic navigation and manipulation tasks demonstrate impressive performance of HPO, where it shows an improvement of up to 35% over the baselines.
arXiv Detail & Related papers (2024-11-01T04:58:40Z) - $α$-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs [45.46582930202524]
$alpha$-DPO is an adaptive preference optimization algorithm for large language models.
It balances the policy model and the reference model to achieve personalized reward margins.
It consistently outperforms DPO and SimPO across various model settings.
arXiv Detail & Related papers (2024-10-14T04:29:57Z) - Joint Demonstration and Preference Learning Improves Policy Alignment with Human Feedback [58.049113055986375]
We develop a single stage approach named Alignment with Integrated Human Feedback (AIHF) to train reward models and the policy.
The proposed approach admits a suite of efficient algorithms, which can easily reduce to, and leverage, popular alignment algorithms.
We demonstrate the efficiency of the proposed solutions with extensive experiments involving alignment problems in LLMs and robotic control problems in MuJoCo.
arXiv Detail & Related papers (2024-06-11T01:20:53Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization [55.97310586039358]
Diffusion models have garnered widespread attention in Reinforcement Learning (RL) for their powerful expressiveness and multimodality.
We propose a novel model-free diffusion-based online RL algorithm, Q-weighted Variational Policy Optimization (QVPO)
Specifically, we introduce the Q-weighted variational loss, which can be proved to be a tight lower bound of the policy objective in online RL under certain conditions.
We also develop an efficient behavior policy to enhance sample efficiency by reducing the variance of the diffusion policy during online interactions.
arXiv Detail & Related papers (2024-05-25T10:45:46Z) - Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC) that can be applied for either risk-seeking or risk-averse policy optimization.
arXiv Detail & Related papers (2023-12-07T15:55:58Z) - Permutation Invariant Policy Optimization for Mean-Field Multi-Agent
Reinforcement Learning: A Principled Approach [128.62787284435007]
We propose the mean-field proximal policy optimization (MF-PPO) algorithm, at the core of which is a permutation-invariant actor-critic neural architecture.
We prove that MF-PPO attains the globally optimal policy at a sublinear rate of convergence.
In particular, we show that the inductive bias introduced by the permutation-invariant neural architecture enables MF-PPO to outperform existing competitors.
arXiv Detail & Related papers (2021-05-18T04:35:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.