Beyond Reverse KL: Generalizing Direct Preference Optimization with
Diverse Divergence Constraints
- URL: http://arxiv.org/abs/2309.16240v1
- Date: Thu, 28 Sep 2023 08:29:44 GMT
- Title: Beyond Reverse KL: Generalizing Direct Preference Optimization with
Diverse Divergence Constraints
- Authors: Chaoqi Wang, Yibo Jiang, Chenghao Yang, Han Liu, Yuxin Chen
- Abstract summary: The increasing capabilities of large language models (LLMs) raise opportunities for artificial general intelligence but amplify safety concerns.
RLHF has emerged as a promising pathway towards AI alignment but brings forth challenges due to its complexity and dependence on a separate reward model.
DPO has been proposed as an alternative, and it remains equivalent to RLHF under the reverse KL regularization constraint.
We show that under certain $f$-divergences, including Jensen-Shannon divergence, forward KL divergence and $\alpha$-divergences, the complex relationship between the reward and optimal policy can also be simplified by addressing the Karush-Kuhn-Tucker conditions.
- Score: 26.274786600234876
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The increasing capabilities of large language models (LLMs) raise
opportunities for artificial general intelligence but concurrently amplify
safety concerns, such as potential misuse of AI systems, necessitating
effective AI alignment. Reinforcement Learning from Human Feedback (RLHF) has
emerged as a promising pathway towards AI alignment but brings forth challenges
due to its complexity and dependence on a separate reward model. Direct
Preference Optimization (DPO) has been proposed as an alternative, and it
remains equivalent to RLHF under the reverse KL regularization constraint. This
paper presents $f$-DPO, a generalized approach to DPO by incorporating diverse
divergence constraints. We show that under certain $f$-divergences, including
Jensen-Shannon divergence, forward KL divergences and $\alpha$-divergences, the
complex relationship between the reward and optimal policy can also be
simplified by addressing the Karush-Kuhn-Tucker conditions. This eliminates the
need for estimating the normalizing constant in the Bradley-Terry model and
enables a tractable mapping between the reward function and the optimal policy.
Our approach optimizes LLMs to align with human preferences in a more efficient
and supervised manner under a broad set of divergence constraints. Empirically,
adopting these divergences ensures a balance between alignment performance and
generation diversity. Importantly, $f$-DPO outperforms PPO-based methods in
divergence efficiency, and divergence constraints directly influence expected
calibration error (ECE).
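To make the reward-policy simplification concrete, below is a minimal sketch of an $f$-DPO-style pairwise loss. It is not the paper's reference implementation: the function and parameter names (`f_prime`, `f_dpo_loss`, `beta`, `divergence`) are illustrative, constants that cancel in the pairwise difference are dropped, and the $\alpha$-divergence case is omitted because its exact parametrization depends on the paper's definition. The key idea it illustrates is that the usual DPO log-ratio term is replaced by $f'(\pi_\theta/\pi_{\mathrm{ref}})$ for the chosen divergence generator $f$, while the per-prompt normalizer cancels inside the Bradley-Terry preference likelihood.

```python
import torch
import torch.nn.functional as F


def f_prime(log_ratio: torch.Tensor, divergence: str = "reverse_kl") -> torch.Tensor:
    """Derivative f'(u) of the divergence generator, evaluated at u = pi_theta / pi_ref.

    Takes log-ratios log(pi_theta / pi_ref) for numerical stability.
    Additive constants independent of u are dropped; they cancel in the pairwise loss.
    """
    u = log_ratio.exp()
    if divergence == "reverse_kl":   # f(u) = u log u      -> f'(u) = log u + 1
        return log_ratio             # recovers the standard DPO term
    if divergence == "forward_kl":   # f(u) = -log u       -> f'(u) = -1 / u
        return -torch.exp(-log_ratio)
    if divergence == "jsd":          # f(u) = u log u - (1 + u) log((1 + u) / 2)
        return torch.log(2 * u / (1 + u))   # f'(u) = log(2u / (1 + u))
    raise ValueError(f"unknown divergence: {divergence}")


def f_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, divergence="reverse_kl"):
    """Pairwise preference loss on chosen (w) / rejected (l) completions.

    logp_* are summed token log-probs under the policy; ref_logp_* under the
    frozen reference model. With divergence="reverse_kl" this reduces to DPO.
    """
    term_w = f_prime(logp_w - ref_logp_w, divergence)
    term_l = f_prime(logp_l - ref_logp_l, divergence)
    # Bradley-Terry likelihood of the preference; the normalizing constant cancels.
    return -F.logsigmoid(beta * (term_w - term_l)).mean()
```

With `divergence="reverse_kl"` the sketch coincides with ordinary DPO; switching to `"forward_kl"` or `"jsd"` changes only the transformation applied to the policy/reference ratio, which is how the abstract's trade-off between alignment performance and generation diversity enters the objective.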