Fine-Tuning Language Models with Advantage-Induced Policy Alignment
- URL: http://arxiv.org/abs/2306.02231v3
- Date: Thu, 2 Nov 2023 22:47:14 GMT
- Title: Fine-Tuning Language Models with Advantage-Induced Policy Alignment
- Authors: Banghua Zhu, Hiteshi Sharma, Felipe Vieira Frujeri, Shi Dong,
Chenguang Zhu, Michael I. Jordan, Jiantao Jiao
- Abstract summary: We propose a novel algorithm for aligning large language models to human preferences.
We show that it consistently outperforms PPO in language tasks by a large margin.
We also provide a theoretical justification supporting the design of our loss function.
- Score: 80.96507425217472
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning from human feedback (RLHF) has emerged as a reliable
approach to aligning large language models (LLMs) to human preferences. Among
the plethora of RLHF techniques, proximal policy optimization (PPO) is one of
the most widely used methods. Despite its popularity, however, PPO may suffer from
mode collapse, instability, and poor sample efficiency. We show that these
issues can be alleviated by a novel algorithm that we refer to as
Advantage-Induced Policy Alignment (APA), which leverages a squared error loss
function based on the estimated advantages. We demonstrate empirically that APA
consistently outperforms PPO in language tasks by a large margin, when a
separate reward model is employed as the evaluator. In addition, compared with
PPO, APA offers a more stable form of control over the deviation from the
model's initial policy, ensuring that the model improves its performance
without collapsing to deterministic output. In addition to empirical results,
we also provide a theoretical justification supporting the design of our loss
function.
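The abstract describes APA's core ingredient only at a high level: a squared-error loss built from estimated advantages that keeps the updated policy close to the initial one. The following is a minimal sketch of such a squared-error, advantage-based objective, assuming the regression target is the initial policy's log-probability shifted by a scaled advantage; the temperature `lam` and the exact form of the target are assumptions for illustration, not details taken from this page.

```python
# Hedged sketch of a squared-error, advantage-induced alignment loss.
# Assumption: log pi_theta(a|s) is regressed toward log pi_init(a|s) + A(s,a)/lam,
# i.e. toward the KL-regularized target policy pi_init(a|s) * exp(A(s,a)/lam).
# This is an illustration, not the authors' released implementation.
import torch

def advantage_induced_loss(logp_policy: torch.Tensor,  # log pi_theta(a|s), shape [B]
                           logp_init: torch.Tensor,    # log pi_init(a|s), frozen initial policy, shape [B]
                           advantages: torch.Tensor,   # estimated advantages A(s, a), shape [B]
                           lam: float = 1.0) -> torch.Tensor:
    target = logp_init.detach() + advantages.detach() / lam
    return torch.mean((logp_policy - target) ** 2)
```

Because the regression target stays anchored to the initial policy's log-probabilities, the squared penalty itself bounds how far the fine-tuned policy can drift, which is one way to read the abstract's claim of more stable control over deviation from the initial policy.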
Related papers
- Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF [82.7679132059169]
Reinforcement learning from human feedback has emerged as a central tool for language model alignment.
We propose a new algorithm for online exploration in RLHF, Exploratory Preference Optimization (XPO).
XPO enjoys the strongest known provable guarantees and promising empirical performance.
arXiv Detail & Related papers (2024-05-31T17:39:06Z)
- Direct Alignment of Language Models via Quality-Aware Self-Refinement [31.845241241178982]
We investigate using the intrinsic knowledge of the LLM being fine-tuned on the fly to estimate the relative quality of responses and to refine the loss function.
We show that the constructed refinement function can help self-refine the loss function under mild assumptions.
Experiments indicate that the resulting methods can improve the performance of the fine-tuned models over DPO and IPO.
arXiv Detail & Related papers (2024-05-31T17:31:18Z)
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss (a sketch of such a combined objective appears after this list).
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- DPO Meets PPO: Reinforced Token Optimization for RLHF [36.97894955691627]
We introduce a framework that models the RLHF problem as a Markov decision process (MDP).
Under this framework, we introduce an algorithm, dubbed Reinforced Token Optimization (RTO), which learns the token-wise reward function from preference data.
For its practical implementation, RTO integrates Direct Preference Optimization (DPO) and Proximal Policy Optimization (PPO); one reading of the resulting token-wise reward is sketched after this list.
arXiv Detail & Related papers (2024-04-29T17:58:30Z)
- Disentangling Length from Quality in Direct Preference Optimization [93.74831404396174]
Reinforcement Learning from Human Feedback (RLHF) has been a crucial component in the recent success of Large Language Models.
RLHF is known to exploit biases in human preferences, such as verbosity.
We develop a principled but simple regularization strategy that prevents length exploitation, while still maintaining improvements in model quality.
arXiv Detail & Related papers (2024-03-28T06:03:47Z)
- Statistical Rejection Sampling Improves Preference Optimization [42.57245965632205]
We introduce a novel approach to source preference data from the target optimal policy using rejection sampling.
We also propose a unified framework that enhances the loss functions used in both Sequence Likelihood Calibration (SLiC) and Direct Preference Optimization (DPO) from a preference modeling standpoint.
arXiv Detail & Related papers (2023-09-13T01:07:25Z)
- Secrets of RLHF in Large Language Models Part I: PPO [81.01936993929127]
Large language models (LLMs) have formulated a blueprint for the advancement of artificial general intelligence.
Reinforcement learning with human feedback (RLHF) emerges as the pivotal technological paradigm underpinning this pursuit.
In this report, we dissect the framework of RLHF, re-evaluate the inner workings of PPO, and explore how the parts comprising PPO algorithms impact policy agent training.
arXiv Detail & Related papers (2023-07-11T01:55:24Z)
- Minimax Model Learning [42.65032356835701]
We present a novel off-policy loss function for learning a transition model in model-based reinforcement learning.
Our loss is derived from the off-policy policy evaluation objective with an emphasis on correcting distribution shift.
arXiv Detail & Related papers (2021-03-02T23:16:36Z)
- Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs.
The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
arXiv Detail & Related papers (2020-10-21T17:14:31Z)
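As referenced in the entry on mitigating overoptimization above, the stated objective combines a preference optimization loss with a supervised learning loss. The snippet below is a minimal sketch under the assumption that the preference loss is DPO-style and the supervised term is the negative log-likelihood of the preferred response; `beta` and the weighting `eta` are illustrative hyperparameters, not values from that paper.

```python
# Hedged sketch: preference-optimization loss plus an SFT regularizer.
# Assumptions: DPO-style pairwise loss; SFT term = NLL on the chosen response.
import torch
import torch.nn.functional as F

def preference_plus_sft_loss(logp_chosen, logp_rejected,          # sequence log-probs under pi_theta
                             ref_logp_chosen, ref_logp_rejected,  # sequence log-probs under the frozen reference
                             beta: float = 0.1, eta: float = 1.0) -> torch.Tensor:
    # DPO-style pairwise preference loss on (chosen, rejected) log-ratio margins.
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    pref_loss = -F.logsigmoid(margin).mean()
    # Supervised term: negative log-likelihood of the preferred (chosen) response.
    sft_loss = -logp_chosen.mean()
    return pref_loss + eta * sft_loss
```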
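Similarly, for the "DPO Meets PPO" entry, one natural reading of learning a token-wise reward from preference data and then integrating DPO with PPO is to use the per-token log-ratio between a DPO-trained policy and the reference policy as a dense reward that a standard PPO loop can optimize. The sketch below encodes only that assumption and is not that paper's implementation.

```python
# Hedged sketch: dense per-token rewards derived from a DPO-trained policy.
# Assumption: r(s_t, a_t) = beta * (log pi_dpo(a_t|s_t) - log pi_ref(a_t|s_t)),
# which a standard PPO loop could then optimize token by token.
import torch

def tokenwise_rewards_from_dpo(logp_dpo_tokens: torch.Tensor,  # per-token log-probs under the DPO model, shape [B, T]
                               logp_ref_tokens: torch.Tensor,  # per-token log-probs under the reference model, shape [B, T]
                               beta: float = 0.1) -> torch.Tensor:
    return beta * (logp_dpo_tokens - logp_ref_tokens)
```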
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.