Policy Optimization in RLHF: The Impact of Out-of-preference Data
- URL: http://arxiv.org/abs/2312.10584v2
- Date: Sun, 25 Feb 2024 19:15:26 GMT
- Title: Policy Optimization in RLHF: The Impact of Out-of-preference Data
- Authors: Ziniu Li, Tian Xu, Yang Yu
- Abstract summary: This paper examines two popular alignment methods: Direct Preference Optimization (DPO) and Reward-Model-Based Policy Optimization (RMB-PO).
A variant of RMB-PO, referred to as RMB-PO+, is also considered.
In particular, compared with DPO, RMB-PO additionally uses policy-generated data, and RMB-PO+ further leverages new, preference-free data.
- Score: 17.126977660436225
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Aligning intelligent agents with human preferences and values is important.
This paper examines two popular alignment methods: Direct Preference
Optimization (DPO) and Reward-Model-Based Policy Optimization (RMB-PO). A
variant of RMB-PO, referred to as RMB-PO+, is also considered. These methods,
either explicitly or implicitly, learn a reward model from preference data and
differ in the data used for policy optimization to unlock the generalization
ability of the reward model. In particular, compared with DPO, RMB-PO
additionally uses policy-generated data, and RMB-PO+ further leverages new,
preference-free data. We examine the impact of such out-of-preference data. Our
study, conducted through controlled and synthetic experiments, demonstrates
that DPO performs poorly, whereas RMB-PO+ performs the best. In particular,
even when providing the policy model with a good feature representation, we
find that policy optimization with adequate out-of-preference data
significantly improves performance by harnessing the reward model's
generalization capabilities.
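
To make the comparison in the abstract concrete, the following is a minimal sketch, not the authors' code. `dpo_loss` is the standard per-pair DPO objective, and `optimization_data` encodes one reading of the abstract about which data each method uses during policy optimization. The function names, the `beta` default, and the split into "prompts" and "response source" are assumptions made for illustration.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence-level log-probabilities under the policy and under
    the frozen reference model. beta=0.1 is an illustrative default.
    """
    # Implicit reward margin: beta times the difference of log-ratios.
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(margin)), computed stably as softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def optimization_data(method, preference_prompts, new_prompts):
    """Return (prompts, response_source) used for policy optimization.

    Hypothetical helper: this mapping is an interpretation of the abstract,
    not an interface defined by the paper.
    """
    if method == "DPO":
        # Only the preference dataset itself is used.
        return list(preference_prompts), "preference_data"
    if method == "RMB-PO":
        # Responses are generated by the policy and scored by the reward model.
        return list(preference_prompts), "policy_samples"
    if method == "RMB-PO+":
        # Additionally draws on new prompts that carry no preference labels.
        return list(preference_prompts) + list(new_prompts), "policy_samples"
    raise ValueError(f"unknown method: {method}")
```

With identical policy and reference log-probabilities the margin is zero and the loss equals log 2; the loss shrinks as the policy raises the chosen response relative to the rejected one. Only the RMB-PO variants evaluate the reward model on data outside the preference set, which is the difference the paper studies.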
Related papers
- Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [56.24431208419858]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset.
We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset.
arXiv Detail & Related papers (2024-10-10T16:01:51Z)
- Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC.
We increase the consistency and informativeness of the pairwise preference signals through targeted modifications.
We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z)
- WPO: Enhancing RLHF with Weighted Preference Optimization [40.07940023654452]
Reinforcement learning from human feedback (RLHF) is a promising solution to align large language models (LLMs) more closely with human values.
Off-policy preference optimization often suffers from a distributional gap between the policy used for data collection and the target policy, leading to suboptimal optimization.
We propose a novel strategy to mitigate this problem by simulating on-policy learning with off-policy preference data.
arXiv Detail & Related papers (2024-06-17T17:59:13Z)
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- D2PO: Discriminator-Guided DPO with Response Evaluation Models [63.71853401569461]
We propose D2PO, discriminator-guided DPO, for the online setting where preferences are being collected throughout learning.
As we collect gold preferences, we use these not only to train our policy, but to train a discriminative response evaluation model to silver-label even more synthetic data for policy training.
We show conditions under which silver labeling is most helpful: it is most effective when training the policy with DPO, outperforming traditional PPO, and benefits from maintaining a separate discriminator from the policy model.
arXiv Detail & Related papers (2024-05-02T17:44:41Z)
- Towards Efficient Exact Optimization of Language Model Alignment [93.39181634597877]
Direct preference optimization (DPO) was proposed to directly optimize the policy from preference data.
We show that DPO, derived from the optimal solution of the problem, leads to a compromised mean-seeking approximation of the optimal solution in practice.
We propose efficient exact optimization (EXO) of the alignment objective.
arXiv Detail & Related papers (2024-02-01T18:51:54Z)
- Preference as Reward, Maximum Preference Optimization with Importance Sampling [3.7040071165219595]
We propose a simple and intuitive off-policy preference optimization algorithm from an importance sampling view, which we call Maximum Preference Optimization (MPO).
MPO achieves the best of both worlds by combining the objectives of RLHF and IPO while being an off-policy algorithm.
arXiv Detail & Related papers (2023-12-27T06:34:54Z)
- Statistical Rejection Sampling Improves Preference Optimization [42.57245965632205]
We introduce a novel approach to source preference data from the target optimal policy using rejection sampling.
We also propose a unified framework that enhances the loss functions used in both Sequence Likelihood Calibration (SLiC) and Direct Preference Optimization (DPO) from a preference modeling standpoint.
arXiv Detail & Related papers (2023-09-13T01:07:25Z)
- On Effective Scheduling of Model-based Reinforcement Learning [53.027698625496015]
We propose a framework named AutoMBPO to automatically schedule the real data ratio.
In this paper, we first theoretically analyze the role of real data in policy training, which suggests that gradually increasing the ratio of real data yields better performance.
arXiv Detail & Related papers (2021-11-16T15:24:59Z)