D2PO: Discriminator-Guided DPO with Response Evaluation Models
- URL: http://arxiv.org/abs/2405.01511v2
- Date: Wed, 7 Aug 2024 03:46:30 GMT
- Title: D2PO: Discriminator-Guided DPO with Response Evaluation Models
- Authors: Prasann Singhal, Nathan Lambert, Scott Niekum, Tanya Goyal, Greg Durrett
- Abstract summary: We propose D2PO, discriminator-guided DPO, for the online setting where preferences are being collected throughout learning.
As we collect gold preferences, we use these not only to train our policy, but to train a discriminative response evaluation model to silver-label even more synthetic data for policy training.
We show conditions under which silver labeling is most helpful: it is most effective when training the policy with DPO, outperforming traditional PPO, and benefits from maintaining a separate discriminator from the policy model.
- Score: 63.71853401569461
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Varied approaches for aligning language models have been proposed, including supervised fine-tuning, RLHF, and direct optimization methods such as DPO. Although DPO has rapidly gained popularity due to its straightforward training process and competitive results, there is an open question of whether there remain practical advantages of using a discriminator, like a reward model, to evaluate responses. We propose D2PO, discriminator-guided DPO, an approach for the online setting where preferences are being collected throughout learning. As we collect gold preferences, we use these not only to train our policy, but to train a discriminative response evaluation model to silver-label even more synthetic data for policy training. We explore this approach across a set of diverse tasks, including a realistic chat setting, and find that our approach leads to higher-quality outputs compared to DPO with the same data budget, and greater efficiency in terms of preference data requirements. Furthermore, we show conditions under which silver labeling is most helpful: it is most effective when training the policy with DPO, outperforming traditional PPO, and benefits from maintaining a separate discriminator from the policy model.
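To make the loop described in the abstract concrete, here is a minimal Python sketch of one online round in the spirit of D2PO: gold preferences train both the policy (via DPO) and a separate discriminator, which then silver-labels extra synthetic pairs for further DPO updates. All callables, budgets, and names below are hypothetical placeholders for illustration, not the authors' released implementation.

```python
import random
from typing import Callable, List, Tuple

PreferencePair = Tuple[str, str]  # (chosen response, rejected response)


def d2po_round(
    prompts: List[str],
    sample_responses: Callable[[str, int], List[str]],      # draw samples from the current policy
    gold_label: Callable[[str, str, str], PreferencePair],  # human (gold) preference oracle
    discriminator_score: Callable[[str, str], float],       # learned response evaluation model
    train_discriminator: Callable[[List[Tuple[str, PreferencePair]]], None],
    dpo_update: Callable[[List[Tuple[str, PreferencePair]]], None],
    gold_budget: int = 8,
    silver_budget: int = 32,
) -> None:
    """One online round: spend a small gold budget, then silver-label more data."""
    # 1) Collect a small batch of gold preferences on fresh policy samples.
    gold_data = []
    for prompt in random.sample(prompts, k=min(gold_budget, len(prompts))):
        a, b = sample_responses(prompt, 2)
        gold_data.append((prompt, gold_label(prompt, a, b)))

    # 2) Gold preferences train both the policy (via DPO) and the separate discriminator.
    dpo_update(gold_data)
    train_discriminator(gold_data)

    # 3) The discriminator silver-labels additional synthetic pairs,
    #    which feed further DPO updates at no extra gold-label cost.
    silver_data = []
    for prompt in random.sample(prompts, k=min(silver_budget, len(prompts))):
        a, b = sample_responses(prompt, 2)
        if discriminator_score(prompt, a) >= discriminator_score(prompt, b):
            silver_data.append((prompt, (a, b)))
        else:
            silver_data.append((prompt, (b, a)))
    dpo_update(silver_data)
```

Keeping the discriminator separate from the policy model mirrors the condition the abstract identifies as most helpful for silver labeling.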
Related papers
- Uncertainty-Penalized Direct Preference Optimization [52.387088396044206]
We develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes.
The penalization serves as a correction to the loss which attenuates the loss gradient for uncertain samples.
We show improved overall performance compared to vanilla DPO, as well as better completions on prompts from high-uncertainty chosen/rejected responses (see the uncertainty-weighting sketch after this list).
arXiv Detail & Related papers (2024-10-26T14:24:37Z)
- TIS-DPO: Token-level Importance Sampling for Direct Preference Optimization With Estimated Weights [73.9088920210495]
We propose a token-level importance sampling DPO objective named TIS-DPO that assigns importance weights to each token based on its reward.
TIS-DPO significantly outperforms various baseline methods on harmlessness and helpfulness alignment and summarization tasks (see the token-weighting sketch after this list).
arXiv Detail & Related papers (2024-10-06T04:03:00Z)
- Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC.
We increase the consistency and informativeness of the pairwise preference signals through targeted modifications.
We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z)
- Understanding Reference Policies in Direct Preference Optimization [50.67309013764383]
Direct Preference Optimization (DPO) has become a widely used training method for the instruction fine-tuning of large language models (LLMs).
This work explores an under-investigated aspect of DPO - its dependency on the reference model or policy.
arXiv Detail & Related papers (2024-07-18T17:08:10Z)
- Human Alignment of Large Language Models through Online Preference Optimisation [50.52545798589968]
We show the equivalence between two recent alignment methods, namely Identity Policy Optimisation (IPO) and Nash Mirror Descent (Nash-MD).
This equivalence can be proven when we consider the online version of IPO, that is, when both generations are sampled by the online policy and annotated by a trained preference model.
We introduce the IPO-MD algorithm, which generates data with a mixture policy (between the online and reference policy), similarly to the general Nash-MD algorithm.
arXiv Detail & Related papers (2024-03-13T15:47:26Z)
- RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models [7.676477609461592]
Reinforcement learning from human feedback (RLHF) has been extensively employed to align large language models with user intent.
DPO relies on contrastive responses generated from a human annotator and an alternative LLM, rather than from the policy model.
In this paper, we address both challenges by systematically combining rejection sampling (RS) and DPO.
Our proposed method effectively fine-tunes LLMs in limited-resource environments, leading to improved alignment with user intent.
arXiv Detail & Related papers (2024-02-15T16:00:58Z)
- Policy Optimization in RLHF: The Impact of Out-of-preference Data [17.126977660436225]
This paper examines two popular alignment methods: Direct Preference Optimization (DPO) and Reward-Model-Based Policy Optimization (RMB-PO).
A variant of RMB-PO, referred to as RMB-PO+, is also considered.
In particular, compared with DPO, RMB-PO additionally uses policy-generated data, and RMB-PO+ further leverages new, preference-free data.
arXiv Detail & Related papers (2023-12-17T02:14:15Z)
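Two of the related papers above describe mechanisms concrete enough to sketch. First, for the uncertainty-penalized DPO idea (attenuating the loss gradient for uncertain samples), one simple way to realize that behavior is to down-weight each pair's DPO loss by an uncertainty estimate. The actual penalization scheme in that paper may differ; the weighting below is an assumption for illustration only.

```python
import torch
import torch.nn.functional as F


def uncertainty_weighted_dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), shape (B,)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x), shape (B,)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x), shape (B,)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x), shape (B,)
    uncertainty: torch.Tensor,            # per-pair uncertainty estimate, shape (B,)
    beta: float = 0.1,
    penalty_coef: float = 1.0,
) -> torch.Tensor:
    # Standard DPO implicit-reward margin.
    margin = beta * (
        (policy_chosen_logps - ref_chosen_logps)
        - (policy_rejected_logps - ref_rejected_logps)
    )
    per_pair_loss = -F.logsigmoid(margin)

    # Pessimistic correction (illustrative): uncertain pairs receive a smaller
    # weight, which attenuates their contribution to the loss gradient.
    weights = 1.0 / (1.0 + penalty_coef * uncertainty)
    return (weights * per_pair_loss).mean()
```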
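Second, for TIS-DPO's token-level importance weighting, the sketch below scales each token's policy/reference log-ratio by an importance weight (assumed here to be derived from a per-token reward estimate) before forming the usual DPO comparison. This is one plausible reading of the summary above, not the exact TIS-DPO objective.

```python
import torch
import torch.nn.functional as F


def token_weighted_dpo_loss(
    chosen_logratios: torch.Tensor,    # per-token log pi_theta - log pi_ref for y_w, shape (B, T)
    rejected_logratios: torch.Tensor,  # per-token log pi_theta - log pi_ref for y_l, shape (B, T)
    chosen_weights: torch.Tensor,      # per-token importance weights for y_w, shape (B, T)
    rejected_weights: torch.Tensor,    # per-token importance weights for y_l, shape (B, T)
    chosen_mask: torch.Tensor,         # 1.0 for real response tokens, 0.0 for padding, shape (B, T)
    rejected_mask: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    # Each token contributes to the sequence-level log-ratio in proportion to
    # its importance weight, instead of uniformly as in vanilla DPO.
    chosen_term = (chosen_weights * chosen_logratios * chosen_mask).sum(dim=-1)
    rejected_term = (rejected_weights * rejected_logratios * rejected_mask).sum(dim=-1)
    return -F.logsigmoid(beta * (chosen_term - rejected_term)).mean()
```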