Human Alignment of Large Language Models through Online Preference
Optimisation
- URL: http://arxiv.org/abs/2403.08635v1
- Date: Wed, 13 Mar 2024 15:47:26 GMT
- Title: Human Alignment of Large Language Models through Online Preference
Optimisation
- Authors: Daniele Calandriello, Daniel Guo, Remi Munos, Mark Rowland, Yunhao
Tang, Bernardo Avila Pires, Pierre Harvey Richemond, Charline Le Lan, Michal
Valko, Tianqi Liu, Rishabh Joshi, Zeyu Zheng, Bilal Piot
- Abstract summary: We show the equivalence between two recent alignment methods, namely Identity Policy Optimisation (IPO) and Nash Mirror Descent (Nash-MD).
This equivalence can be proven when we consider the online version of IPO, that is when both generations are sampled by the online policy and annotated by a trained preference model.
We introduce the IPO-MD algorithm, which generates data with a mixture policy (between the online and reference policies), similarly to the general Nash-MD algorithm.
- Score: 50.52545798589968
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ensuring alignment of language models' outputs with human preferences is
critical to guarantee a useful, safe, and pleasant user experience. Thus, human
alignment has been extensively studied recently and several methods such as
Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimisation
(DPO) and Sequence Likelihood Calibration (SLiC) have emerged. In this paper,
our contribution is two-fold. First, we show the equivalence between two recent
alignment methods, namely Identity Policy Optimisation (IPO) and Nash Mirror
Descent (Nash-MD). Second, we introduce a generalisation of IPO, named IPO-MD,
that leverages the regularised sampling approach proposed by Nash-MD.
This equivalence may seem surprising at first sight, since IPO is an offline
method whereas Nash-MD is an online method using a preference model. However,
this equivalence can be proven when we consider the online version of IPO, that
is when both generations are sampled by the online policy and annotated by a
trained preference model. Optimising the IPO loss with such a stream of data
then becomes equivalent to finding the Nash equilibrium of the preference model
through self-play. Building on this equivalence, we introduce the IPO-MD
algorithm that generates data with a mixture policy (between the online and
reference policies), similarly to the general Nash-MD algorithm. We compare
online-IPO and IPO-MD to different online versions of existing losses on
preference data such as DPO and SLiC on a summarisation task.
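The ingredients named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the standard squared form of the IPO loss with a regularisation parameter `tau`, and it assumes the mixture policy is a normalised geometric mixture with weight `beta`; the abstract only says "mixture policy", so that choice is an assumption. The function names, the toy response set, and the `prefers_first` stand-in for a trained preference model are all hypothetical.

```python
import random


def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, tau=0.1):
    """Squared IPO loss for one pair, where w is preferred over l (assumed form)."""
    # h is the gap between the policy/reference log-ratios of the two responses.
    h = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # The loss regresses h towards the target margin 1 / (2 * tau).
    return (h - 1.0 / (2.0 * tau)) ** 2


def geometric_mixture(policy_probs, ref_probs, beta=0.5):
    """Normalised geometric mixture policy^(1 - beta) * ref^beta (assumed form)."""
    weights = [p ** (1.0 - beta) * q ** beta
               for p, q in zip(policy_probs, ref_probs)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_pair(responses, probs, prefers_first, rng):
    """Draw two responses from `probs` and order them as (preferred, rejected).

    `prefers_first` is a hypothetical stand-in for a trained preference model.
    """
    y1, y2 = rng.choices(responses, weights=probs, k=2)
    return (y1, y2) if prefers_first(y1, y2) else (y2, y1)


if __name__ == "__main__":
    rng = random.Random(0)
    responses = ["short", "a longer summary", "mid one"]
    policy = [0.6, 0.3, 0.1]
    reference = [0.2, 0.3, 0.5]
    # beta = 0 recovers plain online sampling; beta > 0 pulls towards the reference.
    sampling_probs = geometric_mixture(policy, reference, beta=0.25)
    winner, loser = sample_pair(
        responses, sampling_probs, lambda a, b: len(a) >= len(b), rng
    )
    print(winner, "|", loser)
```

In this sketch, setting `beta=0` corresponds to the online-IPO data stream, where both generations come from the online policy, and `beta>0` corresponds to sampling in the style of IPO-MD.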
Related papers
- The Importance of Online Data: Understanding Preference Fine-tuning via Coverage [25.782644676250115]
We study the similarities and differences between online and offline techniques for preference fine-tuning.
We prove that a global coverage condition is both necessary and sufficient for offline contrastive methods to converge to the optimal policy.
We derive a hybrid preference optimization algorithm that uses offline data for contrastive-based preference optimization and online data for KL regularization.
arXiv Detail & Related papers (2024-06-03T15:51:04Z)
- D2PO: Discriminator-Guided DPO with Response Evaluation Models [63.71853401569461]
We propose D2PO, discriminator-guided DPO, for the online setting where preferences are being collected throughout learning.
As we collect gold preferences, we use these not only to train our policy, but to train a discriminative response evaluation model to silver-label even more synthetic data for policy training.
We show conditions under which silver labeling is most helpful: it is most effective when training the policy with DPO, outperforming traditional PPO, and benefits from maintaining a separate discriminator from the policy model.
arXiv Detail & Related papers (2024-05-02T17:44:41Z)
- Self-Play Preference Optimization for Language Model Alignment [75.83359213697854]
Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences.
We propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game.
Our approach, dubbed Self-Play Preference Optimization (SPPO), approximates the Nash equilibrium through iterative policy updates.
arXiv Detail & Related papers (2024-05-01T17:59:20Z)
- From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function [50.812404038684505]
Reinforcement Learning From Human Feedback (RLHF) has been critical to the success of the latest generation of generative AI models.
Direct Preference Optimization (DPO) has emerged as an alternative approach.
DPO solves the same objective as the standard RLHF setup, but there is a mismatch between the two approaches.
arXiv Detail & Related papers (2024-04-18T17:37:02Z)
- Token-level Direct Preference Optimization [8.249403373337024]
Fine-tuning pre-trained Large Language Models is essential to align them with human values and intentions.
We introduce Token-level Direct Preference Optimization (TDPO), a novel approach to align LLMs with human preferences by optimizing policy at the token level.
arXiv Detail & Related papers (2024-04-18T08:49:38Z)
- RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models [7.676477609461592]
Reinforcement learning from human feedback (RLHF) has been extensively employed to align large language models with user intent.
DPO relies on contrastive responses generated by human annotators and an alternative LLM, instead of the policy model.
In this paper, we address both challenges by systematically combining rejection sampling (RS) and DPO.
Our proposed method effectively fine-tunes LLMs in limited-resource environments, leading to improved alignment with user intent.
arXiv Detail & Related papers (2024-02-15T16:00:58Z)
- Towards Efficient Exact Optimization of Language Model Alignment [93.39181634597877]
Direct preference optimization (DPO) was proposed to directly optimize the policy from preference data.
We show that DPO, derived from the optimal solution of the problem, leads in practice to a compromised mean-seeking approximation of that optimal solution.
We propose efficient exact optimization (EXO) of the alignment objective.
arXiv Detail & Related papers (2024-02-01T18:51:54Z)
- Nash Learning from Human Feedback [86.09617990412941]
We introduce an alternative pipeline for the fine-tuning of large language models using pairwise human feedback.
We term this approach Nash learning from human feedback (NLHF).
We present a novel algorithmic solution, Nash-MD, founded on the principles of mirror descent.
arXiv Detail & Related papers (2023-12-01T19:26:23Z)
- Statistical Rejection Sampling Improves Preference Optimization [42.57245965632205]
We introduce a novel approach to source preference data from the target optimal policy using rejection sampling.
We also propose a unified framework that enhances the loss functions used in both Sequence Likelihood Calibration (SLiC) and Direct Preference Optimization (DPO) from a preference modeling standpoint.
arXiv Detail & Related papers (2023-09-13T01:07:25Z)