Human Alignment of Large Language Models through Online Preference
Optimisation
- URL: http://arxiv.org/abs/2403.08635v1
- Date: Wed, 13 Mar 2024 15:47:26 GMT
- Title: Human Alignment of Large Language Models through Online Preference
Optimisation
- Authors: Daniele Calandriello, Daniel Guo, Remi Munos, Mark Rowland, Yunhao
Tang, Bernardo Avila Pires, Pierre Harvey Richemond, Charline Le Lan, Michal
Valko, Tianqi Liu, Rishabh Joshi, Zeyu Zheng, Bilal Piot
- Abstract summary: We show the equivalence between two recent alignment methods, namely Identity Policy Optimisation (IPO) and Nash Mirror Descent (Nash-MD).
This equivalence can be proven when we consider the online version of IPO, that is, when both generations are sampled by the online policy and annotated by a trained preference model.
We introduce the IPO-MD algorithm, which generates data with a mixture policy (between the online and reference policy), similarly to the general Nash-MD algorithm.
- Score: 50.52545798589968
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Ensuring alignment of language models' outputs with human preferences is
critical to guarantee a useful, safe, and pleasant user experience. Thus, human
alignment has been extensively studied recently and several methods such as
Reinforcement Learning from Human Feedback (RLHF), Direct Policy Optimisation
(DPO) and Sequence Likelihood Calibration (SLiC) have emerged. In this paper,
our contribution is two-fold. First, we show the equivalence between two recent
alignment methods, namely Identity Policy Optimisation (IPO) and Nash Mirror
Descent (Nash-MD). Second, we introduce a generalisation of IPO, named IPO-MD,
that leverages the regularised sampling approach proposed by Nash-MD.
This equivalence may seem surprising at first sight, since IPO is an offline
method whereas Nash-MD is an online method using a preference model. However,
this equivalence can be proven when we consider the online version of IPO, that
is when both generations are sampled by the online policy and annotated by a
trained preference model. Optimising the IPO loss with such a stream of data
then becomes equivalent to finding the Nash equilibrium of the preference model
through self-play. Building on this equivalence, we introduce the IPO-MD
algorithm, which generates data with a mixture policy (between the online and
reference policy), similarly to the general Nash-MD algorithm. We compare
online-IPO and IPO-MD to different online versions of existing losses on
preference data such as DPO and SLiC on a summarisation task.
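For concreteness, here is a minimal sketch of the IPO regression loss that online IPO and IPO-MD optimise; the function name, the hard winner/loser annotation and the dummy inputs are illustrative assumptions rather than the paper's implementation.
```python
import torch

def ipo_loss(logp_policy_w, logp_policy_l, logp_ref_w, logp_ref_l, tau=0.1):
    """IPO regression loss for a batch of (winner, loser) generation pairs.

    logp_*: summed sequence log-probabilities under the current policy and the
    frozen reference policy (each a tensor of shape [batch]).
    tau: strength of the KL regularisation towards the reference policy.
    """
    # Log-likelihood-ratio margin between the preferred and dispreferred generations.
    h = (logp_policy_w - logp_ref_w) - (logp_policy_l - logp_ref_l)
    # IPO regresses this margin towards 1 / (2 * tau).
    return ((h - 1.0 / (2.0 * tau)) ** 2).mean()

# Example with dummy log-probabilities. In the online variant, both generations
# would be sampled from the current policy (or, for IPO-MD, from a mixture of the
# current and reference policies) and ranked by a trained preference model before
# this loss is applied.
loss = ipo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```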
Related papers
- Minor DPO reject penalty to increase training robustness [8.971332948872185]
Learning from human preferences is a paradigm used in the fine-tuning step of large language models (LLMs) to better align pretrained LLMs with human preferences on downstream tasks.
Recently, Direct Preference Optimization (DPO) has been proposed to solve the alignment problem with a simplified RL-free method.
In this article, we analyze the working mechanism of $\beta$ in DPO, highlight how its role differs between the underlying RL algorithm and DPO, and examine the potential shortcomings introduced by the DPO simplification.
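For context, the standard DPO objective is reproduced below; $\beta$ scales the implicit reward margin between the chosen and rejected responses (standard notation, not necessarily the article's own).
```latex
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
  \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]
```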
arXiv Detail & Related papers (2024-08-19T09:29:31Z)
- Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC.
We increase the consistency and informativeness of the pairwise preference signals through targeted modifications.
We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z)
- The Importance of Online Data: Understanding Preference Fine-tuning via Coverage [25.782644676250115]
We study the similarities and differences between online and offline techniques for preference fine-tuning.
We prove that a global coverage condition is both necessary and sufficient for offline contrastive methods to converge to the optimal policy.
We derive a hybrid preference optimization algorithm that uses offline data for contrastive preference optimization and online data for KL regularization.
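A minimal sketch of how such a hybrid objective could be assembled, assuming the offline pairs feed a DPO/IPO-style contrastive loss while fresh on-policy samples estimate the KL term; the names and the coefficient are illustrative, not the paper's API.
```python
import torch

def hybrid_step(contrastive_loss, logp_policy_online, logp_ref_online, kl_coef=0.05):
    """Combine an offline contrastive preference loss with an online KL penalty.

    contrastive_loss: DPO/IPO-style loss computed on offline preference pairs.
    logp_policy_online / logp_ref_online: log-probabilities of samples drawn from
    the current policy, evaluated under the policy and the reference model.
    """
    # Monte-Carlo estimate of KL(pi_theta || pi_ref) from on-policy samples.
    kl_estimate = (logp_policy_online - logp_ref_online).mean()
    return contrastive_loss + kl_coef * kl_estimate

# Example with dummy values.
loss = hybrid_step(torch.tensor(0.7), torch.randn(8), torch.randn(8))
```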
arXiv Detail & Related papers (2024-06-03T15:51:04Z)
- D2PO: Discriminator-Guided DPO with Response Evaluation Models [63.71853401569461]
We propose D2PO, discriminator-guided DPO, for the online setting where preferences are being collected throughout learning.
As gold preferences are collected, we use them not only to train the policy but also to train a discriminative response evaluation model that silver-labels additional synthetic data for policy training.
We show conditions under which silver labeling is most helpful: it is most effective when the policy is trained with DPO, where it outperforms traditional PPO, and it benefits from keeping the discriminator separate from the policy model.
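A schematic of the loop this describes; the round structure and the `collect_gold`, `discriminator.fit`, `policy.sample` and `dpo_update` interfaces are hypothetical placeholders for illustration only.
```python
def d2po_loop(policy, discriminator, prompts, collect_gold, dpo_update, n_rounds=10):
    """Schematic D2PO-style loop: gold preferences train both the policy and a
    discriminator, which then silver-labels extra on-policy samples."""
    gold_pairs, silver_pairs = [], []
    for _ in range(n_rounds):
        # 1. Collect a fresh batch of gold preference labels on policy samples.
        gold_pairs += collect_gold(policy, prompts)
        # 2. Refit the response evaluation model (discriminator) on the gold data.
        discriminator.fit(gold_pairs)
        # 3. Silver-label additional on-policy samples with the discriminator.
        for prompt in prompts:
            a, b = policy.sample(prompt), policy.sample(prompt)
            winner, loser = (a, b) if discriminator.prefers(prompt, a, b) else (b, a)
            silver_pairs.append((prompt, winner, loser))
        # 4. Update the policy with DPO on the gold and silver preference pairs.
        dpo_update(policy, gold_pairs + silver_pairs)
```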
arXiv Detail & Related papers (2024-05-02T17:44:41Z)
- From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function [50.812404038684505]
We show that we can derive DPO in the token-level MDP as a general inverse Q-learning algorithm, which satisfies the Bellman equation.
We discuss applications of our work, including information elicitation in multi-turn dialogue, reasoning, agentic applications and end-to-end training of multi-model systems.
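The token-level reading rests on the standard relation between the KL-regularised optimal policy and its soft Q-values; in generic notation (not necessarily the paper's):
```latex
\pi^{*}(a \mid s) \;=\; \pi_{\mathrm{ref}}(a \mid s)\,
  \exp\!\Big(\tfrac{1}{\beta}\big(Q^{*}(s,a) - V^{*}(s)\big)\Big)
\quad\Longleftrightarrow\quad
\beta \log \frac{\pi^{*}(a \mid s)}{\pi_{\mathrm{ref}}(a \mid s)}
  \;=\; Q^{*}(s,a) - V^{*}(s)
```
so per-token log-ratios of the learned policy can be read as advantage-like Q-values satisfying a Bellman recursion.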
arXiv Detail & Related papers (2024-04-18T17:37:02Z)
- Token-level Direct Preference Optimization [8.249403373337024]
Fine-tuning pre-trained Large Language Models is essential to align them with human values and intentions.
We introduce Token-level Direct Preference Optimization (TDPO), a novel approach to align LLMs with human preferences by optimizing policy at the token level.
arXiv Detail & Related papers (2024-04-18T08:49:38Z)
- RS-DPO: A Hybrid Rejection Sampling and Direct Preference Optimization Method for Alignment of Large Language Models [7.676477609461592]
Reinforcement learning from human feedback (RLHF) has been extensively employed to align large language models with user intent.
DPO relies on contrastive responses generated by human annotators and alternative LLMs rather than by the policy model.
In this paper, we address both challenges by systematically combining rejection sampling (RS) and DPO.
Our proposed method effectively fine-tunes LLMs in limited-resource environments, leading to improved alignment with user intent.
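A rough sketch of the kind of pipeline this suggests: sample k responses per prompt, score them with a reward model, and keep pairs whose reward gap is large enough to serve as DPO training data. The threshold rule and the `policy.sample` / `reward_model.score` interfaces are assumptions for illustration.
```python
from itertools import combinations

def build_rs_pairs(policy, reward_model, prompts, k=8, margin=1.0):
    """Construct preference pairs for DPO by rejection sampling from the policy."""
    pairs = []
    for prompt in prompts:
        responses = [policy.sample(prompt) for _ in range(k)]
        scores = [reward_model.score(prompt, r) for r in responses]
        # Keep only response pairs whose reward gap exceeds the margin.
        for (r1, s1), (r2, s2) in combinations(zip(responses, scores), 2):
            if abs(s1 - s2) >= margin:
                winner, loser = (r1, r2) if s1 > s2 else (r2, r1)
                pairs.append((prompt, winner, loser))
    return pairs  # feed these pairs to a standard DPO objective
```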
arXiv Detail & Related papers (2024-02-15T16:00:58Z)
- Towards Efficient Exact Optimization of Language Model Alignment [93.39181634597877]
Direct preference optimization (DPO) was proposed to directly optimize the policy from preference data.
We show that DPO, although derived from the optimal solution of the problem, leads in practice to a compromised, mean-seeking approximation of that solution.
We propose efficient exact optimization (EXO) of the alignment objective.
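The mean-seeking vs. mode-seeking contrast invoked here is the usual asymmetry of the KL divergence; writing $\pi^{*}$ for the policy targeted by the regularised alignment objective (a generic statement of the contrast, not the paper's exact derivation):
```latex
\underbrace{\mathrm{KL}\big(\pi_\theta \,\|\, \pi^{*}\big)}_{\text{reverse KL: mode-seeking}}
\qquad\text{vs.}\qquad
\underbrace{\mathrm{KL}\big(\pi^{*} \,\|\, \pi_\theta\big)}_{\text{forward KL: mean-seeking}}
```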
arXiv Detail & Related papers (2024-02-01T18:51:54Z)
- Nash Learning from Human Feedback [86.09617990412941]
We introduce an alternative pipeline for the fine-tuning of large language models using pairwise human feedback.
We term this approach Nash learning from human feedback (NLHF).
We present a novel algorithmic solution, Nash-MD, founded on the principles of mirror descent.
arXiv Detail & Related papers (2023-12-01T19:26:23Z)
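As a schematic reminder of the Nash-MD update referenced throughout this page (approximate notation, using a geometric mixture between the current and reference policies; not a verbatim statement of the paper's algorithm):
```latex
\pi_t^{\beta}(y) \;\propto\; \pi_t(y)^{\,1-\beta}\,\pi_{\mathrm{ref}}(y)^{\beta},
\qquad
\pi_{t+1} \;=\; \arg\max_{\pi}\;
  \eta\,\mathbb{E}_{y\sim\pi,\;y'\sim\pi_t^{\beta}}\!\big[\mathcal{P}(y \succ y')\big]
  \;-\; \mathrm{KL}\!\big(\pi \,\|\, \pi_t^{\beta}\big)
```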