Aligning Language Models with Preferences through f-divergence
Minimization
- URL: http://arxiv.org/abs/2302.08215v2
- Date: Tue, 6 Jun 2023 13:09:21 GMT
- Title: Aligning Language Models with Preferences through f-divergence
Minimization
- Authors: Dongyoung Go, Tomasz Korbak, Germán Kruszewski, Jos Rozen, Nahyeon Ryu, Marc Dymetman
- Abstract summary: f-DPG allows the use of any f-divergence to approximate any target distribution that can be evaluated.
We show that Jensen-Shannon divergence strikes a good balance between these objectives, and frequently outperforms forward KL divergence by a wide margin.
- Score: 4.952674870169772
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Aligning language models with preferences can be posed as approximating a
target distribution representing some desired behavior. Existing approaches
differ both in the functional form of the target distribution and the algorithm
used to approximate it. For instance, Reinforcement Learning from Human
Feedback (RLHF) corresponds to minimizing a reverse KL from an implicit target
distribution arising from a KL penalty in the objective. On the other hand,
Generative Distributional Control (GDC) has an explicit target distribution and
minimizes a forward KL from it using the Distributional Policy Gradient (DPG)
algorithm. In this paper, we propose a new approach, f-DPG, which allows the
use of any f-divergence to approximate any target distribution that can be
evaluated. f-DPG unifies both frameworks (RLHF, GDC) and the approximation
methods (DPG, RL with KL penalties). We show the practical benefits of various
choices of divergence objectives and demonstrate that there is no universally
optimal objective but that different divergences present different alignment
and diversity trade-offs. We show that Jensen-Shannon divergence strikes a good
balance between these objectives, and frequently outperforms forward KL
divergence by a wide margin, leading to significant improvements over prior
work. These distinguishing characteristics between divergences persist as the
model size increases, highlighting the importance of selecting appropriate
divergence objectives.
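As a reference for the unification claim above, the quantities involved can be sketched as follows (notation here is generic and chosen for illustration; the paper's conventions and estimators may differ, and the target p is assumed evaluable up to normalization):

```latex
% f-divergence of a target p from the policy \pi_\theta (f convex, f(1) = 0)
D_f(p \,\|\, \pi_\theta) \;=\; \mathbb{E}_{x \sim \pi_\theta}\!\left[ f\!\left(\tfrac{p(x)}{\pi_\theta(x)}\right) \right]

% Special cases of the generator f:
%   forward KL (GDC / DPG):             f(t) = t \log t
%   reverse KL (RLHF with a KL penalty): f(t) = -\log t
%   Jensen-Shannon (one common convention):
%       f(t) = \tfrac{t}{2}\log\tfrac{2t}{1+t} + \tfrac{1}{2}\log\tfrac{2}{1+t}

% Implicit target induced by maximizing \mathbb{E}_{\pi}[r(x)] - \beta\,\mathrm{KL}(\pi \,\|\, a)
% for a reward r and initial model a:
p(x) \;\propto\; a(x)\, \exp\!\big(r(x)/\beta\big)

% A REINFORCE-style gradient identity that follows from the definition above,
% with importance ratio \rho(x) = p(x)/\pi_\theta(x):
\nabla_\theta D_f(p \,\|\, \pi_\theta)
  \;=\; \mathbb{E}_{x \sim \pi_\theta}\!\Big[ \big(f(\rho(x)) - \rho(x)\, f'(\rho(x))\big)\, \nabla_\theta \log \pi_\theta(x) \Big]
```

Here the forward-KL case corresponds to the GDC/DPG setting and the reverse-KL case to RLHF with a KL penalty, matching the unification described in the abstract; in practice the unknown normalizer of p is handled separately with importance-weighted estimates.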
Related papers
- MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with
Diverse Human Preferences [101.57443597426374]
Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data.
We learn a mixture of preference distributions via an expectation-maximization algorithm to better represent diverse human preferences.
Our algorithm achieves an average improvement of more than 16% in win-rates over conventional RLHF algorithms.
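As a rough illustration of what a mixture of preference distributions fit by expectation-maximization could look like, consider a generic Bradley-Terry mixture (a hedged sketch, not necessarily the paper's exact formulation):

```latex
% Mixture of K Bradley-Terry preference models with weights w_k and reward models r_k:
P(y_w \succ y_l \mid x) \;=\; \sum_{k=1}^{K} w_k\, \sigma\!\big(r_k(x, y_w) - r_k(x, y_l)\big)

% E-step: responsibility of component k for preference example i
\gamma_{ik} \;\propto\; w_k\, \sigma\!\big(r_k(x_i, y_w^i) - r_k(x_i, y_l^i)\big)

% M-step: re-estimate mixture weights and fit each reward model by
% responsibility-weighted maximum likelihood
w_k \;=\; \tfrac{1}{N}\sum_{i=1}^{N} \gamma_{ik},
\qquad
r_k \;\leftarrow\; \arg\max_{r}\; \sum_{i=1}^{N} \gamma_{ik}\, \log \sigma\!\big(r(x_i, y_w^i) - r(x_i, y_l^i)\big)
```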
arXiv Detail & Related papers (2024-02-14T03:56:27Z)
- BRAIn: Bayesian Reward-conditioned Amortized Inference for natural language generation from feedback [30.894025833141537]
High variance of the gradient estimate is identified as the primary reason for the limited success of distribution-matching methods such as DPG and GDC.
We generalize the target distribution in DPG, GDC and DPO by using Bayes' rule to define the reward-conditioned posterior.
The resulting approach, referred to as BRAIn, significantly outperforms prior art in summarization and Anthropic HH tasks.
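The Bayes-rule construction mentioned above can be sketched generically as a reward-conditioned posterior (a standard form given for orientation; BRAIn's exact parameterization may differ):

```latex
% Prior language model \pi_0(y \mid x) and a reward-derived "success" likelihood p(O = 1 \mid x, y):
p(y \mid x, O = 1) \;=\; \frac{\pi_0(y \mid x)\, p(O = 1 \mid x, y)}
                              {\sum_{y'} \pi_0(y' \mid x)\, p(O = 1 \mid x, y')}

% Choosing p(O = 1 \mid x, y) \propto \exp\!\big(r(x, y)/\beta\big) recovers the
% exponentially tilted targets used by DPG/GDC- and RLHF-style objectives.
```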
arXiv Detail & Related papers (2024-02-04T13:16:29Z)
- Distribution-Dependent Rates for Multi-Distribution Learning [26.38831409926518]
The recent multi-distribution learning (MDL) framework, which seeks a single predictor that performs well under the worst of several data distributions, tackles this objective through dynamic interaction with the environment.
We provide distribution-dependent guarantees in the MDL regime that scale with suboptimality gaps and yield improved dependence on the sample size.
We devise an adaptive optimistic algorithm, LCB-DR, that showcases enhanced dependence on the gaps, mirroring the contrast between uniform and optimistic allocation in the multi-armed bandit literature.
arXiv Detail & Related papers (2023-12-20T15:50:16Z)
- Adaptive importance sampling for heavy-tailed distributions via $\alpha$-divergence minimization [2.879807093604632]
We propose an AIS algorithm that approximates the target by Student-t proposal distributions.
We adapt location and scale parameters by matching the escort moments of the target and the proposal.
These updates minimize the $\alpha$-divergence between the target and the proposal, thereby connecting with variational inference.
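For reference, one common convention for the quantities named above is sketched below (the exact scaling and update rule used in the paper may differ):

```latex
% \alpha-divergence between target p and proposal q (Amari-type convention):
D_\alpha(p \,\|\, q) \;=\; \frac{1}{\alpha(\alpha - 1)}
  \left( \int p(x)^{\alpha}\, q(x)^{1-\alpha}\, dx \;-\; 1 \right)

% Escort distribution of the target at order \alpha:
\tilde{p}_\alpha(x) \;=\; \frac{p(x)^{\alpha}}{\int p(u)^{\alpha}\, du}

% Schematic moment matching: align the proposal's location and scale with the
% mean and covariance of the escort distribution, e.g.
\mu \;=\; \mathbb{E}_{\tilde{p}_\alpha}[x], \qquad
\Sigma \;=\; \mathrm{Cov}_{\tilde{p}_\alpha}(x)
```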
arXiv Detail & Related papers (2023-10-25T14:07:08Z)
- Value-Distributional Model-Based Reinforcement Learning [59.758009422067]
Quantifying uncertainty about a policy's long-term performance is important to solve sequential decision-making tasks.
We study the problem from a model-based Bayesian reinforcement learning perspective.
We propose Epistemic Quantile-Regression (EQR), a model-based algorithm that learns a value distribution function.
arXiv Detail & Related papers (2023-08-12T14:59:19Z)
- A Variational Perspective on Generative Flow Networks [21.97829447881589]
Generative flow networks (GFNs) are models for sequential sampling of composite objects.
We define variational objectives for GFNs in terms of Kullback-Leibler (KL) divergences between the forward and backward distributions.
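Concretely, these KL objectives can be written at the trajectory level in standard GFN notation (a sketch rather than the paper's exact statement):

```latex
% Forward and backward trajectory distributions for \tau = (s_0 \to s_1 \to \dots \to s_n = x),
% with reward R(x) and normalizer Z:
P_F(\tau) \;=\; \prod_{t=0}^{n-1} P_F(s_{t+1} \mid s_t),
\qquad
P_B(\tau) \;=\; \frac{R(x)}{Z} \prod_{t=0}^{n-1} P_B(s_t \mid s_{t+1})

% Variational objectives: either direction of the KL divergence between them
\mathcal{L}_{\mathrm{fwd}} \;=\; D_{\mathrm{KL}}\!\big(P_F \,\|\, P_B\big),
\qquad
\mathcal{L}_{\mathrm{rev}} \;=\; D_{\mathrm{KL}}\!\big(P_B \,\|\, P_F\big)
```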
arXiv Detail & Related papers (2022-10-14T17:45:59Z)
- Score-Based Diffusion meets Annealed Importance Sampling [89.92133671626327]
Annealed Importance Sampling (AIS) remains one of the most effective methods for marginal likelihood estimation.
We leverage recent progress in score-based generative modeling to approximate the optimal extended target distribution for AIS proposals.
arXiv Detail & Related papers (2022-08-16T12:13:29Z)
- Personalized Trajectory Prediction via Distribution Discrimination [78.69458579657189]
Trajectory prediction is confronted with the dilemma of capturing the multi-modal nature of future dynamics.
We present a distribution discrimination (DisDis) method to predict personalized motion patterns.
Our method can be integrated with existing multi-modal predictive models as a plug-and-play module.
arXiv Detail & Related papers (2021-07-29T17:42:12Z)
- KL Guided Domain Adaptation [88.19298405363452]
Domain adaptation is an important problem and often needed for real-world applications.
A common approach in the domain adaptation literature is to learn a representation of the input that has the same distributions over the source and the target domain.
We show that with a probabilistic representation network, the KL term can be estimated efficiently via minibatch samples.
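A minimal sketch of such a minibatch KL estimate, assuming a diagonal-Gaussian probabilistic encoder and approximating each marginal representation distribution by a minibatch mixture (function and variable names here are illustrative, not from the paper):

```python
import numpy as np

def diag_gauss_logpdf(z, mu, sigma):
    """Log-density of a diagonal Gaussian, summed over feature dimensions."""
    return -0.5 * np.sum(((z - mu) / sigma) ** 2 + 2 * np.log(sigma) + np.log(2 * np.pi), axis=-1)

def minibatch_kl(mu_s, sig_s, mu_t, sig_t, rng=np.random.default_rng(0)):
    """Monte Carlo estimate of KL(p_s(z) || p_t(z)), where p_s and p_t are the marginal
    representation distributions induced by an encoder q(z|x) = N(mu(x), sigma(x)^2)
    on a source minibatch (mu_s, sig_s) and a target minibatch (mu_t, sig_t), each (B, d)."""
    B, d = mu_s.shape
    z = mu_s + sig_s * rng.standard_normal((B, d))  # z_i ~ q(. | x_i^source), one sample per point
    # Approximate each marginal by the minibatch mixture (1/B) * sum_j q(z | x_j).
    log_ps = np.array([np.logaddexp.reduce(diag_gauss_logpdf(zi, mu_s, sig_s)) - np.log(B) for zi in z])
    log_pt = np.array([np.logaddexp.reduce(diag_gauss_logpdf(zi, mu_t, sig_t)) - np.log(B) for zi in z])
    # KL(p_s || p_t) ~= E_{z ~ p_s}[log p_s(z) - log p_t(z)]
    return float(np.mean(log_ps - log_pt))

# Illustrative usage with random encoder outputs (B=64 examples, d=16 features):
rng = np.random.default_rng(1)
mu_s, mu_t = rng.normal(size=(64, 16)), rng.normal(loc=0.5, size=(64, 16))
sig_s = sig_t = np.full((64, 16), 0.3)
print(minibatch_kl(mu_s, sig_s, mu_t, sig_t))
```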
arXiv Detail & Related papers (2021-06-14T22:24:23Z)
- Implicit Distributional Reinforcement Learning [61.166030238490634]
The implicit distributional actor-critic (IDAC) is built on two deep generator networks (DGNs) and a semi-implicit actor (SIA) powered by a flexible policy distribution.
We observe IDAC outperforms state-of-the-art algorithms on representative OpenAI Gym environments.
arXiv Detail & Related papers (2020-07-13T02:52:18Z)