A density estimation perspective on learning from pairwise human
preferences
- URL: http://arxiv.org/abs/2311.14115v3
- Date: Wed, 10 Jan 2024 16:11:32 GMT
- Title: A density estimation perspective on learning from pairwise human
preferences
- Authors: Vincent Dumoulin, Daniel D. Johnson, Pablo Samuel Castro, Hugo
Larochelle, Yann Dauphin
- Abstract summary: We show that for a family of generative processes defined via preference behavior distribution equations, training a reward function on pairwise preferences effectively models an annotator's implicit preference distribution.
We discuss and present findings on "annotator misspecification" -- failure cases where wrong modeling assumptions are made about annotator behavior, resulting in poorly-adapted models.
- Score: 32.64330423345252
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning from human feedback (LHF) -- and in particular learning from
pairwise preferences -- has recently become a crucial ingredient in training
large language models (LLMs), and has been the subject of much research. Most
recent works frame it as a reinforcement learning problem, where a reward
function is learned from pairwise preference data and the LLM is treated as a
policy which is adapted to maximize the rewards, often under additional
regularization constraints. We propose an alternative interpretation which
centers on the generative process for pairwise preferences and treats LHF as a
density estimation problem. We provide theoretical and empirical results
showing that for a family of generative processes defined via preference
behavior distribution equations, training a reward function on pairwise
preferences effectively models an annotator's implicit preference distribution.
Finally, we discuss and present findings on "annotator misspecification" --
failure cases where wrong modeling assumptions are made about annotator
behavior, resulting in poorly-adapted models -- suggesting that approaches that
learn from pairwise human preferences could have trouble learning from a
population of annotators with diverse viewpoints.
Related papers
- A Survey on Human Preference Learning for Large Language Models [81.41868485811625]
The recent surge of versatile large language models (LLMs) largely depends on aligning increasingly capable foundation models with human intentions by preference learning.
This survey covers the sources and formats of preference feedback, the modeling and usage of preference signals, as well as the evaluation of the aligned LLMs.
arXiv Detail & Related papers (2024-06-17T03:52:51Z) - RLHF from Heterogeneous Feedback via Personalization and Preference Aggregation [24.374185140811115]
Reinforcement learning from human feedback (RLHF) has been an effective technique for aligning AI systems with human values.
In this paper, we focus on addressing the issues due to the inherent heterogeneity in human preferences, as well as their potential strategic behavior in providing feedback.
We propose two frameworks to address heterogeneous human feedback in principled ways: personalization-based one and aggregation-based one.
arXiv Detail & Related papers (2024-04-30T23:57:23Z) - Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data [102.16105233826917]
Learning from preference labels plays a crucial role in fine-tuning large language models.
There are several distinct approaches for preference fine-tuning, including supervised learning, on-policy reinforcement learning (RL), and contrastive learning.
arXiv Detail & Related papers (2024-04-22T17:20:18Z) - MaxMin-RLHF: Towards Equitable Alignment of Large Language Models with
Diverse Human Preferences [101.57443597426374]
Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data.
We learn a mixture of preference distributions via an expectation-maximization algorithm to better represent diverse human preferences.
Our algorithm achieves an average improvement of more than 16% in win-rates over conventional RLHF algorithms.
arXiv Detail & Related papers (2024-02-14T03:56:27Z) - Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
arXiv Detail & Related papers (2024-01-11T17:56:59Z) - Improving Heterogeneous Model Reuse by Density Estimation [105.97036205113258]
This paper studies multiparty learning, aiming to learn a model using the private data of different participants.
Model reuse is a promising solution for multiparty learning, assuming that a local model has been trained for each party.
arXiv Detail & Related papers (2023-05-23T09:46:54Z) - Models of human preference for learning reward functions [80.39289349661364]
We learn the reward function from human-generated preferences between pairs of trajectory segments.
We find this assumption to be flawed and propose modeling human preferences as informed by each segment's regret.
Our proposed regret preference model better predicts real human preferences and also learns reward functions from these preferences that lead to policies that are better human-aligned.
arXiv Detail & Related papers (2022-06-05T17:58:02Z) - Optimal Model Averaging: Towards Personalized Collaborative Learning [0.0]
In federated learning, differences in the data or objectives between the participating nodes motivate approaches to train a personalized machine learning model for each node.
One such approach is weighted averaging between a locally trained model and the global model.
We find that there is always some positive amount of model averaging that reduces the expected squared error compared to the local model.
arXiv Detail & Related papers (2021-10-25T13:33:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.