Models of human preference for learning reward functions
- URL: http://arxiv.org/abs/2206.02231v3
- Date: Wed, 6 Sep 2023 21:13:28 GMT
- Title: Models of human preference for learning reward functions
- Authors: W. Bradley Knox, Stephane Hatgis-Kessell, Serena Booth, Scott Niekum,
Peter Stone, Alessandro Allievi
- Abstract summary: We learn the reward function from human-generated preferences between pairs of trajectory segments.
We find this assumption to be flawed and propose modeling human preferences as informed by each segment's regret.
Our proposed regret preference model better predicts real human preferences and also learns reward functions from these preferences that lead to policies that are better human-aligned.
- Score: 80.39289349661364
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The utility of reinforcement learning is limited by the alignment of reward
functions with the interests of human stakeholders. One promising method for
alignment is to learn the reward function from human-generated preferences
between pairs of trajectory segments, a type of reinforcement learning from
human feedback (RLHF). These human preferences are typically assumed to be
informed solely by partial return, the sum of rewards along each segment. We
find this assumption to be flawed and propose modeling human preferences
instead as informed by each segment's regret, a measure of a segment's
deviation from optimal decision-making. Given infinitely many preferences
generated according to regret, we prove that we can identify a reward function
equivalent to the reward function that generated those preferences, and we
prove that the previous partial return model lacks this identifiability
property in multiple contexts. We empirically show that our proposed regret
preference model outperforms the partial return preference model with finite
training data in otherwise the same setting. Additionally, we find that our
proposed regret preference model better predicts real human preferences and
also learns reward functions from these preferences that lead to policies that
are better human-aligned. Overall, this work establishes that the choice of
preference model is impactful, and our proposed regret preference model
provides an improvement upon a core assumption of recent research. We have open
sourced our experimental code, the human preferences dataset we gathered, and
our training and preference elicitation interfaces for gathering such a
dataset.
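
To make the contrast concrete, the following is a minimal, illustrative Python sketch of the two preference models discussed in the abstract. It is not the authors' implementation: the segment representation, the use of exact optimal values V* and Q*, and the unit temperature are assumptions made for illustration. The partial return model scores a segment by its summed reward, while the regret-style model scores it by (the negation of) its deviation from optimal decision-making, approximated here as a sum of optimal advantages.

```python
import numpy as np

def partial_return(segment, reward_fn):
    """Sum of rewards over a segment of (state, action, next_state) transitions."""
    return sum(reward_fn(s, a, s_next) for s, a, s_next in segment)

def negated_regret(segment, v_star, q_star):
    """Illustrative regret-style score: the sum of optimal advantages along the segment.

    Each term q_star(s, a) - v_star(s) is non-positive and measures how far the
    chosen action deviates from optimal decision-making at that state. The paper's
    exact regret definition differs in detail; this is only an approximation.
    """
    return sum(q_star(s, a) - v_star(s) for s, a, _ in segment)

def preference_prob(score_1, score_2):
    """Boltzmann (logistic) probability that segment 1 is preferred over segment 2."""
    return 1.0 / (1.0 + np.exp(score_2 - score_1))

def compare_models(seg_1, seg_2, reward_fn, v_star, q_star):
    """Preference probability for seg_1 over seg_2 under each of the two models."""
    p_partial = preference_prob(partial_return(seg_1, reward_fn),
                                partial_return(seg_2, reward_fn))
    p_regret = preference_prob(negated_regret(seg_1, v_star, q_star),
                               negated_regret(seg_2, v_star, q_star))
    return p_partial, p_regret
```

Intuitively, two segments with equal summed reward are indistinguishable under the partial return model, whereas the regret-style score can still separate them by how close their actions are to optimal behavior.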
Related papers
- Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback [87.37721254914476]
We introduce a routing framework that combines inputs from humans and LMs to achieve better annotation quality.
We train a performance prediction model to predict a reward model's performance on an arbitrary combination of human and LM annotations.
We show that the selected hybrid mixture achieves better reward model performance compared to using either one exclusively.
arXiv Detail & Related papers (2024-10-24T20:04:15Z)
- LRHP: Learning Representations for Human Preferences via Preference Pairs [45.056558199304554]
We introduce a preference representation learning task that aims to construct a richer and more structured representation of human preferences.
We verify the utility of preference representations in two downstream tasks: preference data selection and preference margin prediction.
arXiv Detail & Related papers (2024-10-06T14:48:28Z)
- General Preference Modeling with Preference Representations for Aligning Language Models [51.14207112118503]
We introduce preference representation learning, an approach that embeds responses into a latent space to capture intricate preference structures efficiently.
We also propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback.
Our method may enhance the alignment of foundation models with nuanced human values.
arXiv Detail & Related papers (2024-10-03T04:22:55Z)
- Robust Preference Optimization through Reward Model Distillation [68.65844394615702]
Language model (LM) post-training involves maximizing a reward function that is derived from preference annotations.
DPO is a popular offline alignment method that trains a policy directly on preference data, without training a separate reward model or applying reinforcement learning (a minimal sketch of the standard DPO objective appears after this list).
We analyze this phenomenon and propose distillation to get a better proxy for the true preference distribution over generation pairs.
arXiv Detail & Related papers (2024-05-29T17:39:48Z)
- Secrets of RLHF in Large Language Models Part II: Reward Modeling [134.97964938009588]
We introduce a series of novel methods to mitigate the influence of incorrect and ambiguous preferences in the dataset.
We also introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses.
arXiv Detail & Related papers (2024-01-11T17:56:59Z)
- A density estimation perspective on learning from pairwise human preferences [32.64330423345252]
We show that for a family of generative processes defined via preference behavior distribution equations, training a reward function on pairwise preferences effectively models an annotator's implicit preference distribution.
We discuss and present findings on "annotator misspecification" -- failure cases where wrong modeling assumptions are made about annotator behavior, resulting in poorly-adapted models.
arXiv Detail & Related papers (2023-11-23T17:20:36Z)
- Learning Optimal Advantage from Preferences and Mistaking it for Reward [43.58066500250688]
Most recent work assumes that human preferences are generated based only upon the reward accrued within the compared trajectory segments, i.e., their partial return.
We investigate the consequences of assuming preferences are based upon partial return when they actually arise from regret.
Overall, this paper provides insight into why learning under the partial return preference model tends to work so well in practice, despite that model conforming poorly to how humans actually give preferences.
arXiv Detail & Related papers (2023-10-03T21:58:24Z)
- Batch Reinforcement Learning from Crowds [24.717084423091865]
A shortcoming of batch reinforcement learning is that it requires reward labels in the data.
Existing settings that address the lack of reward, such as behavioral cloning, rely on optimal demonstrations collected from humans.
This paper addresses the lack of reward in a batch reinforcement learning setting by learning a reward function from preferences.
arXiv Detail & Related papers (2021-11-08T05:46:33Z)
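
For readers unfamiliar with the DPO objective referenced in the "Robust Preference Optimization through Reward Model Distillation" entry above, the following is a minimal sketch of the standard DPO loss on a single preference pair. The function name, the log-probability inputs, and the default beta value are illustrative assumptions, not details of that paper.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    logp_*     : log-probability of the chosen / rejected response under the
                 policy being trained.
    ref_logp_* : the same quantities under a frozen reference policy.
    beta       : temperature controlling how far the policy may move from the
                 reference (value here is illustrative).
    """
    # Implicit reward margin: how much more the trained policy favors the chosen
    # response over the rejected one, relative to the reference policy.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the margin; minimized when the margin is large.
    return math.log(1.0 + math.exp(-margin))
```

Minimizing this loss pushes the policy to assign relatively more probability to the chosen response than the frozen reference does, which is how DPO avoids training an explicit reward model.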
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.