Learning Optimal Advantage from Preferences and Mistaking it for Reward
- URL: http://arxiv.org/abs/2310.02456v1
- Date: Tue, 3 Oct 2023 21:58:24 GMT
- Title: Learning Optimal Advantage from Preferences and Mistaking it for Reward
- Authors: W. Bradley Knox, Stephane Hatgis-Kessell, Sigurdur Orn Adalgeirsson,
Serena Booth, Anca Dragan, Peter Stone, Scott Niekum
- Abstract summary: Most recent work assumes that human preferences are generated based only upon the reward accrued within those segments, or their partial return.
We investigate the consequences of assuming preferences are based upon partial return when they actually arise from regret.
This paper overall provides insight regarding why learning under the partial return preference model tends to work so well in practice, despite it conforming poorly to how humans give preferences.
- Score: 43.58066500250688
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider algorithms for learning reward functions from human preferences
over pairs of trajectory segments, as used in reinforcement learning from human
feedback (RLHF). Most recent work assumes that human preferences are generated
based only upon the reward accrued within those segments, or their partial
return. Recent work casts doubt on the validity of this assumption, proposing
an alternative preference model based upon regret. We investigate the
consequences of assuming preferences are based upon partial return when they
actually arise from regret. We argue that the learned function is an
approximation of the optimal advantage function, $\hat{A^*_r}$, not a reward
function. We find that if a specific pitfall is addressed, this incorrect
assumption is not particularly harmful, resulting in a highly shaped reward
function. Nonetheless, this incorrect usage of $\hat{A^*_r}$ is less desirable
than the appropriate and simpler approach of greedy maximization of
$\hat{A^*_r}$. From the perspective of the regret preference model, we also
provide a clearer interpretation of fine-tuning contemporary large language
models with RLHF. This paper overall provides insight regarding why learning
under the partial return preference model tends to work so well in practice,
despite it conforming poorly to how humans give preferences.
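To make the two preference models concrete, here is a minimal sketch in Python of the Bradley-Terry-style preference probability under the partial-return model and under the regret model, plus the greedy use of a learned optimal-advantage estimate that the abstract favors. All names (reward_fn, optimal_advantage_fn, segments as lists of state-action pairs) are illustrative assumptions, not the paper's code.

```python
import math

def partial_return(segment, reward_fn):
    """Partial return: the reward accrued within the segment."""
    return sum(reward_fn(s, a) for s, a in segment)

def summed_optimal_advantage(segment, optimal_advantage_fn):
    """Negated regret of the segment: the sum of optimal advantages
    A*_r(s, a) along it (higher means less deviation from optimal behavior)."""
    return sum(optimal_advantage_fn(s, a) for s, a in segment)

def preference_prob(score_1, score_2):
    """Bradley-Terry / logistic link: P(segment 1 preferred over segment 2)."""
    return 1.0 / (1.0 + math.exp(score_2 - score_1))

# Partial-return preference model (the common assumption):
#   preference_prob(partial_return(seg1, r), partial_return(seg2, r))
# Regret preference model (the alternative this paper analyzes):
#   preference_prob(summed_optimal_advantage(seg1, A), summed_optimal_advantage(seg2, A))

def greedy_policy(state, actions, a_hat):
    """The simpler use of a learned optimal-advantage estimate advocated in the
    abstract: act greedily with respect to it, rather than treating it as a
    reward function for another round of RL."""
    return max(actions, key=lambda a: a_hat(state, a))
```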
Related papers
- Choice between Partial Trajectories [19.39067577784909] (2024-10-30)
It has been suggested that AI agents learn preferences from human choice data.
This approach requires a model of choice behavior that the agent can use to interpret the data.
We consider an alternative model based on the bootstrapped return, which adds to the partial return an estimate of the future return.
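A minimal sketch of that bootstrapped-return statistic, assuming a hypothetical learned state-value estimate value_fn and segments given as (state, action, reward) triples; the names and discounting are illustrative, not the paper's notation.

```python
def bootstrapped_return(segment, final_state, value_fn, gamma=0.99):
    """Partial return plus a bootstrapped estimate of the return after the
    segment ends: sum_t gamma^t * r_t + gamma^T * V(s_T)."""
    partial = sum((gamma ** t) * r for t, (s, a, r) in enumerate(segment))
    return partial + (gamma ** len(segment)) * value_fn(final_state)
```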
- Robust Reinforcement Learning from Corrupted Human Feedback [86.17030012828003] (2024-06-21)
Reinforcement learning from human feedback (RLHF) provides a principled framework for aligning AI systems with human preference data.
We propose a robust RLHF approach, $R^3M$, which models potentially corrupted preference labels as sparse outliers.
Our experiments on robotic control and natural language generation with large language models (LLMs) show that $R3M$ improves robustness of the reward against several types of perturbations to the preference data.
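The sparse-outlier idea can be sketched roughly as a Bradley-Terry loss with a learned per-comparison correction term that an l1 penalty pushes toward zero. This is only a schematic of the general idea under assumed names, not the paper's exact $R^3M$ objective.

```python
import numpy as np

def robust_bt_loss(reward_diffs, labels, outliers, l1_weight=1.0):
    """Schematic robust preference loss: logistic (Bradley-Terry) loss on the
    reward difference plus a learned per-comparison offset, with an l1 penalty
    so that most offsets stay near zero (corruption treated as sparse).

    reward_diffs: array of r(seg1) - r(seg2) for each comparison
    labels:       1.0 if seg1 was preferred, 0.0 otherwise
    outliers:     per-comparison correction terms (learned jointly)
    """
    logits = reward_diffs + outliers
    probs = 1.0 / (1.0 + np.exp(-logits))
    nll = -(labels * np.log(probs + 1e-12)
            + (1.0 - labels) * np.log(1.0 - probs + 1e-12))
    return nll.mean() + l1_weight * np.abs(outliers).mean()
```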
- KTO: Model Alignment as Prospect Theoretic Optimization [67.44320255397506] (2024-02-02)
Kahneman & Tversky's prospect theory tells us that humans perceive random variables in a biased but well-defined manner.
We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases.
We propose a human-aware loss (HALO) that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences.
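A rough sketch of a KTO-style human-aware loss, assuming a per-example implicit reward r = beta * log(pi_theta(y|x) / pi_ref(y|x)) and a reference point z_ref (e.g., a batch-level KL estimate). The paper's exact weighting and reference-point construction differ, so treat this as a schematic only.

```python
import math

def kto_style_loss(implicit_reward, z_ref, desirable,
                   w_desirable=1.0, w_undesirable=1.0):
    """Schematic human-aware loss for a single example.

    implicit_reward: beta * log(pi_theta(y|x) / pi_ref(y|x))
    z_ref:           a reference point (e.g., an estimate of the policy's
                     KL divergence from the reference model)
    desirable:       True if the output was labeled desirable, else False
    """
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    if desirable:
        value = sigmoid(implicit_reward - z_ref)   # gain relative to reference
        return w_desirable * (1.0 - value)
    else:
        value = sigmoid(z_ref - implicit_reward)   # loss relative to reference
        return w_undesirable * (1.0 - value)
```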
- Contrastive Preference Learning: Learning from Human Feedback without RL [71.77024922527642] (2023-10-20)
We introduce Contrastive Preference Learning (CPL), an algorithm for learning optimal policies from preferences without learning reward functions.
CPL is fully off-policy, uses only a simple contrastive objective, and can be applied to arbitrary MDPs.
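CPL's objective can be sketched as a contrastive, Bradley-Terry-style loss in which a segment's score is the discounted sum of the policy's action log-probabilities, standing in for the sum of optimal advantages under a maximum-entropy assumption. The sketch below uses assumed names and omits the paper's conservative variants.

```python
import math

def segment_score(segment, log_prob_fn, alpha=1.0, gamma=0.99):
    """Discounted sum of alpha * log pi(a | s) over the segment, playing the
    role of the summed optimal advantages in the regret preference model."""
    return sum((gamma ** t) * alpha * log_prob_fn(s, a)
               for t, (s, a) in enumerate(segment))

def cpl_style_loss(preferred, rejected, log_prob_fn):
    """Contrastive preference loss: negative log-probability that the preferred
    segment beats the rejected one under a logistic link."""
    diff = segment_score(preferred, log_prob_fn) - segment_score(rejected, log_prob_fn)
    return math.log(1.0 + math.exp(-diff))  # = -log sigmoid(diff)
```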
- Misspecification in Inverse Reinforcement Learning [80.91536434292328] (2022-12-06)
The aim of Inverse Reinforcement Learning (IRL) is to infer a reward function $R$ from a policy $\pi$.
One of the primary motivations behind IRL is to infer human preferences from human behaviour.
However, the behavioural models that IRL typically relies upon do not fully capture real human behaviour; this means that they are misspecified, which raises the worry that they might lead to unsound inferences if applied to real-world data.
- Scaling Laws for Reward Model Overoptimization [19.93331579503503] (2022-10-19)
We study how the gold reward model score changes as we optimize against the proxy reward model using either reinforcement learning or best-of-$n$ sampling.
We also study the effect on this relationship of the size of the reward model dataset, the number of reward model and policy parameters, and the coefficient of the KL penalty added to the reward in the reinforcement learning setup.
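The measurement setup summarized above can be sketched as: generate candidates, pick the one the proxy reward model scores highest (best-of-n), and track how the gold reward model rates that choice as n grows. Everything below (function names, sampling interface) is an illustrative assumption, not the paper's code.

```python
def best_of_n(prompt, sample_fn, proxy_rm, n):
    """Best-of-n sampling against a proxy reward model: draw n candidates and
    return the one the proxy scores highest."""
    candidates = [sample_fn(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: proxy_rm(prompt, y))

def overoptimization_curve(prompts, sample_fn, proxy_rm, gold_rm,
                           ns=(1, 4, 16, 64, 256)):
    """For each n, measure the average *gold* reward of the proxy-selected
    sample; overoptimization shows up when this curve flattens or declines
    even though the proxy reward keeps rising."""
    curve = []
    for n in ns:
        picks = [best_of_n(p, sample_fn, proxy_rm, n) for p in prompts]
        gold = sum(gold_rm(p, y) for p, y in zip(prompts, picks)) / len(prompts)
        curve.append((n, gold))
    return curve
```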
- Models of human preference for learning reward functions [80.39289349661364] (2022-06-05)
We learn the reward function from human-generated preferences between pairs of trajectory segments.
Most prior work assumes these preferences are generated based only upon each segment's partial return; we find this assumption to be flawed and propose modeling human preferences as informed instead by each segment's regret.
Our proposed regret preference model better predicts real human preferences and also learns reward functions from these preferences that lead to policies that are better human-aligned.
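For reference, the regret preference model can be written in terms of the optimal state and action values; the temperature and normalization used in the paper may differ from this simplified form:

$$
\text{regret}(\sigma) = \sum_{t}\big(V^*_r(s_t) - Q^*_r(s_t, a_t)\big) = -\sum_{t} A^*_r(s_t, a_t),
\qquad
P(\sigma_1 \succ \sigma_2) = \frac{\exp\!\big(-\text{regret}(\sigma_1)\big)}{\exp\!\big(-\text{regret}(\sigma_1)\big) + \exp\!\big(-\text{regret}(\sigma_2)\big)}.
$$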
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.