Misspecification in Inverse Reinforcement Learning
- URL: http://arxiv.org/abs/2212.03201v2
- Date: Fri, 24 Mar 2023 12:04:32 GMT
- Title: Misspecification in Inverse Reinforcement Learning
- Authors: Joar Skalse, Alessandro Abate
- Abstract summary: The aim of Inverse Reinforcement Learning (IRL) is to infer a reward function $R$ from a policy $\pi$.
One of the primary motivations behind IRL is to infer human preferences from human behaviour.
However, human behaviour is far more complex than any of the behavioural models currently used in IRL. This means that these models are misspecified, which raises the worry that they might lead to unsound inferences if applied to real-world data.
- Score: 80.91536434292328
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The aim of Inverse Reinforcement Learning (IRL) is to infer a reward function
$R$ from a policy $\pi$. To do this, we need a model of how $\pi$ relates to
$R$. In the current literature, the most common models are optimality,
Boltzmann rationality, and causal entropy maximisation. One of the primary
motivations behind IRL is to infer human preferences from human behaviour.
However, the true relationship between human preferences and human behaviour is
much more complex than any of the models currently used in IRL. This means that
they are misspecified, which raises the worry that they might lead to unsound
inferences if applied to real-world data. In this paper, we provide a
mathematical analysis of how robust different IRL models are to
misspecification, and answer precisely how the demonstrator policy may differ
from each of the standard models before that model leads to faulty inferences
about the reward function $R$. We also introduce a framework for reasoning
about misspecification in IRL, together with formal tools that can be used to
easily derive the misspecification robustness of new IRL models.
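As a concrete illustration of the behavioural models named above, the sketch below computes a Boltzmann-rational policy, $\pi(a \mid s) \propto \exp(\beta Q^*_R(s, a))$, for a toy tabular MDP; the MDP and its numbers are hypothetical and not taken from the paper. In the limit $\beta \to \infty$ this recovers the optimality model.

```python
# Illustrative sketch (not from the paper): a Boltzmann-rational policy
# pi(a|s) proportional to exp(beta * Q*(s, a)) for a small tabular MDP,
# one of the standard behavioural models relating a policy pi to a reward R.
import numpy as np

def optimal_q(P, R, gamma=0.9, iters=500):
    """Value iteration; P[s, a, s'] are transition probabilities, R[s, a] rewards."""
    Q = np.zeros_like(R)
    for _ in range(iters):
        V = Q.max(axis=1)          # greedy state values
        Q = R + gamma * (P @ V)    # Bellman optimality backup: sum_{s'} P[s,a,s'] V[s']
    return Q

def boltzmann_policy(Q, beta=5.0):
    """pi(a|s) proportional to exp(beta * Q(s, a)); beta is the rationality parameter."""
    logits = beta * Q
    logits -= logits.max(axis=1, keepdims=True)   # for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

# Toy 2-state, 2-action MDP with made-up numbers.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.8, 0.2], [0.2, 0.8]]])
R = np.array([[1.0, 0.0],
              [0.0, 0.5]])
pi = boltzmann_policy(optimal_q(P, R))
print(pi)  # rows sum to 1; a larger beta concentrates mass on argmax_a Q*(s, a)
```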
Related papers
- Towards Reliable Alignment: Uncertainty-aware RLHF [14.20181662644689]
We show that fluctuations in reward models can be detrimental to alignment.
We show that such policies are more risk-averse in the sense that they are more cautious of uncertain rewards.
We use this ensemble of reward models to align a language model using our methodology and observe that our empirical findings match our theoretical predictions.
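One simple way to make a policy cautious of uncertain rewards is to penalise ensemble disagreement when scoring a response; the sketch below is an assumed illustration of that idea (a mean-minus-standard-deviation aggregation), not necessarily the paper's exact objective.

```python
# Assumed illustration: a risk-averse reward from an ensemble of reward models,
# scoring a response by the ensemble mean minus a multiple of its disagreement.
from typing import Callable, List
import numpy as np

def conservative_reward(ensemble: List[Callable[[str, str], float]],
                        prompt: str, response: str, kappa: float = 1.0) -> float:
    """Lower score when the reward models disagree about the response."""
    scores = np.array([rm(prompt, response) for rm in ensemble])
    return float(scores.mean() - kappa * scores.std())

# Stand-in reward models (placeholders for trained models), just to show the call.
rms = [lambda p, r, b=b: len(r) / 100.0 + b for b in (0.0, 0.1, -0.05)]
print(conservative_reward(rms, "a prompt", "a candidate response"))
```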
arXiv Detail & Related papers (2024-10-31T08:26:51Z)
- Robust Reinforcement Learning from Corrupted Human Feedback [86.17030012828003]
Reinforcement learning from human feedback (RLHF) provides a principled framework for aligning AI systems with human preference data.
We propose a robust RLHF approach -- $R^3M$, which models potentially corrupted preference labels as sparse outliers.
Our experiments on robotic control and natural language generation with large language models (LLMs) show that $R3M$ improves robustness of the reward against several types of perturbations to the preference data.
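A generic way to model corrupted preference labels as sparse outliers, sketched below purely as an assumed illustration rather than the paper's actual algorithm, is to give each preference pair a slack term in a Bradley-Terry likelihood and penalise the slacks with an L1 term, so that only a sparse subset of labels is explained away as corrupted.

```python
# Assumed illustration (not the paper's exact algorithm): fit a per-example
# slack term under an L1 penalty, so only a sparse set of preference labels
# is treated as corrupted outliers.
import numpy as np

def fit_with_sparse_outliers(margins, labels, lam=0.5, lr=0.1, steps=2000):
    """
    margins[i]: current reward-model margin r(chosen_i) - r(rejected_i).
    labels[i]:  1 if the recorded preference agrees with `chosen`, else 0.
    Returns per-example slack s; a large |s_i| flags a likely corrupted label.
    """
    s = np.zeros_like(margins)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(margins + s)))               # Bradley-Terry probability
        s -= lr * (p - labels)                                  # gradient step on neg. log-lik.
        s = np.sign(s) * np.maximum(np.abs(s) - lr * lam, 0.0)  # soft-threshold (L1 prox)
    return s

margins = np.array([2.0, 1.5, 1.8, -0.2])   # toy numbers
labels = np.array([1.0, 1.0, 0.0, 1.0])     # third label disagrees with a large margin
print(fit_with_sparse_outliers(margins, labels))  # slack concentrates on the third example
```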
arXiv Detail & Related papers (2024-06-21T18:06:30Z)
- RewardBench: Evaluating Reward Models for Language Modeling [100.28366840977966]
We present RewardBench, a benchmark dataset and code-base for evaluation of reward models.
The dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety.
On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods.
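The standard metric on prompt-chosen-rejected trios is the fraction of trios where the reward model scores the chosen response above the rejected one; the sketch below shows that metric with stand-in data and a stand-in reward model (see the RewardBench code-base for the actual evaluation pipeline).

```python
# Minimal sketch of chosen-vs-rejected accuracy on prompt/chosen/rejected trios.
from typing import Callable, Iterable, Tuple

def pairwise_accuracy(reward_model: Callable[[str, str], float],
                      trios: Iterable[Tuple[str, str, str]]) -> float:
    """Fraction of trios where r(prompt, chosen) > r(prompt, rejected)."""
    trios = list(trios)
    wins = sum(reward_model(p, chosen) > reward_model(p, rejected)
               for p, chosen, rejected in trios)
    return wins / len(trios)

# Stand-in reward model and data, just to show the call pattern.
toy_rm = lambda prompt, response: float(len(response))
data = [("Explain IRL.", "A detailed, correct answer.", "idk"),
        ("2+2?", "4", "five hundred")]
print(pairwise_accuracy(toy_rm, data))  # 0.5 for this toy length-based reward model
```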
arXiv Detail & Related papers (2024-03-20T17:49:54Z)
- Quantifying the Sensitivity of Inverse Reinforcement Learning to Misspecification [72.08225446179783]
Inverse reinforcement learning aims to infer an agent's preferences from their behaviour.
To do this, we need a behavioural model of how $\pi$ relates to $R$.
We analyse how sensitive the IRL problem is to misspecification of the behavioural model.
arXiv Detail & Related papers (2024-03-11T16:09:39Z)
- KTO: Model Alignment as Prospect Theoretic Optimization [67.44320255397506]
Kahneman & Tversky's $\textit{prospect theory}$ tells us that humans perceive random variables in a biased but well-defined manner.
We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases.
We propose a human-aware loss (HALO) that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences.
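For reference, the classic Kahneman-Tversky value function is concave for gains and steeper (loss-averse) for losses relative to a reference point; the snippet below uses the standard parameter estimates ($\alpha \approx 0.88$, $\lambda \approx 2.25$) purely as an illustration of that shape, since KTO's actual loss uses its own parameterisation.

```python
# The classic Kahneman-Tversky value function with standard parameter estimates
# (alpha ~ 0.88, lambda ~ 2.25); illustrative only, KTO's loss is parameterised differently.
def kt_value(x: float, alpha: float = 0.88, lam: float = 2.25) -> float:
    """Concave for gains, steeper for losses, relative to a reference point of 0."""
    return x ** alpha if x >= 0 else -lam * ((-x) ** alpha)

print(kt_value(1.0), kt_value(-1.0))  # 1.0 vs -2.25: losses loom larger than gains
```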
arXiv Detail & Related papers (2024-02-02T10:53:36Z)
- Differentially Private Reward Estimation with Preference Feedback [15.943664678210146]
Learning from preference-based feedback has recently gained considerable traction as a promising approach to align generative models with human interests.
An adversarial attack in any step of the above pipeline might reveal private and sensitive information of human labelers.
We focus on the problem of reward estimation from preference-based feedback while protecting the privacy of each individual labeler.
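A standard mechanism for protecting individual labels, shown below only as an illustration and not necessarily the mechanism used in the paper, is randomized response: each binary preference label is kept with probability $e^\epsilon / (1 + e^\epsilon)$ and flipped otherwise, which gives $\epsilon$-label differential privacy.

```python
# Randomized response on binary preference labels: a standard epsilon-DP
# mechanism for label privacy (illustrative; not necessarily the paper's mechanism).
import math
import random

def randomized_response(label: int, epsilon: float) -> int:
    """Keep the true label w.p. e^eps / (1 + e^eps), flip otherwise (eps-label-DP)."""
    keep_prob = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return label if random.random() < keep_prob else 1 - label

# A downstream reward estimator can debias statistics computed from the noisy
# labels using the known keep_prob.
noisy = [randomized_response(y, epsilon=1.0) for y in (1, 0, 1, 1)]
print(noisy)
```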
arXiv Detail & Related papers (2023-10-30T16:58:30Z)
- On the Sensitivity of Reward Inference to Misspecified Human Models [27.94055657571769]
Inferring reward functions from human behavior is at the center of value alignment - aligning AI objectives with what we, humans, actually want.
This raises the question: how accurate do these models need to be in order for the reward inference to be accurate?
We show that it is unfortunately possible to construct small adversarial biases in behavior that lead to arbitrarily large errors in the inferred reward.
arXiv Detail & Related papers (2022-12-09T08:16:20Z)
- Scaling Laws for Reward Model Overoptimization [19.93331579503503]
We study how the gold reward model score changes as we optimize against the proxy reward model using either reinforcement learning or best-of-$n$ sampling.
We also study the effect on this relationship of the size of the reward model dataset, the number of reward model and policy parameters, and the coefficient of the KL penalty added to the reward in the reinforcement learning setup.
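Best-of-$n$ sampling against a proxy reward model is simple to state: draw $n$ samples from the policy and return the one the proxy scores highest; the gold reward of that selection is then measured separately. The sketch below uses stand-in policy and proxy functions, since the paper's models are not reproduced here.

```python
# Best-of-n sampling against a proxy reward model (a minimal sketch; the paper
# studies how the gold reward of the selected sample changes as n grows).
import random
from typing import Callable

def best_of_n(sample: Callable[[str], str],
              proxy_reward: Callable[[str, str], float],
              prompt: str, n: int) -> str:
    """Draw n candidates from the policy and keep the one the proxy scores highest."""
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: proxy_reward(prompt, c))

# Stand-in policy and proxy reward, just to show the selection loop.
toy_policy = lambda prompt: random.choice(["short", "a longer reply", "the longest reply of all"])
toy_proxy = lambda prompt, response: float(len(response))
print(best_of_n(toy_policy, toy_proxy, "some prompt", n=16))
```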
arXiv Detail & Related papers (2022-10-19T17:56:10Z)
- Mismatched No More: Joint Model-Policy Optimization for Model-Based RL [172.37829823752364]
We propose a single objective for jointly training the model and the policy, such that updates to either component increase a lower bound on expected return.
Our objective is a global lower bound on expected return, and this bound becomes tight under certain assumptions.
The resulting algorithm (MnM) is conceptually similar to a GAN.
arXiv Detail & Related papers (2021-10-06T13:43:27Z)