Quantifying the Sensitivity of Inverse Reinforcement Learning to
Misspecification
- URL: http://arxiv.org/abs/2403.06854v1
- Date: Mon, 11 Mar 2024 16:09:39 GMT
- Title: Quantifying the Sensitivity of Inverse Reinforcement Learning to
Misspecification
- Authors: Joar Skalse and Alessandro Abate
- Abstract summary: Inverse reinforcement learning aims to infer an agent's preferences from their behaviour.
To do this, we need a behavioural model of how $\pi$ relates to $R$.
We analyse how sensitive the IRL problem is to misspecification of the behavioural model.
- Score: 72.08225446179783
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inverse reinforcement learning (IRL) aims to infer an agent's preferences
(represented as a reward function $R$) from their behaviour (represented as a
policy $\pi$). To do this, we need a behavioural model of how $\pi$ relates to
$R$. In the current literature, the most common behavioural models are
optimality, Boltzmann-rationality, and causal entropy maximisation. However,
the true relationship between a human's preferences and their behaviour is much
more complex than any of these behavioural models. This means that the
behavioural models are misspecified, which raises the concern that they may
lead to systematic errors if applied to real data. In this paper, we analyse
how sensitive the IRL problem is to misspecification of the behavioural model.
Specifically, we provide necessary and sufficient conditions that completely
characterise how the observed data may differ from the assumed behavioural
model without incurring an error above a given threshold. In addition to this,
we also characterise the conditions under which a behavioural model is robust
to small perturbations of the observed policy, and we analyse how robust many
behavioural models are to misspecification of their parameter values (such as
the discount rate). Our analysis suggests that the IRL problem is highly
sensitive to misspecification, in the sense that very mild misspecification can
lead to very large errors in the inferred reward function.
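As a concrete illustration of the behavioural models named above, the following is a minimal Python sketch of the Boltzmann-rational model, under which actions are chosen with probability proportional to $\exp(\beta Q^*(s,a))$ for the optimal Q-function $Q^*$. The toy MDP, the state-based reward, and the inverse temperature $\beta$ are illustrative assumptions, not the paper's formal setup.

```python
import numpy as np

def boltzmann_policy(P, R, beta=5.0, gamma=0.9, iters=500):
    """Boltzmann-rational behavioural model: pi(a|s) is proportional to
    exp(beta * Q*(s, a)), where Q* is the optimal Q-function of the MDP.

    P: transition tensor of shape (S, A, S); R: state reward vector of shape (S,).
    Everything here is a toy illustration, not the paper's formal setup.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(iters):                       # standard value iteration
        Q = R[:, None] + gamma * (P @ V)         # Q[s, a]
        V = Q.max(axis=1)
    Q = R[:, None] + gamma * (P @ V)
    logits = beta * Q
    logits -= logits.max(axis=1, keepdims=True)  # subtract per-state max for stability
    pi = np.exp(logits)
    return pi / pi.sum(axis=1, keepdims=True)

# Tiny 3-state, 2-action MDP; the numbers are arbitrary toy choices.
P = np.zeros((3, 2, 3))
P[:, 0, :] = [[1, 0, 0], [1, 0, 0], [0, 1, 0]]   # action 0 drifts "left"
P[:, 1, :] = [[0, 1, 0], [0, 0, 1], [0, 0, 1]]   # action 1 drifts "right"
R = np.array([0.0, 0.0, 1.0])                    # reward only in the last state

pi = boltzmann_policy(P, R)
print(np.round(pi, 3))  # nearly all probability mass on the rewarded direction
```

Optimality corresponds to the $\beta \to \infty$ limit of this model, while causal entropy maximisation replaces the hard max in the value iteration with a soft (log-sum-exp) backup.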
Related papers
- Partial Identifiability in Inverse Reinforcement Learning For Agents With Non-Exponential Discounting [64.13583792391783]
Inverse reinforcement learning (IRL) aims to infer an agent's preferences from observing their behaviour.
One of the central difficulties in IRL is that multiple preferences may lead to the same observed behaviour.
We show that generally IRL is unable to infer enough information about $R$ to identify the correct optimal policy.
arXiv Detail & Related papers (2024-12-15T11:08:58Z)
- Partial Identifiability and Misspecification in Inverse Reinforcement Learning [64.13583792391783]
The aim of Inverse Reinforcement Learning is to infer a reward function $R$ from a policy $\pi$.
This paper provides a comprehensive analysis of partial identifiability and misspecification in IRL.
arXiv Detail & Related papers (2024-11-24T18:35:46Z)
- Inverse decision-making using neural amortized Bayesian actors [19.128377007314317]
We amortize the Bayesian actor using a neural network trained on a wide range of parameter settings in an unsupervised fashion.
We show how our method allows for principled model comparison and how it can be used to disentangle factors that may lead to unidentifiabilities between priors and costs.
arXiv Detail & Related papers (2024-09-04T10:31:35Z)
- Representation Surgery: Theory and Practice of Affine Steering [72.61363182652853]
Language models often exhibit undesirable behavior, e.g., generating toxic or gender-biased text.
One natural (and common) approach to prevent the model from exhibiting undesirable behavior is to steer the model's representations.
This paper investigates the formal and empirical properties of steering functions (a minimal toy sketch of an affine steering function appears after this list).
arXiv Detail & Related papers (2024-02-15T00:20:30Z)
- On the Sensitivity of Reward Inference to Misspecified Human Models [27.94055657571769]
Inferring reward functions from human behavior is at the center of value alignment - aligning AI objectives with what we, humans, actually want.
This raises the question: how accurate do these models need to be in order for the reward inference to be accurate?
We show that it is unfortunately possible to construct small adversarial biases in behavior that lead to arbitrarily large errors in the inferred reward (see the reward-rescaling sketch after this list).
arXiv Detail & Related papers (2022-12-09T08:16:20Z)
- Misspecification in Inverse Reinforcement Learning [80.91536434292328]
The aim of Inverse Reinforcement Learning (IRL) is to infer a reward function $R$ from a policy $\pi$.
One of the primary motivations behind IRL is to infer human preferences from human behaviour.
However, the behavioural models used in IRL are much simpler than actual human behaviour; this means that they are misspecified, which raises the worry that they might lead to unsound inferences if applied to real-world data.
arXiv Detail & Related papers (2022-12-06T18:21:47Z)
- Goal-directed Generation of Discrete Structures with Conditional Generative Models [85.51463588099556]
We introduce a novel approach to directly optimize a reinforcement learning objective, maximizing an expected reward.
We test our methodology on two tasks: generating molecules with user-defined properties and identifying short Python expressions that evaluate to a given target value.
arXiv Detail & Related papers (2020-10-05T20:03:13Z)
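The abstract above argues that the IRL problem is highly sensitive to misspecification, including misspecification of parameter values, and the entry on "On the Sensitivity of Reward Inference to Misspecified Human Models" makes a related point about large errors in the inferred reward. The toy sketch below, reusing the boltzmann_policy function and the transition tensor P from the sketch after the abstract, shows one simple, deterministic facet of this: rescaling and shifting the reward while adjusting the inverse temperature leaves the Boltzmann-rational policy unchanged, so a procedure that assumes the wrong inverse temperature, or ignores constant offsets, can fit the observed behaviour perfectly while returning a very different reward. The specific numbers are arbitrary illustrative choices, not results from any of the papers listed above.

```python
# Reuses P and boltzmann_policy from the sketch following the abstract above.
import numpy as np

R_true = np.array([0.0, 0.0, 1.0])
beta_true = 5.0

# A very different reward: scaled by 10 and shifted by a constant offset of 3.
R_alt = 10.0 * R_true + 3.0
beta_alt = beta_true / 10.0

pi_true = boltzmann_policy(P, R_true, beta=beta_true)
pi_alt = boltzmann_policy(P, R_alt, beta=beta_alt)

# The two policies coincide up to numerical error, even though the rewards differ
# by a factor of 10 plus a constant in every state: behaviour alone cannot tell
# (R_true, beta_true) and (R_alt, beta_alt) apart.
print("max policy difference:", float(np.abs(pi_true - pi_alt).max()))
print("max reward difference:", float(np.abs(R_alt - R_true).max()))
```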
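For the "Representation Surgery" entry above, the following is a minimal, hypothetical sketch of an affine steering function in the sense of $h \mapsto Ah + b$ applied to a hidden-state vector. The dimension, the synthetic activations, and the difference-of-means steering direction are illustrative assumptions rather than the constructions studied in that paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                           # toy hidden-state dimension

# Hypothetical activations for "undesired" vs. "desired" model behaviour.
h_undesired = rng.normal(0.0, 1.0, size=(100, d)) + 1.0
h_desired = rng.normal(0.0, 1.0, size=(100, d)) - 1.0

# A simple choice of steering direction: the difference of the two class means.
v = h_desired.mean(axis=0) - h_undesired.mean(axis=0)

def affine_steer(h, A, b):
    """Affine steering function h -> A h + b applied to a hidden-state vector."""
    return A @ h + b

# Special case: A = I and b = alpha * v, i.e. a translation along the steering
# direction; one of the simplest instances of an affine steering map.
alpha = 0.5
A = np.eye(d)
b = alpha * v

h = rng.normal(0.0, 1.0, size=d) + 1.0           # a new "undesired-looking" activation
steered = affine_steer(h, A, b)

# The projection onto v moves towards the "desired" cluster after steering.
print(float(h @ v), "->", float(steered @ v))
```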