Quantifying the Sensitivity of Inverse Reinforcement Learning to
Misspecification
- URL: http://arxiv.org/abs/2403.06854v1
- Date: Mon, 11 Mar 2024 16:09:39 GMT
- Title: Quantifying the Sensitivity of Inverse Reinforcement Learning to
Misspecification
- Authors: Joar Skalse and Alessandro Abate
- Abstract summary: Inverse reinforcement learning aims to infer an agent's preferences from their behaviour.
To do this, we need a behavioural model of how $\pi$ relates to $R$.
We analyse how sensitive the IRL problem is to misspecification of the behavioural model.
- Score: 72.08225446179783
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inverse reinforcement learning (IRL) aims to infer an agent's preferences
(represented as a reward function $R$) from their behaviour (represented as a
policy $\pi$). To do this, we need a behavioural model of how $\pi$ relates to
$R$. In the current literature, the most common behavioural models are
optimality, Boltzmann-rationality, and causal entropy maximisation. However,
the true relationship between a human's preferences and their behaviour is much
more complex than any of these behavioural models. This means that the
behavioural models are misspecified, which raises the concern that they may
lead to systematic errors if applied to real data. In this paper, we analyse
how sensitive the IRL problem is to misspecification of the behavioural model.
Specifically, we provide necessary and sufficient conditions that completely
characterise how the observed data may differ from the assumed behavioural
model without incurring an error above a given threshold. In addition to this,
we also characterise the conditions under which a behavioural model is robust
to small perturbations of the observed policy, and we analyse how robust many
behavioural models are to misspecification of their parameter values (such as
the discount rate). Our analysis suggests that the IRL problem is highly
sensitive to misspecification, in the sense that very mild misspecification can
lead to very large errors in the inferred reward function.
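As a concrete illustration of what a behavioural model is, the sketch below implements Boltzmann-rationality (one of the three models named in the abstract) on a tiny tabular MDP: the observed policy $\pi$ is a softmax, with inverse temperature $\beta$, over the optimal Q-values of the reward $R$. The example MDP, $\beta$, and discount rate are assumptions of this sketch rather than details from the paper; it shows only the forward map from $R$ to $\pi$ that IRL must invert and whose misspecification the paper analyses.

```python
# Minimal sketch of the Boltzmann-rational behavioural model: pi(a|s) is
# proportional to exp(beta * Q_R(s, a)), where Q_R is the optimal Q-function of
# reward R. The tabular MDP, beta, and gamma below are illustrative assumptions.
import numpy as np

def q_values(R, P, gamma=0.9, iters=500):
    """Standard Q-iteration for a tabular MDP.
    R: (S, A) reward array, P: (S, A, S) transition probabilities."""
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = Q.max(axis=1)            # greedy state values
        Q = R + gamma * (P @ V)      # Bellman backup
    return Q

def boltzmann_policy(R, P, beta=5.0, gamma=0.9):
    """Behavioural model: softmax over Q-values with inverse temperature beta."""
    Q = q_values(R, P, gamma)
    logits = beta * Q
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

# Tiny 2-state, 2-action example (hypothetical numbers).
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.8, 0.2], [0.2, 0.8]]])
R = np.array([[1.0, 0.0],
              [0.0, 0.5]])
pi = boltzmann_policy(R, P)
print(pi)  # the observed policy that an IRL algorithm would try to invert
```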
Related papers
- Uncertainty-aware Human Mobility Modeling and Anomaly Detection [28.311683535974634]
We study how to model human agents' mobility behavior toward effective anomaly detection.
We use GPS data as a sequence of stay-point events, each with a set of characterizing temporal features.
Experiments on large expert-simulated datasets with tens of thousands of agents demonstrate the effectiveness of our model.
arXiv Detail & Related papers (2024-10-02T06:57:08Z)
- Inverse decision-making using neural amortized Bayesian actors [19.128377007314317]
We amortize the Bayesian actor using a neural network trained on a wide range of different parameter settings in an unsupervised fashion.
We show that the inferred posterior distributions are in close alignment with those obtained using analytical solutions where they exist.
We then show that identifiability problems between priors and costs can arise in more complex cost functions.
arXiv Detail & Related papers (2024-09-04T10:31:35Z)
- Representation Surgery: Theory and Practice of Affine Steering [72.61363182652853]
Language models often exhibit undesirable behavior, e.g., generating toxic or gender-biased text.
One natural (and common) approach to prevent the model from exhibiting undesirable behavior is to steer the model's representations.
This paper investigates the formal and empirical properties of steering functions.
arXiv Detail & Related papers (2024-02-15T00:20:30Z)
- Robust Counterfactual Explanations for Neural Networks With Probabilistic Guarantees [11.841312820944774]
We propose a measure -- that we call $\textit{Stability}$ -- to quantify the robustness of counterfactuals to potential model changes for differentiable models.
Our main contribution is to show that counterfactuals with a sufficiently high value of $\textit{Stability}$ will remain valid after potential model changes with high probability.
arXiv Detail & Related papers (2023-05-19T20:48:05Z)
- On the Sensitivity of Reward Inference to Misspecified Human Models [27.94055657571769]
Inferring reward functions from human behavior is at the center of value alignment - aligning AI objectives with what we, humans, actually want.
This raises the question: how accurate do these models need to be in order for the reward inference to be accurate?
We show that it is unfortunately possible to construct small adversarial biases in behavior that lead to arbitrarily large errors in the inferred reward.
arXiv Detail & Related papers (2022-12-09T08:16:20Z)
- Misspecification in Inverse Reinforcement Learning [80.91536434292328]
The aim of Inverse Reinforcement Learning (IRL) is to infer a reward function $R$ from a policy $\pi$.
One of the primary motivations behind IRL is to infer human preferences from human behaviour.
However, the standard behavioural models relating $\pi$ to $R$ are simpler than real human behaviour, which means they are misspecified; this raises the worry that they might lead to unsound inferences if applied to real-world data.
arXiv Detail & Related papers (2022-12-06T18:21:47Z)
- Estimation of Bivariate Structural Causal Models by Variational Gaussian Process Regression Under Likelihoods Parametrised by Normalising Flows [74.85071867225533]
Causal mechanisms can be described by structural causal models.
One major drawback of state-of-the-art artificial intelligence is its lack of explainability.
arXiv Detail & Related papers (2021-09-06T14:52:58Z)
- To what extent do human explanations of model behavior align with actual model behavior? [91.67905128825402]
We investigated the extent to which human-generated explanations of models' inference decisions align with how models actually make these decisions.
We defined two alignment metrics that quantify how well natural language human explanations align with model sensitivity to input words.
We find that a model's alignment with human explanations is not predicted by the model's accuracy on NLI.
arXiv Detail & Related papers (2020-12-24T17:40:06Z)
- Goal-directed Generation of Discrete Structures with Conditional Generative Models [85.51463588099556]
We introduce a novel approach to directly optimize a reinforcement learning objective, maximizing an expected reward.
We test our methodology on two tasks: generating molecules with user-defined properties and identifying short Python expressions which evaluate to a given target value.
arXiv Detail & Related papers (2020-10-05T20:03:13Z)
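The last related entry above describes directly optimising a reinforcement learning objective, i.e. maximising an expected reward over generated discrete structures. As a generic illustration of that kind of objective (not the cited paper's method), the sketch below maximises the expected value of a toy reward over short token sequences using the score-function (REINFORCE) gradient estimator; the vocabulary, toy reward, and hyperparameters are all assumptions of this sketch.

```python
# Generic sketch of maximising an expected reward over discrete structures with
# the score-function (REINFORCE) estimator. Illustrative only: the factorised
# categorical generator and the toy reward are assumptions, not the paper's method.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, seq_len = 5, 4
logits = np.zeros((seq_len, vocab_size))   # parameters of a factorised generator

def reward(seq):
    """Toy reward: +1 per occurrence of token 0 (stand-in for a user-defined property)."""
    return float(np.sum(seq == 0))

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

lr, batch = 0.1, 64
for step in range(200):
    probs = softmax(logits)                                   # (seq_len, vocab_size)
    # Sample a batch of discrete structures, one position at a time.
    seqs = np.stack([rng.choice(vocab_size, size=batch, p=probs[t])
                     for t in range(seq_len)], axis=1)
    rewards = np.array([reward(s) for s in seqs])
    baseline = rewards.mean()                                 # simple variance reduction
    grad = np.zeros_like(logits)
    for s, r in zip(seqs, rewards):
        for t, tok in enumerate(s):
            # Gradient of log p(tok at position t) w.r.t. logits: one_hot(tok) - probs[t]
            g = -probs[t].copy()
            g[tok] += 1.0
            grad[t] += (r - baseline) * g
    logits += lr * grad / batch                               # gradient ascent on E[reward]

print(softmax(logits)[:, 0])  # probability of the rewarded token at each position
```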
This list is automatically generated from the titles and abstracts of the papers on this site.