Quantifying the Sensitivity of Inverse Reinforcement Learning to
Misspecification
- URL: http://arxiv.org/abs/2403.06854v1
- Date: Mon, 11 Mar 2024 16:09:39 GMT
- Title: Quantifying the Sensitivity of Inverse Reinforcement Learning to
Misspecification
- Authors: Joar Skalse and Alessandro Abate
- Abstract summary: Inverse reinforcement learning aims to infer an agent's preferences from their behaviour.
To do this, we need a behavioural model of how $\pi$ relates to $R$.
We analyse how sensitive the IRL problem is to misspecification of the behavioural model.
- Score: 72.08225446179783
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inverse reinforcement learning (IRL) aims to infer an agent's preferences
(represented as a reward function $R$) from their behaviour (represented as a
policy $\pi$). To do this, we need a behavioural model of how $\pi$ relates to
$R$. In the current literature, the most common behavioural models are
optimality, Boltzmann-rationality, and causal entropy maximisation. However,
the true relationship between a human's preferences and their behaviour is much
more complex than any of these behavioural models. This means that the
behavioural models are misspecified, which raises the concern that they may
lead to systematic errors if applied to real data. In this paper, we analyse
how sensitive the IRL problem is to misspecification of the behavioural model.
Specifically, we provide necessary and sufficient conditions that completely
characterise how the observed data may differ from the assumed behavioural
model without incurring an error above a given threshold. In addition to this,
we also characterise the conditions under which a behavioural model is robust
to small perturbations of the observed policy, and we analyse how robust many
behavioural models are to misspecification of their parameter values (such as
the discount rate). Our analysis suggests that the IRL problem is highly
sensitive to misspecification, in the sense that very mild misspecification can
lead to very large errors in the inferred reward function.
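As a concrete illustration of the behavioural models named above, the following is a minimal Python sketch of the Boltzmann-rational model, under which actions are chosen with probability proportional to $\exp(\beta Q^*(s,a))$ for the optimal Q-function $Q^*$. The toy MDP, the state-based reward, and the inverse temperature $\beta$ are illustrative assumptions, not the paper's formal setup.

```python
import numpy as np

def boltzmann_policy(P, R, beta=5.0, gamma=0.9, iters=500):
    """Boltzmann-rational behavioural model: pi(a|s) is proportional to
    exp(beta * Q*(s, a)), where Q* is the optimal Q-function of the MDP.

    P: transition tensor of shape (S, A, S); R: state reward vector of shape (S,).
    Everything here is a toy illustration, not the paper's formal setup.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(iters):                       # standard value iteration
        Q = R[:, None] + gamma * (P @ V)         # Q[s, a]
        V = Q.max(axis=1)
    Q = R[:, None] + gamma * (P @ V)
    logits = beta * Q
    logits -= logits.max(axis=1, keepdims=True)  # subtract per-state max for stability
    pi = np.exp(logits)
    return pi / pi.sum(axis=1, keepdims=True)

# Tiny 3-state, 2-action MDP; the numbers are arbitrary toy choices.
P = np.zeros((3, 2, 3))
P[:, 0, :] = [[1, 0, 0], [1, 0, 0], [0, 1, 0]]   # action 0 drifts "left"
P[:, 1, :] = [[0, 1, 0], [0, 0, 1], [0, 0, 1]]   # action 1 drifts "right"
R = np.array([0.0, 0.0, 1.0])                    # reward only in the last state

pi = boltzmann_policy(P, R)
print(np.round(pi, 3))  # nearly all probability mass on the rewarded direction
```

Optimality corresponds to the $\beta \to \infty$ limit of this model, while causal entropy maximisation replaces the hard max in the value iteration with a soft (log-sum-exp) backup.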
Related papers
- Partial Identifiability in Inverse Reinforcement Learning For Agents With Non-Exponential Discounting [64.13583792391783]
Inverse reinforcement learning (IRL) aims to infer an agent's preferences from observing their behaviour.
One of the central difficulties in IRL is that multiple preferences may lead to the same observed behaviour.
We show that generally IRL is unable to infer enough information about $R$ to identify the correct optimal policy.
arXiv Detail & Related papers (2024-12-15T11:08:58Z)
- Partial Identifiability and Misspecification in Inverse Reinforcement Learning [64.13583792391783]
The aim of Inverse Reinforcement Learning is to infer a reward function $R$ from a policy $\pi$.
This paper provides a comprehensive analysis of partial identifiability and misspecification in IRL.
arXiv Detail & Related papers (2024-11-24T18:35:46Z)
- Inverse decision-making using neural amortized Bayesian actors [19.128377007314317]
We amortize the Bayesian actor using a neural network trained on a wide range of parameter settings in an unsupervised fashion.
We show how our method allows for principled model comparison and how it can be used to disentangle factors that may lead to unidentifiabilities between priors and costs.
arXiv Detail & Related papers (2024-09-04T10:31:35Z)
- Representation Surgery: Theory and Practice of Affine Steering [72.61363182652853]
Language models often exhibit undesirable behavior, e.g., generating toxic or gender-biased text.
One natural (and common) approach to prevent the model from exhibiting undesirable behavior is to steer the model's representations.
This paper investigates the formal and empirical properties of steering functions (a minimal toy sketch of an affine steering function appears after this list).
arXiv Detail & Related papers (2024-02-15T00:20:30Z)
- On the Sensitivity of Reward Inference to Misspecified Human Models [27.94055657571769]
Inferring reward functions from human behavior is at the center of value alignment - aligning AI objectives with what we, humans, actually want.
This raises the question: how accurate do these models need to be in order for the reward inference to be accurate?
We show that it is unfortunately possible to construct small adversarial biases in behavior that lead to arbitrarily large errors in the inferred reward (see the reward-rescaling sketch after this list).
arXiv Detail & Related papers (2022-12-09T08:16:20Z)
- Misspecification in Inverse Reinforcement Learning [80.91536434292328]
The aim of Inverse Reinforcement Learning (IRL) is to infer a reward function $R$ from a policy $\pi$.
One of the primary motivations behind IRL is to infer human preferences from human behaviour.
However, the behavioural models used in IRL are much simpler than actual human behaviour; this means that they are misspecified, which raises the worry that they might lead to unsound inferences if applied to real-world data.
arXiv Detail & Related papers (2022-12-06T18:21:47Z)
- Goal-directed Generation of Discrete Structures with Conditional Generative Models [85.51463588099556]
We introduce a novel approach to directly optimize a reinforcement learning objective, maximizing an expected reward.
We test our methodology on two tasks: generating molecules with user-defined properties and identifying short Python expressions that evaluate to a given target value.
arXiv Detail & Related papers (2020-10-05T20:03:13Z)
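The abstract above argues that the IRL problem is highly sensitive to misspecification, including misspecification of parameter values, and the entry on "On the Sensitivity of Reward Inference to Misspecified Human Models" makes a related point about large errors in the inferred reward. The toy sketch below, reusing the boltzmann_policy function and the transition tensor P from the sketch after the abstract, shows one simple, deterministic facet of this: rescaling and shifting the reward while adjusting the inverse temperature leaves the Boltzmann-rational policy unchanged, so a procedure that assumes the wrong inverse temperature, or ignores constant offsets, can fit the observed behaviour perfectly while returning a very different reward. The specific numbers are arbitrary illustrative choices, not results from any of the papers listed above.

```python
# Reuses P and boltzmann_policy from the sketch following the abstract above.
import numpy as np

R_true = np.array([0.0, 0.0, 1.0])
beta_true = 5.0

# A very different reward: scaled by 10 and shifted by a constant offset of 3.
R_alt = 10.0 * R_true + 3.0
beta_alt = beta_true / 10.0

pi_true = boltzmann_policy(P, R_true, beta=beta_true)
pi_alt = boltzmann_policy(P, R_alt, beta=beta_alt)

# The two policies coincide up to numerical error, even though the rewards differ
# by a factor of 10 plus a constant in every state: behaviour alone cannot tell
# (R_true, beta_true) and (R_alt, beta_alt) apart.
print("max policy difference:", float(np.abs(pi_true - pi_alt).max()))
print("max reward difference:", float(np.abs(R_alt - R_true).max()))
```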
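For the "Representation Surgery" entry above, the following is a minimal, hypothetical sketch of an affine steering function in the sense of $h \mapsto Ah + b$ applied to a hidden-state vector. The dimension, the synthetic activations, and the difference-of-means steering direction are illustrative assumptions rather than the constructions studied in that paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                                           # toy hidden-state dimension

# Hypothetical activations for "undesired" vs. "desired" model behaviour.
h_undesired = rng.normal(0.0, 1.0, size=(100, d)) + 1.0
h_desired = rng.normal(0.0, 1.0, size=(100, d)) - 1.0

# A simple choice of steering direction: the difference of the two class means.
v = h_desired.mean(axis=0) - h_undesired.mean(axis=0)

def affine_steer(h, A, b):
    """Affine steering function h -> A h + b applied to a hidden-state vector."""
    return A @ h + b

# Special case: A = I and b = alpha * v, i.e. a translation along the steering
# direction; one of the simplest instances of an affine steering map.
alpha = 0.5
A = np.eye(d)
b = alpha * v

h = rng.normal(0.0, 1.0, size=d) + 1.0           # a new "undesired-looking" activation
steered = affine_steer(h, A, b)

# The projection onto v moves towards the "desired" cluster after steering.
print(float(h @ v), "->", float(steered @ v))
```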