On the Sensitivity of Reward Inference to Misspecified Human Models
- URL: http://arxiv.org/abs/2212.04717v2
- Date: Mon, 30 Oct 2023 05:01:12 GMT
- Title: On the Sensitivity of Reward Inference to Misspecified Human Models
- Authors: Joey Hong and Kush Bhatia and Anca Dragan
- Abstract summary: Inferring reward functions from human behavior is at the center of value alignment - aligning AI objectives with what we, humans, actually want.
This begs the question: how accurate do these models need to be in order for the reward inference to be accurate?
We show that it is unfortunately possible to construct small adversarial biases in behavior that lead to arbitrarily large errors in the inferred reward.
- Score: 27.94055657571769
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Inferring reward functions from human behavior is at the center of value
alignment - aligning AI objectives with what we, humans, actually want. But
doing so relies on models of how humans behave given their objectives. After
decades of research in cognitive science, neuroscience, and behavioral
economics, obtaining accurate human models remains an open research topic. This
begs the question: how accurate do these models need to be in order for the
reward inference to be accurate? On the one hand, if small errors in the model
can lead to catastrophic error in inference, the entire framework of reward
learning seems ill-fated, as we will never have perfect models of human
behavior. On the other hand, if as our models improve, we can have a guarantee
that reward accuracy also improves, this would show the benefit of more work on
the modeling side. We study this question both theoretically and empirically.
We do show that it is unfortunately possible to construct small adversarial
biases in behavior that lead to arbitrarily large errors in the inferred
reward. However, and arguably more importantly, we are also able to identify
reasonable assumptions under which the reward inference error can be bounded
linearly in the error in the human model. Finally, we verify our theoretical
insights in discrete and continuous control tasks with simulated and human
data.
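To ground the setup, here is a minimal sketch (Python with NumPy/SciPy, not the authors' code) of reward inference under a Boltzmann-rational human model in a toy discrete-choice problem. The simulated human acts with rationality coefficient `beta_true`, while inference assumes `beta_model`; the toy setting and all names are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

rng = np.random.default_rng(0)

# Toy setting: the human picks one of K options; option k has feature vector phi[k],
# and the true reward of picking it is theta_true @ phi[k].
K, D = 5, 3
phi = rng.normal(size=(K, D))
theta_true = np.array([1.0, -0.5, 0.3])

def boltzmann_policy(theta, beta):
    """P(choice = k) proportional to exp(beta * theta @ phi[k])."""
    logits = beta * phi @ theta
    return np.exp(logits - logsumexp(logits))

# Simulated human: Boltzmann-rational with rationality coefficient beta_true.
beta_true = 2.0
choices = rng.choice(K, size=2000, p=boltzmann_policy(theta_true, beta_true))

def infer_reward(choices, beta_model):
    """Maximum-likelihood reward weights under an assumed (possibly wrong) beta."""
    counts = np.bincount(choices, minlength=K) / len(choices)
    def neg_avg_log_lik(theta):
        logits = beta_model * phi @ theta
        return -counts @ (logits - logsumexp(logits))
    return minimize(neg_avg_log_lik, np.zeros(D)).x

theta_good = infer_reward(choices, beta_model=beta_true)  # well-specified human model
theta_bad = infer_reward(choices, beta_model=0.2)         # human assumed far noisier than they are

print("true weights       :", theta_true)
print("well-specified MLE :", np.round(theta_good, 2))
print("misspecified MLE   :", np.round(theta_bad, 2))
```

Because the likelihood depends only on the product `beta_model * theta`, the misspecified model in this sketch simply rescales the inferred weights by roughly `beta_true / beta_model`. The paper's broader point is that more structured biases in behavior can push the inferred reward arbitrarily far from the truth, whereas under reasonable assumptions the reward error grows only linearly with the error in the human model.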
Related papers
- Can Language Models Learn to Skip Steps? [59.84848399905409]
We study the ability to skip steps in reasoning.
Unlike humans, who may skip steps to enhance efficiency or to reduce cognitive load, models do not possess such motivations.
Our work presents the first exploration into human-like step-skipping ability.
arXiv Detail & Related papers (2024-11-04T07:10:24Z)
- Quantifying the Sensitivity of Inverse Reinforcement Learning to Misspecification [72.08225446179783]
Inverse reinforcement learning aims to infer an agent's preferences from their behaviour.
To do this, we need a behavioural model of how $\pi$ relates to $R$.
We analyse how sensitive the IRL problem is to misspecification of the behavioural model.
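A standard choice of behavioural model (an illustrative example, not necessarily the paper's exact formulation) is Boltzmann rationality, which maps a reward $R$ to a policy through the optimal $Q$-function; misspecification then means the human's actual policy is not the one the model assigns to their true reward:
$$
\pi_R(a \mid s) = \frac{\exp\!\big(\beta\, Q^*_R(s,a)\big)}{\sum_{a'} \exp\!\big(\beta\, Q^*_R(s,a')\big)},
\qquad
\pi_{\text{human}} \neq \pi_{R_{\text{true}}}.
$$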
arXiv Detail & Related papers (2024-03-11T16:09:39Z)
- Misspecification in Inverse Reinforcement Learning [80.91536434292328]
The aim of Inverse Reinforcement Learning (IRL) is to infer a reward function $R$ from a policy $\pi$.
One of the primary motivations behind IRL is to infer human preferences from human behaviour.
However, the behavioural models currently used in IRL are far simpler than actual human behaviour. This means that they are misspecified, which raises the worry that they might lead to unsound inferences if applied to real-world data.
arXiv Detail & Related papers (2022-12-06T18:21:47Z)
- Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z)
- Human irrationality: both bad and good for reward inference [3.706222947143855]
This work aims to better understand the effect irrationalities can have on reward inference.
We operationalize irrationality in the language of MDPs, by altering the Bellman optimality equation.
We show that an irrational human, when correctly modelled, can communicate more information about the reward than a perfectly rational human can.
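For reference, the standard Bellman optimality equation being altered is
$$
V^*(s) = \max_{a}\Big[\, r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \,\Big];
$$
illustrative alterations (assumed examples, not an exhaustive list from the paper) include replacing the $\max$ with a softmax to model noisy decision-making, or shrinking the discount factor $\gamma$ to model myopia.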
arXiv Detail & Related papers (2021-11-12T21:44:15Z)
- Modeling the Mistakes of Boundedly Rational Agents Within a Bayesian Theory of Mind [32.66203057545608]
We extend the Bayesian Theory of Mind framework to model boundedly rational agents who may have mistaken goals, plans, and actions.
We present experiments eliciting human goal inferences in two domains: (i) a gridworld puzzle with gems locked behind doors, and (ii) a block-stacking domain.
arXiv Detail & Related papers (2021-06-24T18:00:03Z)
- Measuring Massive Multitask Language Understanding [79.6985576698597]
The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
The largest GPT-3 model improves over random chance by almost 20 percentage points on average.
Models also have lopsided performance and frequently do not know when they are wrong.
arXiv Detail & Related papers (2020-09-07T17:59:25Z)
- Are Visual Explanations Useful? A Case Study in Model-in-the-Loop Prediction [49.254162397086006]
We study explanations based on visual saliency in an image-based age prediction task.
We find that presenting model predictions improves human accuracy.
However, explanations of various kinds fail to significantly alter human accuracy or trust in the model.
arXiv Detail & Related papers (2020-07-23T20:39:40Z)
- LESS is More: Rethinking Probabilistic Models of Human Behavior [36.020541093946925]
The Boltzmann noisily-rational decision model assumes people approximately optimize a reward function.
Human trajectories lie in a continuous space, with continuous-valued features that influence the reward function.
We introduce a model that explicitly accounts for distances between trajectories, rather than only their rewards.
arXiv Detail & Related papers (2020-01-13T18:59:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.