On the Sensitivity of Reward Inference to Misspecified Human Models
- URL: http://arxiv.org/abs/2212.04717v2
- Date: Mon, 30 Oct 2023 05:01:12 GMT
- Title: On the Sensitivity of Reward Inference to Misspecified Human Models
- Authors: Joey Hong and Kush Bhatia and Anca Dragan
- Abstract summary: Inferring reward functions from human behavior is at the center of value alignment - aligning AI objectives with what we, humans, actually want.
This begs the question: how accurate do these models need to be in order for the reward inference to be accurate?
We show that it is unfortunately possible to construct small adversarial biases in behavior that lead to arbitrarily large errors in the inferred reward.
- Score: 27.94055657571769
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Inferring reward functions from human behavior is at the center of value
alignment - aligning AI objectives with what we, humans, actually want. But
doing so relies on models of how humans behave given their objectives. After
decades of research in cognitive science, neuroscience, and behavioral
economics, obtaining accurate human models remains an open research topic. This
begs the question: how accurate do these models need to be in order for the
reward inference to be accurate? On the one hand, if small errors in the model
can lead to catastrophic error in inference, the entire framework of reward
learning seems ill-fated, as we will never have perfect models of human
behavior. On the other hand, if as our models improve, we can have a guarantee
that reward accuracy also improves, this would show the benefit of more work on
the modeling side. We study this question both theoretically and empirically.
We do show that it is unfortunately possible to construct small adversarial
biases in behavior that lead to arbitrarily large errors in the inferred
reward. However, and arguably more importantly, we are also able to identify
reasonable assumptions under which the reward inference error can be bounded
linearly in the error in the human model. Finally, we verify our theoretical
insights in discrete and continuous control tasks with simulated and human
data.
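To ground the setup, here is a minimal sketch (Python with NumPy/SciPy, not the authors' code) of reward inference under a Boltzmann-rational human model in a toy discrete-choice problem. The simulated human acts with rationality coefficient `beta_true`, while inference assumes `beta_model`; the toy setting and all names are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

rng = np.random.default_rng(0)

# Toy setting: the human picks one of K options; option k has feature vector phi[k],
# and the true reward of picking it is theta_true @ phi[k].
K, D = 5, 3
phi = rng.normal(size=(K, D))
theta_true = np.array([1.0, -0.5, 0.3])

def boltzmann_policy(theta, beta):
    """P(choice = k) proportional to exp(beta * theta @ phi[k])."""
    logits = beta * phi @ theta
    return np.exp(logits - logsumexp(logits))

# Simulated human: Boltzmann-rational with rationality coefficient beta_true.
beta_true = 2.0
choices = rng.choice(K, size=2000, p=boltzmann_policy(theta_true, beta_true))

def infer_reward(choices, beta_model):
    """Maximum-likelihood reward weights under an assumed (possibly wrong) beta."""
    counts = np.bincount(choices, minlength=K) / len(choices)
    def neg_avg_log_lik(theta):
        logits = beta_model * phi @ theta
        return -counts @ (logits - logsumexp(logits))
    return minimize(neg_avg_log_lik, np.zeros(D)).x

theta_good = infer_reward(choices, beta_model=beta_true)  # well-specified human model
theta_bad = infer_reward(choices, beta_model=0.2)         # human assumed far noisier than they are

print("true weights       :", theta_true)
print("well-specified MLE :", np.round(theta_good, 2))
print("misspecified MLE   :", np.round(theta_bad, 2))
```

Because the likelihood depends only on the product `beta_model * theta`, the misspecified model in this sketch simply rescales the inferred weights by roughly `beta_true / beta_model`. The paper's broader point is that more structured biases in behavior can push the inferred reward arbitrarily far from the truth, whereas under reasonable assumptions the reward error grows only linearly with the error in the human model.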
Related papers
- Can Language Models Learn to Skip Steps? [59.84848399905409]
We study the ability to skip steps in reasoning.
Unlike humans, who may skip steps to enhance efficiency or to reduce cognitive load, models do not possess such motivations.
Our work presents the first exploration into human-like step-skipping ability.
arXiv Detail & Related papers (2024-11-04T07:10:24Z)
- Quantifying the Sensitivity of Inverse Reinforcement Learning to Misspecification [72.08225446179783]
Inverse reinforcement learning aims to infer an agent's preferences from their behaviour.
To do this, we need a behavioural model of how $\pi$ relates to $R$.
We analyse how sensitive the IRL problem is to misspecification of the behavioural model.
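A standard choice of behavioural model (an illustrative example, not necessarily the paper's exact formulation) is Boltzmann rationality, which maps a reward $R$ to a policy through the optimal $Q$-function; misspecification then means the human's actual policy is not the one the model assigns to their true reward:
$$
\pi_R(a \mid s) = \frac{\exp\!\big(\beta\, Q^*_R(s,a)\big)}{\sum_{a'} \exp\!\big(\beta\, Q^*_R(s,a')\big)},
\qquad
\pi_{\text{human}} \neq \pi_{R_{\text{true}}}.
$$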
arXiv Detail & Related papers (2024-03-11T16:09:39Z)
- Misspecification in Inverse Reinforcement Learning [80.91536434292328]
The aim of Inverse Reinforcement Learning (IRL) is to infer a reward function $R$ from a policy $\pi$.
One of the primary motivations behind IRL is to infer human preferences from human behaviour.
However, the behavioural models currently used in IRL are far simpler than actual human behaviour. This means that they are misspecified, which raises the worry that they might lead to unsound inferences if applied to real-world data.
arXiv Detail & Related papers (2022-12-06T18:21:47Z)
- Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z)
- Human irrationality: both bad and good for reward inference [3.706222947143855]
This work aims to better understand the effect irrationalities can have on reward inference.
We operationalize irrationality in the language of MDPs, by altering the Bellman optimality equation.
We show that an irrational human, when correctly modelled, can communicate more information about the reward than a perfectly rational human can.
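For reference, the standard Bellman optimality equation being altered is
$$
V^*(s) = \max_{a}\Big[\, r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \,\Big];
$$
illustrative alterations (assumed examples, not an exhaustive list from the paper) include replacing the $\max$ with a softmax to model noisy decision-making, or shrinking the discount factor $\gamma$ to model myopia.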
arXiv Detail & Related papers (2021-11-12T21:44:15Z)
- Modeling the Mistakes of Boundedly Rational Agents Within a Bayesian Theory of Mind [32.66203057545608]
We extend the Bayesian Theory of Mind framework to model boundedly rational agents who may have mistaken goals, plans, and actions.
We present experiments eliciting human goal inferences in two domains: (i) a gridworld puzzle with gems locked behind doors, and (ii) a block-stacking domain.
arXiv Detail & Related papers (2021-06-24T18:00:03Z)
- Measuring Massive Multitask Language Understanding [79.6985576698597]
The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
The largest GPT-3 model improves over random chance by almost 20 percentage points on average.
Models also have lopsided performance and frequently do not know when they are wrong.
arXiv Detail & Related papers (2020-09-07T17:59:25Z)
- Are Visual Explanations Useful? A Case Study in Model-in-the-Loop Prediction [49.254162397086006]
We study explanations based on visual saliency in an image-based age prediction task.
We find that presenting model predictions improves human accuracy.
However, explanations of various kinds fail to significantly alter human accuracy or trust in the model.
arXiv Detail & Related papers (2020-07-23T20:39:40Z)
- LESS is More: Rethinking Probabilistic Models of Human Behavior [36.020541093946925]
The Boltzmann noisily-rational decision model assumes people approximately optimize a reward function.
Human trajectories lie in a continuous space, with continuous-valued features that influence the reward function.
We introduce a model that explicitly accounts for distances between trajectories, rather than only their rewards.
arXiv Detail & Related papers (2020-01-13T18:59:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.