Choice Set Misspecification in Reward Inference
- URL: http://arxiv.org/abs/2101.07691v1
- Date: Tue, 19 Jan 2021 15:35:30 GMT
- Title: Choice Set Misspecification in Reward Inference
- Authors: Rachel Freedman, Rohin Shah and Anca Dragan
- Abstract summary: A promising alternative to manually specifying reward functions is to enable robots to infer them from human feedback.
In this work, we introduce the idea that the choice set itself might be difficult to specify, and analyze choice set misspecification.
We propose a classification of different kinds of choice set misspecification, and show that these different classes lead to meaningful differences in the inferred reward.
- Score: 14.861109950708999
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Specifying reward functions for robots that operate in environments without a
natural reward signal can be challenging, and incorrectly specified rewards can
incentivise degenerate or dangerous behavior. A promising alternative to
manually specifying reward functions is to enable robots to infer them from
human feedback, like demonstrations or corrections. To interpret this feedback,
robots treat as approximately optimal a choice the person makes from a choice
set, like the set of possible trajectories they could have demonstrated or
possible corrections they could have made. In this work, we introduce the idea
that the choice set itself might be difficult to specify, and analyze choice
set misspecification: what happens as the robot makes incorrect assumptions
about the set of choices from which the human selects their feedback. We
propose a classification of different kinds of choice set misspecification, and
show that these different classes lead to meaningful differences in the
inferred reward and resulting performance. While we would normally expect
misspecification to hurt, we find that certain kinds of misspecification are
neither helpful nor harmful (in expectation). However, in other situations,
misspecification can be extremely harmful, leading the robot to believe the
opposite of what it should believe. We hope our results will allow for better
prediction and response to the effects of misspecification in real-world reward
inference.
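To make the inference model concrete, here is a minimal Python sketch of Bayesian reward inference from a single observed choice under the standard Boltzmann-rational model. This is an illustrative toy, not the authors' code: the rationality coefficient BETA, the linear reward, the 2-d feature vectors, and the function names (boltzmann_likelihood, reward_posterior) are all assumptions made for the example. It shows how the posterior over reward parameters shifts when the robot assumes a choice set that differs from the one the human actually chose from.

import numpy as np

# Minimal, hypothetical sketch (not the paper's implementation) of Bayesian reward
# inference from a single human choice under the Boltzmann-rational model:
#   P(x | theta, C) = exp(beta * R_theta(x)) / sum over x' in C of exp(beta * R_theta(x'))
# Because the likelihood is normalized over the choice set C, the posterior over
# theta depends on which C the robot assumes the human chose from.

BETA = 5.0  # assumed rationality coefficient (illustrative value)

def boltzmann_likelihood(theta, chosen_idx, choice_set, beta=BETA):
    """Probability of the chosen option under a linear reward R_theta(x) = theta . x."""
    scores = beta * np.array([np.dot(theta, x) for x in choice_set])
    scores -= scores.max()  # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[chosen_idx]

def reward_posterior(thetas, chosen_idx, assumed_choice_set):
    """Posterior over a discrete grid of candidate reward parameters (uniform prior)."""
    lik = np.array([boltzmann_likelihood(t, chosen_idx, assumed_choice_set) for t in thetas])
    return lik / lik.sum()

# Toy example: options are 2-d feature vectors; each candidate reward weights one feature.
thetas = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]

d = np.array([0.5, 0.5])  # the option the human actually chose
a = np.array([1.0, 0.0])  # an alternative the human could also have chosen
b = np.array([0.0, 1.0])  # an alternative the human could NOT actually have chosen

true_set = [d, a]          # the human's real choice set
robot_set = [d, a, b]      # the robot's misspecified assumption (b was never available)

print("posterior under the true choice set:        ", reward_posterior(thetas, 0, true_set))
print("posterior under the misspecified choice set:", reward_posterior(thetas, 0, robot_set))
# With the true set, choosing d over a is strong evidence for theta = [0, 1];
# with the phantom option b included, that evidence largely washes out.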
Related papers
- Conformal Prediction Sets Can Cause Disparate Impact [4.61590049339329]
Conformal prediction is a promising method for quantifying the uncertainty of machine learning models.
We show that providing prediction sets to human decision-makers can increase the unfairness of their decisions.
Instead of equalizing coverage, we propose to equalize set sizes across groups which empirically leads to more fair outcomes.
arXiv Detail & Related papers (2024-10-02T18:00:01Z)
- REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world.
Current methods to mitigate this misalignment work by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z)
- Helping or Herding? Reward Model Ensembles Mitigate but do not Eliminate Reward Hacking [62.146953368613815]
Reward models play a key role in aligning language model applications towards human preferences.
A natural mitigation is to train an ensemble of reward models, aggregating over model outputs to obtain a more robust reward estimate.
We show that reward ensembles do not eliminate reward hacking because all reward models in the ensemble exhibit similar error patterns.
arXiv Detail & Related papers (2023-12-14T18:59:04Z)
- What Matters to You? Towards Visual Representation Alignment for Robot Learning [81.30964736676103]
When operating in service of people, robots need to optimize rewards aligned with end-user preferences.
We propose Representation-Aligned Preference-based Learning (RAPL), a method for solving the visual representation alignment problem.
arXiv Detail & Related papers (2023-10-11T23:04:07Z)
- Models of human preference for learning reward functions [80.39289349661364]
We learn the reward function from human-generated preferences between pairs of trajectory segments.
We find the common assumption that preferences are based on each segment's partial return to be flawed, and propose modeling human preferences as informed by each segment's regret.
Our proposed regret preference model better predicts real human preferences and learns reward functions from these preferences that lead to policies better aligned with human intent.
arXiv Detail & Related papers (2022-06-05T17:58:02Z)
- Causal Confusion and Reward Misidentification in Preference-Based Reward Learning [33.944367978407904]
We study causal confusion and reward misidentification when learning from preferences.
We find that the presence of non-causal distractor features, noise in the stated preferences, and partial state observability can all exacerbate reward misidentification.
arXiv Detail & Related papers (2022-04-13T18:41:41Z)
- Correcting Robot Plans with Natural Language Feedback [88.92824527743105]
We explore natural language as an expressive and flexible tool for robot correction.
We show that these transformations enable users to correct goals, update robot motions, and recover from planning errors.
Our method makes it possible to compose multiple constraints and generalizes to unseen scenes, objects, and sentences in simulated environments and real-world environments.
arXiv Detail & Related papers (2022-04-11T15:22:43Z)
- Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z)
- Understanding Prediction Discrepancies in Machine Learning Classifiers [4.940323406667406]
This paper proposes to analyze the prediction discrepancies in a pool of best-performing models trained on the same data.
A model-agnostic algorithm, DIG, is proposed to capture and explain discrepancies locally.
arXiv Detail & Related papers (2021-04-12T13:42:50Z)
- Reward-rational (implicit) choice: A unifying formalism for reward learning [35.57436895497646]
Researchers have aimed to learn reward functions from human behavior or feedback.
The types of behavior interpreted as evidence of the reward function have expanded greatly in recent years.
How will a robot make sense of all these diverse types of behavior?
arXiv Detail & Related papers (2020-02-12T08:07:49Z)
- LESS is More: Rethinking Probabilistic Models of Human Behavior [36.020541093946925]
The Boltzmann noisily-rational decision model assumes that people approximately optimize a reward function (its standard form is sketched after this list).
Human trajectories lie in a continuous space, with continuous-valued features that influence the reward function.
We introduce a model that explicitly accounts for distances between trajectories, rather than only their rewards.
arXiv Detail & Related papers (2020-01-13T18:59:01Z)
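For reference, the Boltzmann noisily-rational choice model that the last two entries (and the reward-inference setting of the main paper) build on is standardly written as follows. This is the textbook form rather than any single paper's notation, with $\beta$ a rationality coefficient, $R_\theta$ the reward, and $C$ the choice set:

\[
P(\xi \mid \theta, C) = \frac{\exp\!\big(\beta\, R_\theta(\xi)\big)}{\sum_{\xi' \in C} \exp\!\big(\beta\, R_\theta(\xi')\big)}
\]

Choice set misspecification, as studied in the main paper above, corresponds to the robot evaluating the denominator over an assumed set that differs from the set the human actually chose from.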