Causal Confusion and Reward Misidentification in Preference-Based Reward
Learning
- URL: http://arxiv.org/abs/2204.06601v4
- Date: Sat, 18 Mar 2023 20:44:45 GMT
- Title: Causal Confusion and Reward Misidentification in Preference-Based Reward
Learning
- Authors: Jeremy Tien, Jerry Zhi-Yang He, Zackory Erickson, Anca D. Dragan,
Daniel S. Brown
- Abstract summary: We study causal confusion and reward misidentification when learning from preferences.
We find that the presence of non-causal distractor features, noise in the stated preferences, and partial state observability can all exacerbate reward misidentification.
- Score: 33.944367978407904
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning policies via preference-based reward learning is an increasingly
popular method for customizing agent behavior, but has been shown anecdotally
to be prone to spurious correlations and reward hacking behaviors. While much
prior work focuses on causal confusion in reinforcement learning and behavioral
cloning, we focus on a systematic study of causal confusion and reward
misidentification when learning from preferences. In particular, we perform a
series of sensitivity and ablation analyses on several benchmark domains where
rewards learned from preferences achieve minimal test error but fail to
generalize to out-of-distribution states -- resulting in poor policy
performance when optimized. We find that the presence of non-causal distractor
features, noise in the stated preferences, and partial state observability can
all exacerbate reward misidentification. We also identify a set of methods with
which to interpret misidentified learned rewards. In general, we observe that
optimizing misidentified rewards drives the policy off the reward's training
distribution, resulting in high predicted (learned) rewards but low true
rewards. These findings illuminate the susceptibility of preference learning to
reward misidentification and causal confusion -- failure to consider even one
of many factors can result in unexpected, undesirable behavior.
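To make the setting concrete, below is a minimal sketch (not the authors' code) of preference-based reward learning with the standard Bradley-Terry model: a reward network is trained so that the trajectory preferred by the annotator receives the higher predicted return. All names (RewardNet, obs_dim, the synthetic trajectories) are illustrative assumptions; when a non-causal distractor feature correlates with preferred behavior in the training data, a network trained this way can latch onto it, which is the reward misidentification failure studied here.
```python
# Minimal sketch of Bradley-Terry preference-based reward learning.
# Names and architecture are illustrative, not taken from the paper's code.
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Small MLP mapping a state vector to a scalar per-step reward."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs):                # obs: (T, obs_dim)
        return self.net(obs).squeeze(-1)   # per-step rewards: (T,)

def preference_loss(reward_net, traj_a, traj_b, pref):
    """Bradley-Terry loss: pref = 1.0 if traj_a was preferred, 0.0 otherwise."""
    ret_a = reward_net(traj_a).sum()       # predicted return of trajectory A
    ret_b = reward_net(traj_b).sum()       # predicted return of trajectory B
    log_probs = torch.log_softmax(torch.stack([ret_a, ret_b]), dim=0)
    return -(pref * log_probs[0] + (1 - pref) * log_probs[1])

# One training step on synthetic stand-in data.
obs_dim = 8
net = RewardNet(obs_dim)
opt = torch.optim.Adam(net.parameters(), lr=3e-4)

traj_a = torch.randn(50, obs_dim)   # stand-ins for real trajectories
traj_b = torch.randn(50, obs_dim)
pref = torch.tensor(1.0)            # annotator preferred trajectory A

loss = preference_loss(net, traj_a, traj_b, pref)
opt.zero_grad()
loss.backward()
opt.step()
```
A reward fit this way can reach low preference-prediction error on held-out comparisons yet still assign high reward to out-of-distribution states, which is exactly the gap between predicted and true reward described in the abstract.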
Related papers
- Behavior Alignment via Reward Function Optimization [23.92721220310242]
We introduce a new framework that integrates auxiliary rewards reflecting a designer's domain knowledge with the environment's primary rewards.
We evaluate our method's efficacy on a diverse set of tasks, from small-scale experiments to high-dimensional control challenges.
arXiv Detail & Related papers (2023-10-29T13:45:07Z) - On The Fragility of Learned Reward Functions [4.826574398803286]
We study the causes of relearning failures in the domain of preference-based reward learning.
Based on our findings, we emphasize the need for more retraining-based evaluations in the literature.
arXiv Detail & Related papers (2023-01-09T19:45:38Z) - Causal Imitation Learning with Unobserved Confounders [82.22545916247269]
We study imitation learning when sensory inputs of the learner and the expert differ.
We show that imitation can still be feasible by exploiting quantitative knowledge of the expert trajectories.
arXiv Detail & Related papers (2022-08-12T13:29:53Z) - Reward Uncertainty for Exploration in Preference-based Reinforcement
Learning [88.34958680436552]
We present an exploration method designed specifically for preference-based reinforcement learning algorithms.
Our main idea is to design an intrinsic reward that measures novelty based on the learned reward.
Our experiments show that an exploration bonus derived from uncertainty in the learned reward improves both the feedback- and sample-efficiency of preference-based RL algorithms (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2022-05-24T23:22:10Z) - Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z) - Information Directed Reward Learning for Reinforcement Learning [64.33774245655401]
We learn a model of the reward function that allows standard RL algorithms to achieve high expected return with as few expert queries as possible.
In contrast to prior active reward learning methods designed for specific types of queries, IDRL naturally accommodates different query types.
We support our findings with extensive evaluations in multiple environments and with different types of queries.
arXiv Detail & Related papers (2021-02-24T18:46:42Z) - Deceptive Reinforcement Learning for Privacy-Preserving Planning [8.950168559003991]
Reinforcement learning is the problem of finding a behaviour policy based on rewards received from exploratory behaviour.
A key ingredient in reinforcement learning is a reward function, which determines how much reward (negative or positive) is given and when.
We present two models for solving the problem of privacy-preserving reinforcement learning.
arXiv Detail & Related papers (2021-02-05T06:50:04Z) - Understanding Learned Reward Functions [6.714172005695389]
We investigate techniques for interpreting learned reward functions.
In particular, we apply saliency methods to identify failure modes and predict the robustness of reward functions.
We find that learned reward functions often implement surprising algorithms that rely on contingent aspects of the environment.
arXiv Detail & Related papers (2020-12-10T18:19:48Z) - Learning "What-if" Explanations for Sequential Decision-Making [92.8311073739295]
Building interpretable parameterizations of real-world decision-making on the basis of demonstrated behavior is essential.
We propose learning explanations of expert decisions by modeling their reward function in terms of preferences with respect to "what if" outcomes.
We highlight the effectiveness of our batch, counterfactual inverse reinforcement learning approach in recovering accurate and interpretable descriptions of behavior.
arXiv Detail & Related papers (2020-07-02T14:24:17Z) - Effects of sparse rewards of different magnitudes in the speed of
learning of model-based actor critic methods [0.4640835690336653]
We show that we can influence an agent to learn faster by applying an external environmental pressure during training.
Results are shown to hold for Deep Deterministic Policy Gradient with Hindsight Experience Replay in a well-known MuJoCo environment.
arXiv Detail & Related papers (2020-01-18T20:52:05Z)
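As referenced in the Reward Uncertainty for Exploration entry above, here is a minimal sketch of an uncertainty-based exploration bonus, under the assumption that the bonus is computed as disagreement (standard deviation) across an ensemble of learned reward models; the class and parameter names are hypothetical and not taken from that paper's code.
```python
# Sketch: exploration bonus from disagreement in an ensemble of learned rewards.
# All names (make_reward_model, UncertaintyBonus, beta) are illustrative.
import torch
import torch.nn as nn

def make_reward_model(obs_dim, hidden=64):
    """One member of the reward ensemble; the architecture is illustrative."""
    return nn.Sequential(
        nn.Linear(obs_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, 1),
    )

class UncertaintyBonus:
    def __init__(self, obs_dim, n_models=3, beta=0.05):
        self.ensemble = [make_reward_model(obs_dim) for _ in range(n_models)]
        self.beta = beta  # scale of the exploration bonus

    def _ensemble_preds(self, obs):
        with torch.no_grad():
            return torch.stack([m(obs).squeeze(-1) for m in self.ensemble])

    def total_reward(self, obs):
        """Mean learned reward plus an uncertainty-based exploration bonus."""
        preds = self._ensemble_preds(obs)            # (n_models, batch)
        return preds.mean(dim=0) + self.beta * preds.std(dim=0)

# Usage: states where the reward ensemble disagrees receive a larger bonus,
# steering the policy toward regions where more preference feedback is useful.
bonus = UncertaintyBonus(obs_dim=8)
obs_batch = torch.randn(32, 8)
r = bonus.total_reward(obs_batch)   # shape (32,)
```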