Understanding Learned Reward Functions
- URL: http://arxiv.org/abs/2012.05862v1
- Date: Thu, 10 Dec 2020 18:19:48 GMT
- Title: Understanding Learned Reward Functions
- Authors: Eric J. Michaud, Adam Gleave, Stuart Russell
- Abstract summary: We investigate techniques for interpreting learned reward functions.
In particular, we apply saliency methods to identify failure modes and predict the robustness of reward functions.
We find that learned reward functions often implement surprising algorithms that rely on contingent aspects of the environment.
- Score: 6.714172005695389
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In many real-world tasks, it is not possible to procedurally specify an RL
agent's reward function. In such cases, a reward function must instead be
learned from interacting with and observing humans. However, current techniques
for reward learning may fail to produce reward functions which accurately
reflect user preferences. Absent significant advances in reward learning, it is
thus important to be able to audit learned reward functions to verify whether
they truly capture user preferences. In this paper, we investigate techniques
for interpreting learned reward functions. In particular, we apply saliency
methods to identify failure modes and predict the robustness of reward
functions. We find that learned reward functions often implement surprising
algorithms that rely on contingent aspects of the environment. We also discover
that existing interpretability techniques often attend to irrelevant changes in
reward output, suggesting that reward interpretability may need significantly
different methods from policy interpretability.
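To make the kind of audit described above concrete, here is a minimal sketch of gradient-based saliency applied to a learned reward model. The `RewardNet` architecture, input dimensions, and data are illustrative assumptions, not the models or environments studied in the paper.

```python
import torch
import torch.nn as nn

# Hypothetical learned reward model: maps a (state, action, next_state)
# transition to a scalar reward. Architecture is illustrative only.
class RewardNet(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim + act_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, obs, act, next_obs):
        return self.net(torch.cat([obs, act, next_obs], dim=-1)).squeeze(-1)

def gradient_saliency(reward_net, obs, act, next_obs):
    """Return |d reward / d obs| for each observation feature.

    Large values indicate features the learned reward is sensitive to;
    saliency concentrated on task-irrelevant features is a warning sign.
    """
    obs = obs.clone().requires_grad_(True)
    reward = reward_net(obs, act, next_obs).sum()
    reward.backward()
    return obs.grad.abs()

# Example usage with random inputs (batch of 8 transitions).
reward_net = RewardNet(obs_dim=10, act_dim=4)
obs, act, next_obs = torch.randn(8, 10), torch.randn(8, 4), torch.randn(8, 10)
print(gradient_saliency(reward_net, obs, act, next_obs).shape)  # torch.Size([8, 10])
```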
Related papers
- Adaptive Language-Guided Abstraction from Contrastive Explanations [53.48583372522492]
It is necessary to determine which features of the environment are relevant before determining how these features should be used to compute reward.
End-to-end methods for joint feature and reward learning often yield brittle reward functions that are sensitive to spurious state features.
This paper describes a method named ALGAE, which alternates between using language models to iteratively identify human-meaningful features and standard reward-learning techniques to weight those features.
arXiv Detail & Related papers (2024-09-12T16:51:58Z)
- STARC: A General Framework For Quantifying Differences Between Reward Functions [55.33869271912095]
We provide a class of pseudometrics on the space of all reward functions that we call STARC metrics.
We show that STARC metrics induce both an upper and a lower bound on worst-case regret.
We also identify a number of issues with reward metrics proposed by earlier works.
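As a rough illustration of the normalise-and-compare idea (not the paper's exact construction, which additionally canonicalises away potential-based shaping), here is a minimal sketch for tabular rewards represented as flat vectors over (s, a, s') transitions:

```python
import numpy as np

def normalised_distance(r1: np.ndarray, r2: np.ndarray) -> float:
    """Distance between two tabular rewards after unit-norm normalisation.

    Invariant to positive rescaling of either reward; the STARC metrics
    in the paper also remove potential-based shaping before comparing.
    """
    u1 = r1 / np.linalg.norm(r1)
    u2 = r2 / np.linalg.norm(r2)
    return float(np.linalg.norm(u1 - u2))

rng = np.random.default_rng(0)
r = rng.normal(size=(5 * 3 * 5,))       # rewards over (s, a, s') triples
print(normalised_distance(r, 2.5 * r))  # ~0: positive rescaling leaves it unchanged
print(normalised_distance(r, rng.normal(size=r.shape)))  # > 0 for unrelated rewards
```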
arXiv Detail & Related papers (2023-09-26T20:31:19Z)
- Basis for Intentions: Efficient Inverse Reinforcement Learning using Past Experience [89.30876995059168]
This paper addresses the problem of inverse reinforcement learning (IRL): inferring the reward function of an agent from observations of its behavior, and shows how past experience can be leveraged to make this inference more efficient.
arXiv Detail & Related papers (2022-08-09T17:29:49Z)
- Preprocessing Reward Functions for Interpretability [2.538209532048867]
We propose exploiting the intrinsic structure of reward functions by first preprocessing them into simpler but equivalent reward functions.
Our empirical evaluation shows that preprocessed rewards are often significantly easier to understand than the original reward.
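One concrete instance of such preprocessing is removing potential-based shaping, which changes how a tabular reward looks without changing which policies it prefers. The sketch below is a simplified illustration of this idea (not the paper's exact procedure): it picks the potential that minimises the L2 norm of an equivalent reward via least squares.

```python
import numpy as np

def simplify_reward(R: np.ndarray, gamma: float = 0.99) -> np.ndarray:
    """Return an equivalent reward R'(s,a,s') = R + gamma*phi(s') - phi(s).

    phi is chosen by least squares to minimise ||R'||_2, so the returned
    reward induces the same policy ordering but is often easier to read.
    """
    n_s, n_a, _ = R.shape
    # Design matrix: one row per (s, a, s') transition, one column per state.
    A = np.zeros((n_s * n_a * n_s, n_s))
    row = 0
    for s in range(n_s):
        for a in range(n_a):
            for s2 in range(n_s):
                A[row, s] -= 1.0
                A[row, s2] += gamma
                row += 1
    phi, *_ = np.linalg.lstsq(A, -R.reshape(-1), rcond=None)
    return R + gamma * phi[None, None, :] - phi[:, None, None]

# Shaping obscures a simple goal reward; preprocessing recovers a lower-norm equivalent.
rng = np.random.default_rng(0)
base = np.zeros((4, 2, 4)); base[:, :, 3] = 1.0   # reward only for reaching state 3
phi = rng.normal(size=4)
shaped = base + 0.99 * phi[None, None, :] - phi[:, None, None]
print(np.linalg.norm(shaped), np.linalg.norm(simplify_reward(shaped)))
```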
arXiv Detail & Related papers (2022-03-25T10:19:35Z)
- Invariance in Policy Optimisation and Partial Identifiability in Reward Learning [67.4640841144101]
We characterise the partial identifiability of the reward function given popular reward learning data sources.
We also analyse the impact of this partial identifiability for several downstream tasks, such as policy optimisation.
arXiv Detail & Related papers (2022-03-14T20:19:15Z)
- Replacing Rewards with Examples: Example-Based Policy Search via Recursive Classification [133.20816939521941]
In the standard Markov decision process formalism, users specify tasks by writing down a reward function.
In many scenarios, the user is unable to describe the task in words or numbers, but can readily provide examples of what the world would look like if the task were solved.
Motivated by this observation, we derive a control algorithm that aims to visit states that have a high probability of leading to successful outcomes, given only examples of successful outcome states.
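A heavily simplified sketch of the example-based setting (not the paper's recursive-classification algorithm): train a classifier to distinguish user-provided success states from states the current policy visits, and use its predicted success probability as a proxy reward for a standard RL learner. The network, dimensions, and data below are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Classifier over states estimating p(success | s). Architecture is illustrative only.
classifier = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
optimiser = torch.optim.Adam(classifier.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

success_states = torch.randn(128, 8)   # user-provided examples of solved outcomes
policy_states = torch.randn(512, 8)    # states visited by the current policy

for _ in range(200):
    pos = success_states[torch.randint(0, len(success_states), (64,))]
    neg = policy_states[torch.randint(0, len(policy_states), (64,))]
    logits = classifier(torch.cat([pos, neg])).squeeze(-1)
    labels = torch.cat([torch.ones(64), torch.zeros(64)])
    loss = bce(logits, labels)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()

def proxy_reward(state: torch.Tensor) -> torch.Tensor:
    """Predicted success probability, used in place of a hand-written reward."""
    with torch.no_grad():
        return torch.sigmoid(classifier(state)).squeeze(-1)

print(proxy_reward(torch.randn(4, 8)))
```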
arXiv Detail & Related papers (2021-03-23T16:19:55Z)
- Deceptive Reinforcement Learning for Privacy-Preserving Planning [8.950168559003991]
Reinforcement learning is the problem of finding a behaviour policy based on rewards received from exploratory behaviour.
A key ingredient in reinforcement learning is a reward function, which determines how much reward (negative or positive) is given and when.
We present two models for solving the problem of privacy-preserving reinforcement learning.
arXiv Detail & Related papers (2021-02-05T06:50:04Z)
- Learning to Utilize Shaping Rewards: A New Approach of Reward Shaping [71.214923471669]
Reward shaping is an effective technique for incorporating domain knowledge into reinforcement learning (RL).
In this paper, we consider the problem of adaptively utilizing a given shaping reward function.
Experiments in sparse-reward cartpole and MuJoCo environments show that our algorithms can fully exploit beneficial shaping rewards.
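For background, the most common shaping term is potential-based, F(s, s') = gamma*phi(s') - phi(s), which preserves optimal policies. Below is a minimal sketch with an invented potential function; the paper itself tackles the harder problem of adaptively weighting a given, possibly unhelpful, shaping reward.

```python
import numpy as np

def potential(state: np.ndarray) -> float:
    """Illustrative potential: negative distance to a hypothetical goal at the origin."""
    return -float(np.linalg.norm(state))

def shaped_reward(r_env: float, state, next_state, gamma: float = 0.99) -> float:
    """Environment reward plus potential-based shaping term F = gamma*phi(s') - phi(s)."""
    return r_env + gamma * potential(next_state) - potential(state)

# Moving closer to the goal earns a positive shaping bonus even when r_env is sparse (0).
print(shaped_reward(0.0, np.array([2.0, 0.0]), np.array([1.0, 0.0])))
```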
arXiv Detail & Related papers (2020-11-05T05:34:14Z)
- Reward Machines: Exploiting Reward Function Structure in Reinforcement Learning [22.242379207077217]
We show how to expose the reward function's code to the RL agent so it can exploit the function's internal structure to learn optimal policies.
First, we propose reward machines, a type of finite state machine that supports the specification of reward functions.
We then describe different methodologies to exploit this structure to support learning, including automated reward shaping, task decomposition, and counterfactual reasoning with off-policy learning.
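A minimal sketch of the reward-machine idea, using an invented "fetch coffee, then deliver it" task; the machine states, labels, and rewards are illustrative assumptions.

```python
from typing import Dict, Tuple

class RewardMachine:
    """Finite state machine over high-level propositions that emits rewards.

    transitions maps (machine_state, label) -> (next_machine_state, reward).
    Labels not listed leave the machine state unchanged with zero reward.
    """
    def __init__(self, transitions: Dict[Tuple[str, str], Tuple[str, float]], start: str):
        self.transitions = transitions
        self.state = start

    def step(self, label: str) -> float:
        next_state, reward = self.transitions.get((self.state, label), (self.state, 0.0))
        self.state = next_state
        return reward

# Toy task: pick up coffee ("c"), then deliver it to the office ("o").
rm = RewardMachine(
    transitions={
        ("u0", "c"): ("u1", 0.0),   # got coffee
        ("u1", "o"): ("u2", 1.0),   # delivered coffee: task complete
    },
    start="u0",
)
for label in ["", "c", "", "o"]:    # label sequence produced by the environment
    print(rm.state, "->", rm.step(label))
```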
arXiv Detail & Related papers (2020-10-06T00:10:16Z)
- Pitfalls of learning a reward function online [28.2272248328398]
We consider a continual ("one life") learning approach where the agent both learns the reward function and optimises for it at the same time.
This comes with a number of pitfalls, such as the agent deliberately manipulating the reward-learning process in a particular direction.
We show that an uninfluenceable process is automatically unriggable, and if the set of possible environments is sufficiently rich, the converse is true too.
arXiv Detail & Related papers (2020-04-28T16:58:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.