Preprocessing Reward Functions for Interpretability
- URL: http://arxiv.org/abs/2203.13553v1
- Date: Fri, 25 Mar 2022 10:19:35 GMT
- Title: Preprocessing Reward Functions for Interpretability
- Authors: Erik Jenner, Adam Gleave
- Abstract summary: We propose exploiting the intrinsic structure of reward functions by first preprocessing them into simpler but equivalent reward functions.
Our empirical evaluation shows that preprocessed rewards are often significantly easier to understand than the original reward.
- Score: 2.538209532048867
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In many real-world applications, the reward function is too complex to be
manually specified. In such cases, reward functions must instead be learned
from human feedback. Since the learned reward may fail to represent user
preferences, it is important to be able to validate the learned reward function
prior to deployment. One promising approach is to apply interpretability tools
to the reward function to spot potential deviations from the user's intention.
Existing work has applied general-purpose interpretability tools to understand
learned reward functions. We propose exploiting the intrinsic structure of
reward functions by first preprocessing them into simpler but equivalent reward
functions, which are then visualized. We introduce a general framework for such
reward preprocessing and propose concrete preprocessing algorithms. Our
empirical evaluation shows that preprocessed rewards are often significantly
easier to understand than the original reward.
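As a concrete illustration of the abstract's central idea (replacing a learned reward with a simpler but equivalent one before visualizing it), the sketch below searches over potential-shaping transformations, which preserve optimal policies, for one that minimizes an L1 sparsity measure of the reward. The toy reward table, the subgradient optimizer, and all hyperparameters are made up for illustration; this is not the authors' implementation.

```python
# Minimal sketch: potential shaping R'(s,a,s') = R(s,a,s') + gamma*phi(s') - phi(s)
# preserves optimal policies, so we can look for a potential phi that makes the
# reward as sparse as possible before visualizing it. Toy tabular example only.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 2, 0.99

# A toy "learned" reward: a sparse goal reward hidden under a random shaping term.
sparse = np.zeros((n_states, n_actions, n_states))
sparse[:, :, n_states - 1] = 1.0                     # reward for reaching the last state
hidden = rng.normal(size=n_states)
R = sparse + gamma * hidden[None, None, :] - hidden[:, None, None]

# Optimize phi so the equivalently shaped reward has small L1 norm;
# plain subgradient descent is enough for this toy case.
phi = np.zeros(n_states)
for t in range(5000):
    shaped = R + gamma * phi[None, None, :] - phi[:, None, None]
    g = np.sign(shaped)
    grad = gamma * g.sum(axis=(0, 1)) - g.sum(axis=(1, 2))
    phi -= 0.1 / np.sqrt(t + 1) * grad

preprocessed = R + gamma * phi[None, None, :] - phi[:, None, None]
print("L1 norm before:", round(float(np.abs(R).sum()), 2),
      "after:", round(float(np.abs(preprocessed).sum()), 2))
```

In this toy case, visualizing `preprocessed` instead of `R` shows only the goal structure, with the shaping term stripped away.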
Related papers
- STARC: A General Framework For Quantifying Differences Between Reward Functions [55.33869271912095]
We provide a class of pseudometrics on the space of all reward functions that we call STARC metrics.
We show that STARC metrics induce both an upper and a lower bound on worst-case regret.
We also identify a number of issues with reward metrics proposed by earlier works.
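To make "pseudometrics on the space of all reward functions" concrete, here is a small sketch of the canonicalize, normalize, and compare recipe that such metrics follow. The canonicalization below is the EPIC-style mean-centering over uniform distributions, chosen only because it is simple to write down; it illustrates the recipe rather than STARC's exact construction.

```python
# Hedged sketch of the canonicalize -> normalize -> compare recipe behind reward
# pseudometrics such as STARC. Tabular rewards of shape (S, A, S') only.
import numpy as np

def canonicalize(R, gamma):
    """Remove the potential-shaping component by EPIC-style mean-centering."""
    mean_next = R.mean(axis=(1, 2))      # E_{A,S'} R(s, A, S') for each s
    mean_all = R.mean()                  # E_{S,A,S'} R(S, A, S')
    return R + gamma * mean_next[None, None, :] - mean_next[:, None, None] - gamma * mean_all

def reward_distance(R1, R2, gamma=0.99):
    c1, c2 = canonicalize(R1, gamma), canonicalize(R2, gamma)
    c1 = c1 / (np.linalg.norm(c1) + 1e-12)   # normalize away positive rescaling
    c2 = c2 / (np.linalg.norm(c2) + 1e-12)
    return np.linalg.norm(c1 - c2)

rng = np.random.default_rng(0)
R = rng.normal(size=(4, 2, 4))
phi = rng.normal(size=4)
R_shaped = 3.0 * (R + 0.99 * phi[None, None, :] - phi[:, None, None])   # equivalent reward
print(reward_distance(R, R_shaped))                  # ~0: shaping and scaling are ignored
print(reward_distance(R, rng.normal(size=(4, 2, 4))))  # clearly > 0 for an unrelated reward
```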
arXiv Detail & Related papers (2023-09-26T20:31:19Z)
- Unpacking Reward Shaping: Understanding the Benefits of Reward Engineering on Sample Complexity [114.88145406445483]
Reinforcement learning provides an automated framework for learning behaviors from high-level reward specifications.
In practice the choice of reward function can be crucial for good results.
arXiv Detail & Related papers (2022-10-18T04:21:25Z)
- Basis for Intentions: Efficient Inverse Reinforcement Learning using Past Experience [89.30876995059168]
This paper addresses the problem of inverse reinforcement learning (IRL): inferring the reward function of an agent from observing its behavior.
arXiv Detail & Related papers (2022-08-09T17:29:49Z)
- Invariance in Policy Optimisation and Partial Identifiability in Reward Learning [67.4640841144101]
We characterise the partial identifiability of the reward function given popular reward learning data sources.
We also analyse the impact of this partial identifiability for several downstream tasks, such as policy optimisation.
arXiv Detail & Related papers (2022-03-14T20:19:15Z)
- Dynamics-Aware Comparison of Learned Reward Functions [21.159457412742356]
The ability to learn reward functions plays an important role in enabling the deployment of intelligent agents in the real world.
Reward functions are typically compared by considering the behavior of optimized policies, but this approach conflates deficiencies in the reward function with those of the policy search algorithm used to optimize it.
We propose the Dynamics-Aware Reward Distance (DARD), a new reward pseudometric.
arXiv Detail & Related papers (2022-01-25T03:48:00Z)
- Replacing Rewards with Examples: Example-Based Policy Search via Recursive Classification [133.20816939521941]
In the standard Markov decision process formalism, users specify tasks by writing down a reward function.
In many scenarios, the user is unable to describe the task in words or numbers, but can readily provide examples of what the world would look like if the task were solved.
Motivated by this observation, we derive a control algorithm that aims to visit states that have a high probability of leading to successful outcomes, given only examples of successful outcome states.
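A toy way to see what the example-based objective targets: the discounted probability of eventually reaching one of the example success states. The sketch below computes that quantity by dynamic programming in a made-up tabular MDP; the paper instead estimates a related quantity with a learned classifier trained by recursive classification, so this is only an illustration of the target, not the algorithm.

```python
# Discounted probability of eventually hitting an example "success" state,
# computed by a value-iteration-style backup in a toy tabular MDP.
import numpy as np

n_states, n_actions, gamma = 6, 2, 0.95
success = np.zeros(n_states, dtype=bool)
success[-1] = True                                   # example outcome states; no reward function

rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # T[s, a] is a distribution over s'

C = np.zeros((n_states, n_actions))
for _ in range(200):
    # Bootstrap: value 1 at success states, discounted greedy continuation elsewhere.
    future = np.where(success, 1.0, gamma * C.max(axis=1))
    C = T @ future

policy = C.argmax(axis=1)   # steer toward states likely to lead to successful outcomes
print(np.round(C, 3), policy)
```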
arXiv Detail & Related papers (2021-03-23T16:19:55Z)
- Understanding Learned Reward Functions [6.714172005695389]
We investigate techniques for interpreting learned reward functions.
In particular, we apply saliency methods to identify failure modes and predict the robustness of reward functions.
We find that learned reward functions often implement surprising algorithms that rely on contingent aspects of the environment.
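A minimal sketch of the kind of saliency probe described above, applied to a reward model: compute the gradient of the predicted reward with respect to the observation and inspect its magnitude. The tiny PyTorch MLP and the random observation are placeholders for a real learned reward model and real environment observations.

```python
# Gradient saliency for a learned reward model R(obs) -> scalar.
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Stand-in for a learned reward model."""
    def __init__(self, obs_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)

obs_dim = 8
reward_model = RewardNet(obs_dim)            # in practice: load the learned reward model
obs = torch.randn(obs_dim, requires_grad=True)

reward = reward_model(obs)
reward.backward()

# Features with large |dR/d obs| are the ones the reward model is most sensitive to;
# inspecting them can reveal reliance on contingent aspects of the environment.
saliency = obs.grad.abs()
print(saliency)
```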
arXiv Detail & Related papers (2020-12-10T18:19:48Z)
- Learning to Utilize Shaping Rewards: A New Approach of Reward Shaping [71.214923471669]
Reward shaping is an effective technique for incorporating domain knowledge into reinforcement learning (RL).
In this paper, we consider the problem of adaptively utilizing a given shaping reward function.
Experiments in sparse-reward cartpole and MuJoCo environments show that our algorithms can fully exploit beneficial shaping rewards.
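A minimal sketch of the setup this line of work starts from: an environment reward augmented by a given shaping reward, modulated by a weight that an adaptive method would tune (here it is just a fixed placeholder). It assumes Gymnasium's CartPole-v1 as a convenient testbed and a random policy as a stand-in for the learner; it is not the paper's algorithm.

```python
# Environment reward plus a weighted, hand-given shaping reward.
import gymnasium as gym
import numpy as np

def shaping_reward(obs: np.ndarray) -> float:
    """Hand-given shaping signal: prefer a small pole angle (CartPole obs[2])."""
    return -abs(float(obs[2]))

env = gym.make("CartPole-v1")
obs, _ = env.reset(seed=0)
shaping_weight = 0.5          # the quantity an adaptive scheme would learn instead of fixing

total = 0.0
for _ in range(200):
    action = env.action_space.sample()                          # stand-in for the RL policy
    obs, env_reward, terminated, truncated, _ = env.step(action)
    reward = env_reward + shaping_weight * shaping_reward(obs)  # shaped reward used for learning
    total += reward
    if terminated or truncated:
        obs, _ = env.reset()
print(total)
```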
arXiv Detail & Related papers (2020-11-05T05:34:14Z)
- Reward Machines: Exploiting Reward Function Structure in Reinforcement Learning [22.242379207077217]
We show how to expose the reward function's code to the RL agent so that it can exploit the function's internal structure to learn optimal policies.
First, we propose reward machines, a type of finite state machine that supports the specification of reward functions.
We then describe different methodologies to exploit this structure to support learning, including automated reward shaping, task decomposition, and counterfactual reasoning with off-policy learning.
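A minimal sketch of a reward machine as a finite-state machine over high-level events that emits rewards; exposing this structure to the agent is what enables automated reward shaping, task decomposition, and counterfactual updates. The toy "get coffee, then deliver it to the office" task is in the spirit of examples from this line of work, and the class below is an illustration rather than the paper's formalism.

```python
# A tiny reward machine: machine states, event-labelled transitions, and rewards.
from typing import Dict, FrozenSet, Tuple

class RewardMachine:
    def __init__(self, transitions: Dict[Tuple[str, str], Tuple[str, float]], initial: str):
        # transitions: (machine_state, event) -> (next_machine_state, reward)
        self.transitions = transitions
        self.state = initial

    def step(self, events: FrozenSet[str]) -> float:
        """Advance on the events detected in the environment this step; return reward."""
        reward = 0.0
        for event in events:
            if (self.state, event) in self.transitions:
                self.state, r = self.transitions[(self.state, event)]
                reward += r
        return reward

rm = RewardMachine(
    transitions={
        ("u0", "coffee"): ("u1", 0.0),      # picked up coffee, no reward yet
        ("u1", "office"): ("u_done", 1.0),  # delivered it: task complete
    },
    initial="u0",
)
print(rm.step(frozenset({"coffee"})), rm.state)   # 0.0 u1
print(rm.step(frozenset({"office"})), rm.state)   # 1.0 u_done
```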
arXiv Detail & Related papers (2020-10-06T00:10:16Z)