Dynamics-Aware Comparison of Learned Reward Functions
- URL: http://arxiv.org/abs/2201.10081v1
- Date: Tue, 25 Jan 2022 03:48:00 GMT
- Title: Dynamics-Aware Comparison of Learned Reward Functions
- Authors: Blake Wulfe and Ashwin Balakrishna and Logan Ellis and Jean Mercat and
Rowan McAllister and Adrien Gaidon
- Abstract summary: The ability to learn reward functions plays an important role in enabling the deployment of intelligent agents in the real world.
Reward functions are typically compared by considering the behavior of optimized policies, but this approach conflates deficiencies in the reward function with those of the policy search algorithm used to optimize it.
We propose the Dynamics-Aware Reward Distance (DARD), a new reward pseudometric.
- Score: 21.159457412742356
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The ability to learn reward functions plays an important role in enabling the
deployment of intelligent agents in the real world. However, comparing reward
functions, for example as a means of evaluating reward learning methods,
presents a challenge. Reward functions are typically compared by considering
the behavior of optimized policies, but this approach conflates deficiencies in
the reward function with those of the policy search algorithm used to optimize
it. To address this challenge, Gleave et al. (2020) propose the
Equivalent-Policy Invariant Comparison (EPIC) distance. EPIC avoids policy
optimization, but in doing so requires computing reward values at transitions
that may be impossible under the system dynamics. This is problematic for
learned reward functions because it entails evaluating them outside of their
training distribution, resulting in inaccurate reward values that we show can
render EPIC ineffective at comparing rewards. To address this problem, we
propose the Dynamics-Aware Reward Distance (DARD), a new reward pseudometric.
DARD uses an approximate transition model of the environment to transform
reward functions into a form that allows for comparisons that are invariant to
reward shaping while only evaluating reward functions on transitions close to
their training distribution. Experiments in simulated physical domains
demonstrate that DARD enables reliable reward comparisons without policy
optimization and is significantly more predictive than baseline methods of
downstream policy performance when dealing with learned reward functions.
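To make this recipe concrete, below is a minimal sketch of the canonicalize-then-correlate pipeline the abstract describes, assuming batched callables `reward_fn(states, actions, next_states)` and `transition_model(states, actions)` plus an action sampler (all names are assumptions). The expectation terms follow the EPIC-style construction with model-sampled next states and are a simplified illustration, not DARD's exact definition.

```python
import numpy as np

def pearson_distance(x, y):
    """Correlation-based distance in [0, 1]: sqrt((1 - rho) / 2)."""
    rho = np.corrcoef(x, y)[0, 1]
    return float(np.sqrt(max(0.0, (1.0 - rho) / 2.0)))

def canonicalize(reward_fn, s, a, s_next, sample_actions, transition_model,
                 gamma=0.99, k=64):
    """Approximately shaping-invariant reward for one (s, a, s') transition.

    Follows the EPIC-style recipe, but the auxiliary next states are drawn from
    an approximate transition model instead of an arbitrary state distribution,
    so `reward_fn` is only queried on plausible transitions. The exact
    expectation terms in DARD differ; this is an illustrative simplification.
    """
    s, a, s_next = (np.asarray(x) for x in (s, a, s_next))
    tile = lambda x: np.repeat(x[None], k, axis=0)
    acts = sample_actions(k)                                   # A ~ D_A
    ns_from_s = transition_model(tile(s), acts)                # S'  ~ T(s, A)
    ns_from_sn = transition_model(tile(s_next), acts)          # S'' ~ T(s', A)
    base = reward_fn(s[None], a[None], s_next[None])[0]
    term_next = reward_fn(tile(s_next), acts, ns_from_sn).mean()
    term_cur = reward_fn(tile(s), acts, ns_from_s).mean()
    term_two_step = reward_fn(ns_from_s, acts, transition_model(ns_from_s, acts)).mean()
    return base + gamma * term_next - term_cur - gamma * term_two_step

def reward_distance(r1, r2, transitions, sample_actions, transition_model, gamma=0.99):
    """Pearson distance between two canonicalized rewards over a batch of
    (state, action, next_state) transitions drawn near the training data."""
    c1, c2 = ([canonicalize(r, s, a, sn, sample_actions, transition_model, gamma)
               for (s, a, sn) in transitions] for r in (r1, r2))
    return pearson_distance(np.asarray(c1), np.asarray(c2))
```

Because every auxiliary reward query is paired with a model-predicted next state, the learned reward networks are only evaluated on transitions close to the data they were trained on, which is the property the abstract emphasizes.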
Related papers
- REBEL: A Regularization-Based Solution for Reward Overoptimization in Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and user intentions, values, or social norms can be catastrophic in the real world.
Current methods to mitigate this misalignment work by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z)
- STARC: A General Framework For Quantifying Differences Between Reward Functions [55.33869271912095]
We provide a class of pseudometrics on the space of all reward functions that we call STARC metrics.
We show that STARC metrics induce both an upper and a lower bound on worst-case regret.
We also identify a number of issues with reward metrics proposed by earlier works.
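As a rough illustration of what a pseudometric over reward functions can look like on a small tabular MDP, the sketch below standardizes each reward (cancelling potential-based shaping via a fixed policy's value function, then rescaling to unit weighted norm) before measuring the distance between the results; the particular canonicalization, norm, and weighting here are illustrative choices rather than the paper's definitions.

```python
import numpy as np

def policy_value(R, P, pi, gamma):
    """V^pi for a tabular reward R[s, a, s'], dynamics P[s, a, s'], policy pi[s, a]."""
    n_s = R.shape[0]
    r_pi = np.einsum('sa,sap,sap->s', pi, P, R)   # expected one-step reward per state
    p_pi = np.einsum('sa,sap->sp', pi, P)         # state transition matrix under pi
    return np.linalg.solve(np.eye(n_s) - gamma * p_pi, r_pi)

def standardize(R, P, pi, gamma, weights):
    """Cancel potential-based shaping, then rescale to unit weighted L2 norm.
    The value-function canonicalization and L2 norm are one concrete choice."""
    V = policy_value(R, P, pi, gamma)
    canon = R + gamma * V[None, None, :] - V[:, None, None]
    norm = np.sqrt((weights * canon ** 2).sum())
    return canon / max(norm, 1e-12)

def starc_style_distance(R1, R2, P, pi, gamma, weights):
    """Distance between standardized rewards; `weights` is a distribution over transitions."""
    diff = standardize(R1, P, pi, gamma, weights) - standardize(R2, P, pi, gamma, weights)
    return 0.5 * np.sqrt((weights * diff ** 2).sum())
```

By construction, the result is intended to be zero precisely when the two rewards differ only by potential shaping and positive rescaling, which is the kind of equivalence such metrics are designed to respect.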
arXiv Detail & Related papers (2023-09-26T20:31:19Z)
- Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
- Invariance in Policy Optimisation and Partial Identifiability in Reward Learning [67.4640841144101]
We characterise the partial identifiability of the reward function given popular reward learning data sources.
We also analyse the impact of this partial identifiability for several downstream tasks, such as policy optimisation.
arXiv Detail & Related papers (2022-03-14T20:19:15Z)
- Replacing Rewards with Examples: Example-Based Policy Search via Recursive Classification [133.20816939521941]
In the standard Markov decision process formalism, users specify tasks by writing down a reward function.
In many scenarios, the user is unable to describe the task in words or numbers, but can readily provide examples of what the world would look like if the task were solved.
Motivated by this observation, we derive a control algorithm that aims to visit states that have a high probability of leading to successful outcomes, given only examples of successful outcome states.
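As a simplified stand-in for the example-based setting (not the paper's recursive classifier, which bootstraps the probability of future success rather than scoring individual states), the snippet below trains a classifier to separate example success states from states visited by the current policy and uses its log-odds as a proxy reward; all names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_success_classifier(success_states, policy_states):
    """Distinguish 'task solved' states (label 1) from ordinary policy states (label 0)."""
    X = np.vstack([success_states, policy_states])
    y = np.concatenate([np.ones(len(success_states)), np.zeros(len(policy_states))])
    return LogisticRegression(max_iter=1000).fit(X, y)

def proxy_reward(classifier, states):
    """Log-odds of belonging to the success class, usable as a dense stand-in reward."""
    p = np.clip(classifier.predict_proba(states)[:, 1], 1e-6, 1 - 1e-6)
    return np.log(p) - np.log(1.0 - p)
```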
arXiv Detail & Related papers (2021-03-23T16:19:55Z)
- Self-Supervised Online Reward Shaping in Sparse-Reward Environments [36.01839934355542]
We propose a novel reinforcement learning framework that performs self-supervised online reward shaping.
The proposed framework alternates between updating a policy and inferring a reward function.
Experimental results on several sparse-reward environments demonstrate that the proposed algorithm is significantly more sample efficient than the state-of-the-art baseline.
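The alternation described above can be pictured as a plain training loop; every callable below is a placeholder for the learner's concrete components, and the paper's specific self-supervised reward-inference step is not reproduced here.

```python
def train_with_online_shaping(rollout, infer_reward, update_policy, policy, n_iters=100):
    """Alternate between inferring a shaping reward from the agent's own experience
    and updating the policy on the environment reward plus that shaping term."""
    replay = []
    for _ in range(n_iters):
        trajectories = rollout(policy)           # collect experience with the current policy
        replay.extend(trajectories)
        shaping_fn = infer_reward(replay)        # self-supervised reward-inference step
        policy = update_policy(policy, trajectories, shaping_fn)
    return policy
```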
arXiv Detail & Related papers (2021-03-08T03:28:04Z)
- Efficient Exploration of Reward Functions in Inverse Reinforcement Learning via Bayesian Optimization [43.51553742077343]
Inverse reinforcement learning (IRL) is relevant to a variety of tasks including value alignment and robot learning from demonstration.
This paper presents an IRL framework called Bayesian optimization-IRL (BO-IRL), which identifies multiple solutions consistent with the expert demonstrations.
arXiv Detail & Related papers (2020-11-17T10:17:45Z)
- Learning to Utilize Shaping Rewards: A New Approach of Reward Shaping [71.214923471669]
Reward shaping is an effective technique for incorporating domain knowledge into reinforcement learning (RL).
In this paper, we consider the problem of adaptively utilizing a given shaping reward function.
Experiments in sparse-reward cartpole and MuJoCo environments show that our algorithms can fully exploit beneficial shaping rewards.
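One toy way to picture "adaptively utilizing" a given shaping reward is to scale the shaping term by a coefficient that is nudged up or down depending on whether shaped training actually improves returns; the paper's formulation is considerably more careful about how the shaping signal is weighted, so treat the names and update rule below as assumptions.

```python
import numpy as np

def combined_reward(r_env, r_shape, weight):
    """Environment reward plus a scaled shaping term."""
    return r_env + weight * r_shape

def adapt_weight(weight, returns_with_shaping, returns_without, lr=0.1):
    """Toy rule: raise the shaping weight when shaped rollouts earn higher average
    environment return than unshaped ones, lower it otherwise."""
    delta = np.mean(returns_with_shaping) - np.mean(returns_without)
    return float(np.clip(weight + lr * np.sign(delta), 0.0, 1.0))
```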
arXiv Detail & Related papers (2020-11-05T05:34:14Z)
- Reward Machines: Exploiting Reward Function Structure in Reinforcement Learning [22.242379207077217]
We show how to expose the reward function's code to the RL agent so that it can exploit the function's internal structure to learn optimal policies.
First, we propose reward machines, a type of finite state machine that supports the specification of reward functions.
We then describe different methodologies to exploit this structure to support learning, including automated reward shaping, task decomposition, and counterfactual reasoning with off-policy learning.
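A reward machine is compact enough to write down directly: a finite set of machine states, transitions driven by high-level event labels, and a reward emitted on each transition. The minimal class below illustrates the data structure; the two-stage "get coffee, then reach the office" example is illustrative.

```python
class RewardMachine:
    """Minimal reward machine: finite machine states, transitions triggered by
    high-level event labels, and a reward attached to each transition."""

    def __init__(self, initial_state, transitions):
        # transitions: {(machine_state, label): (next_machine_state, reward)}
        self.initial_state = initial_state
        self.transitions = transitions
        self.state = initial_state

    def reset(self):
        self.state = self.initial_state

    def step(self, label):
        """Advance on an observed label; unlisted labels leave the state unchanged, reward 0."""
        next_state, reward = self.transitions.get((self.state, label), (self.state, 0.0))
        self.state = next_state
        return reward

# Illustrative two-stage task: pick up coffee, then deliver it to the office.
rm = RewardMachine(
    initial_state="u0",
    transitions={
        ("u0", "coffee"): ("u1", 0.0),   # coffee obtained, no reward yet
        ("u1", "office"): ("u2", 1.0),   # coffee delivered, task reward
    },
)
```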
arXiv Detail & Related papers (2020-10-06T00:10:16Z)
- Quantifying Differences in Reward Functions [24.66221171351157]
We introduce the Equivalent-Policy Invariant Comparison (EPIC) distance to quantify the difference between two reward functions directly.
We prove EPIC is invariant on an equivalence class of reward functions that always induce the same optimal policy.
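For contrast with the dynamics-aware sketch under the main abstract, EPIC's canonical form draws its auxiliary states and actions from fixed coverage distributions, independently of the environment dynamics, which is exactly why it can end up querying a reward on transitions that could never occur. The snippet below is a Monte Carlo estimate of that canonical form, with the sampler names being assumptions.

```python
import numpy as np

def epic_canonicalize(reward_fn, s, a, s_next, sample_states, sample_actions,
                      gamma=0.99, k=64):
    """Monte Carlo estimate of the canonically shaped reward
        C(R)(s, a, s') = R(s, a, s') + E[gamma R(s', A, S') - R(s, A, S') - gamma R(S, A, S')],
    with S, A, S' drawn independently from coverage distributions. Because these
    samples ignore the dynamics, some (state, action, next state) triples may be
    impossible transitions, which is the issue DARD is designed to avoid."""
    s, a, s_next = (np.asarray(x) for x in (s, a, s_next))
    S, A, S2 = sample_states(k), sample_actions(k), sample_states(k)
    tile = lambda x: np.repeat(x[None], k, axis=0)
    base = reward_fn(s[None], a[None], s_next[None])[0]
    correction = (gamma * reward_fn(tile(s_next), A, S2)
                  - reward_fn(tile(s), A, S2)
                  - gamma * reward_fn(S, A, S2)).mean()
    return base + correction
```

The EPIC distance itself is then the Pearson distance between two such canonicalized rewards over a batch of coverage transitions, mirroring the pipeline sketched after the main abstract.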
arXiv Detail & Related papers (2020-06-24T17:35:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.