Quantifying Differences in Reward Functions
- URL: http://arxiv.org/abs/2006.13900v3
- Date: Wed, 17 Mar 2021 21:54:55 GMT
- Title: Quantifying Differences in Reward Functions
- Authors: Adam Gleave, Michael Dennis, Shane Legg, Stuart Russell, Jan Leike
- Abstract summary: We introduce the Equivalent-Policy Invariant Comparison (EPIC) distance to quantify the difference between two reward functions directly.
We prove EPIC is invariant on an equivalence class of reward functions that always induce the same optimal policy.
- Score: 24.66221171351157
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For many tasks, the reward function is inaccessible to introspection or too
complex to be specified procedurally, and must instead be learned from user
data. Prior work has evaluated learned reward functions by evaluating policies
optimized for the learned reward. However, this method cannot distinguish
between the learned reward function failing to reflect user preferences and the
policy optimization process failing to optimize the learned reward. Moreover,
this method can only tell us about behavior in the evaluation environment, but
the reward may incentivize very different behavior in even a slightly different
deployment environment. To address these problems, we introduce the
Equivalent-Policy Invariant Comparison (EPIC) distance to quantify the
difference between two reward functions directly, without a policy optimization
step. We prove EPIC is invariant on an equivalence class of reward functions
that always induce the same optimal policy. Furthermore, we find EPIC can be
efficiently approximated and is more robust than baselines to the choice of
coverage distribution. Finally, we show that EPIC distance bounds the regret of
optimal policies even under different transition dynamics, and we confirm
empirically that it predicts policy training success. Our source code is
available at https://github.com/HumanCompatibleAI/evaluating-rewards.
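As a concrete illustration, the two steps of EPIC (canonical shaping, then Pearson distance) can be sketched for a small tabular reward. This is a minimal sketch assuming uniform coverage distributions over states and actions and exact expectations; it is not the paper's reference implementation.

```python
import numpy as np

def canonicalize(R, gamma):
    """Canonically shape a tabular reward R[s, a, s'] under uniform D_S, D_A.

    C(R)(s,a,s') = R(s,a,s')
                   + E[gamma * R(s', A, S')]   # bonus for landing in s'
                   - E[R(s, A, S')]            # penalty for being in s
                   - gamma * E[R(S, A, S')]    # constant offset
    """
    mean_from = R.mean(axis=(1, 2))   # E_{A,S'}[R(s, A, S')], indexed by s
    mean_all = R.mean()               # E_{S,A,S'}[R(S, A, S')]
    return (R
            + gamma * mean_from[None, None, :]
            - mean_from[:, None, None]
            - gamma * mean_all)

def epic_distance(R1, R2, gamma=0.9):
    """EPIC distance: Pearson distance between canonicalized rewards."""
    x = canonicalize(R1, gamma).ravel()
    y = canonicalize(R2, gamma).ravel()
    rho = np.corrcoef(x, y)[0, 1]     # Pearson correlation under uniform coverage
    return np.sqrt(max(0.0, (1.0 - rho) / 2.0))
```

Two rewards that differ only by potential shaping and positive rescaling land at distance near zero: canonicalization cancels the shaping term exactly (when S and S' share a distribution), and Pearson correlation is invariant to positive affine transforms.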
Related papers
- A Generalized Acquisition Function for Preference-based Reward Learning [12.158619866176487]
Preference-based reward learning is a popular technique for teaching robots and autonomous systems how a human user wants them to perform a task.
Previous works have shown that actively synthesizing preference queries to maximize information gain about the reward function parameters improves data efficiency.
We show that it is possible to optimize for learning the reward function up to a behavioral equivalence class, such as inducing the same ranking over behaviors, distribution over choices, or other related definitions of what makes two rewards similar.
arXiv Detail & Related papers (2024-03-09T20:32:17Z)
- STARC: A General Framework For Quantifying Differences Between Reward Functions [55.33869271912095]
We provide a class of pseudometrics on the space of all reward functions that we call STARC metrics.
We show that STARC metrics induce both an upper and a lower bound on worst-case regret.
We also identify a number of issues with reward metrics proposed by earlier works.
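The canonicalize-normalize-compare template behind a STARC metric can be sketched as follows. The specific choices below (a shaping-removing canonicalization under uniform distributions, rescaling to unit norm, and the L2 metric) are one admissible combination assumed for illustration, not the paper's canonical choice.

```python
import numpy as np

def shaping_canon(R, gamma=0.9):
    # Remove potential shaping from tabular R[s, a, s'] by subtracting the
    # expected shaping term under uniform state/action distributions.
    mean_from = R.mean(axis=(1, 2))
    return (R + gamma * mean_from[None, None, :]
              - mean_from[:, None, None] - gamma * R.mean())

def starc_distance(R1, R2, gamma=0.9):
    # STARC template: d(R1, R2) = m(n(c(R1)), n(c(R2))), where c is a
    # canonicalization, n rescales to unit norm, and m is a metric (here L2).
    def n(R):
        v = shaping_canon(R, gamma).ravel()
        norm = np.linalg.norm(v)
        return v / norm if norm > 0 else v
    return np.linalg.norm(n(R1) - n(R2))
```

Because canonicalization removes shaping and normalization removes positive scaling, rewards in the same equivalence class come out at distance zero, which is what lets such pseudometrics bound worst-case regret.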
arXiv Detail & Related papers (2023-09-26T20:31:19Z)
- Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z)
- Invariance in Policy Optimisation and Partial Identifiability in Reward Learning [67.4640841144101]
We characterise the partial identifiability of the reward function given popular reward learning data sources.
We also analyse the impact of this partial identifiability for several downstream tasks, such as policy optimisation.
arXiv Detail & Related papers (2022-03-14T20:19:15Z)
- Dynamics-Aware Comparison of Learned Reward Functions [21.159457412742356]
The ability to learn reward functions plays an important role in enabling the deployment of intelligent agents in the real world.
Reward functions are typically compared by considering the behavior of optimized policies, but this approach conflates deficiencies in the reward function with those of the policy search algorithm used to optimize it.
We propose the Dynamics-Aware Reward Distance (DARD), a new reward pseudometric.
arXiv Detail & Related papers (2022-01-25T03:48:00Z)
- Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z)
- Replacing Rewards with Examples: Example-Based Policy Search via Recursive Classification [133.20816939521941]
In the standard Markov decision process formalism, users specify tasks by writing down a reward function.
In many scenarios, the user is unable to describe the task in words or numbers, but can readily provide examples of what the world would look like if the task were solved.
Motivated by this observation, we derive a control algorithm that aims to visit states that have a high probability of leading to successful outcomes, given only examples of successful outcome states.
arXiv Detail & Related papers (2021-03-23T16:19:55Z)
- Learning One Representation to Optimize All Rewards [19.636676744015197]
We introduce the forward-backward (FB) representation of the dynamics of a reward-free Markov decision process.
It provides explicit near-optimal policies for any reward specified a posteriori.
This is a step towards learning controllable agents in arbitrary black-box environments.
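The readout step of the FB scheme (recovering a policy for a reward specified a posteriori) can be sketched in a deliberately myopic toy: one-step successor features stand in for the full discounted successor measure, and all array names and shapes are assumptions for illustration.

```python
import numpy as np

def fb_greedy_policy(F, B, rho, r):
    # FB readout: embed the reward as z = E_{s~rho}[r(s) B(s)],
    # then act greedily on the forward scores F(s, a) . z.
    z = (rho * r) @ B                  # reward embedding, shape (d,)
    return np.argmax(F @ z, axis=1)    # greedy action in each state

# Toy check: with B = identity and F(s, a) = one-hot of the successor state,
# the greedy policy simply steers toward the highest-reward next state.
succ = np.array([[0, 1], [2, 1], [1, 0]])   # succ[s, a] = next state
F = np.eye(3)[succ]                          # shape (3 states, 2 actions, d=3)
B = np.eye(3)
policy = fb_greedy_policy(F, B, np.full(3, 1/3), np.array([0.0, 1.0, 0.0]))
```

In the full method, F and B are trained so that F(s, a) . B(s') approximates the discounted successor measure, so the same greedy readout accounts for long-horizon returns rather than one-step rewards.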
arXiv Detail & Related papers (2021-03-14T15:00:08Z)
- Difference Rewards Policy Gradients [17.644110838053134]
We propose a novel algorithm that combines difference rewards with policy gradients to allow for learning decentralized policies.
By differencing the reward function directly, Dr.Reinforce avoids difficulties associated with learning the Q-function.
We show the effectiveness of a version of Dr.Reinforce that learns an additional reward network that is used to estimate the difference rewards.
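The difference-reward quantity itself is simple to state: an agent's contribution is the global reward minus the reward under a counterfactual default action. The sketch below (function names and the default-action convention are hypothetical) queries the reward function twice; Dr.Reinforce's point is to instead estimate this quantity with a learned reward network.

```python
def difference_reward(global_reward, joint_action, agent, default_action=0):
    # D_i = R(a) - R(a with agent i's action replaced by a default):
    # the part of the global reward attributable to agent i's choice.
    counterfactual = list(joint_action)
    counterfactual[agent] = default_action
    return global_reward(joint_action) - global_reward(counterfactual)
```

For example, with `global_reward = sum`, `difference_reward(sum, [1, 2, 3], agent=1)` is 2: exactly agent 1's marginal contribution, with the other agents' shared contribution cancelled out.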
arXiv Detail & Related papers (2020-12-21T11:23:17Z)
- Provably Efficient Reward-Agnostic Navigation with Linear Value Iteration [143.43658264904863]
We show how exploration under a more standard notion of low inherent Bellman error, typically employed in least-squares value iteration-style algorithms, can provide strong PAC guarantees on learning a near-optimal value function.
We present a computationally tractable algorithm for the reward-free setting and show how it can be used to learn a near-optimal policy for any (linear) reward function.
arXiv Detail & Related papers (2020-08-18T04:34:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.