Replacing Rewards with Examples: Example-Based Policy Search via
Recursive Classification
- URL: http://arxiv.org/abs/2103.12656v1
- Date: Tue, 23 Mar 2021 16:19:55 GMT
- Title: Replacing Rewards with Examples: Example-Based Policy Search via
Recursive Classification
- Authors: Benjamin Eysenbach, Sergey Levine, and Ruslan Salakhutdinov
- Abstract summary: In the standard Markov decision process formalism, users specify tasks by writing down a reward function.
In many scenarios, the user is unable to describe the task in words or numbers, but can readily provide examples of what the world would look like if the task were solved.
Motivated by this observation, we derive a control algorithm that aims to visit states that have a high probability of leading to successful outcomes, given only examples of successful outcome states.
- Score: 133.20816939521941
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the standard Markov decision process formalism, users specify tasks by
writing down a reward function. However, in many scenarios, the user is unable
to describe the task in words or numbers, but can readily provide examples of
what the world would look like if the task were solved. Motivated by this
observation, we derive a control algorithm from first principles that aims to
visit states that have a high probability of leading to successful outcomes,
given only examples of successful outcome states. Prior work has approached
similar problem settings in a two-stage process, first learning an auxiliary
reward function and then optimizing this reward function using another
reinforcement learning algorithm. In contrast, we derive a method based on
recursive classification that eschews auxiliary reward functions and instead
directly learns a value function from transitions and successful outcomes. Our
method therefore requires fewer hyperparameters to tune and lines of code to
debug. We show that our method satisfies a new data-driven Bellman equation,
where examples take the place of the typical reward function term. Experiments
show that our approach outperforms prior methods that learn explicit reward
functions.
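To make the recursive-classification idea concrete, below is a minimal PyTorch sketch of one classifier update, written from the abstract above rather than from the authors' released code. It assumes a classifier C(s, a) in (0, 1) that estimates the probability that the future contains a success example, a `policy` callable mapping states to actions, and a target copy of the classifier for bootstrapping; the (1 - gamma) weight on success examples and the soft label gamma*w / (gamma*w + 1) on transitions are my reading of the recursive Bellman-style update, so treat them as illustrative rather than definitive.

```python
# A minimal sketch of a recursive-classification update, written from the
# abstract above rather than from the authors' released code. The MLP sizes,
# the `policy` callable, and the exact loss weights are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

GAMMA = 0.99  # discount factor (assumed)

class SuccessClassifier(nn.Module):
    """C(s, a) in (0, 1): probability that the future contains a success example."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(torch.cat([s, a], dim=-1))).squeeze(-1)

def recursive_classification_loss(classifier, target_classifier, policy,
                                  success_states, s, a, s_next):
    """Success examples supply the positive labels (taking the place of a
    reward term); ordinary transitions get a bootstrapped soft label from
    the classifier's own prediction at the next state-action pair."""
    # Term 1: classify (s*, a ~ pi(.|s*)) at success-example states as positive.
    a_star = policy(success_states)
    c_pos = classifier(success_states, a_star)
    loss_pos = (1.0 - GAMMA) * F.binary_cross_entropy(c_pos, torch.ones_like(c_pos))

    # Term 2: soft label gamma*w / (gamma*w + 1), where w is the odds ratio of
    # the target classifier at the next state-action (no gradient through it).
    with torch.no_grad():
        a_next = policy(s_next)
        c_next = target_classifier(s_next, a_next)
        w = c_next / (1.0 - c_next).clamp(min=1e-6)
        soft_label = GAMMA * w / (GAMMA * w + 1.0)
        weight = GAMMA * w + 1.0
    c_pred = classifier(s, a)
    per_example = F.binary_cross_entropy(c_pred, soft_label, reduction="none")
    loss_trans = (weight * per_example).mean()

    return loss_pos + loss_trans
```

The point the sketch tries to show is that success examples enter exactly where a reward term normally would: they are the only source of positive labels, while ordinary transitions are labeled by the classifier's own prediction at the next state.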
Related papers
- Walking the Values in Bayesian Inverse Reinforcement Learning [66.68997022043075]
A key challenge in Bayesian IRL is bridging the computational gap between the hypothesis space of possible rewards and the likelihood.
We propose ValueWalk, a new Markov chain Monte Carlo method that addresses this gap.
arXiv Detail & Related papers (2024-07-15T17:59:52Z)
- A Generalized Acquisition Function for Preference-based Reward Learning [12.158619866176487]
Preference-based reward learning is a popular technique for teaching robots and autonomous systems how a human user wants them to perform a task.
Previous works have shown that actively synthesizing preference queries to maximize information gain about the reward function parameters improves data efficiency.
We show that it is possible to optimize for learning the reward function up to a behavioral equivalence class, such as inducing the same ranking over behaviors, distribution over choices, or other related definitions of what makes two rewards similar.
arXiv Detail & Related papers (2024-03-09T20:32:17Z)
- STARC: A General Framework For Quantifying Differences Between Reward Functions [55.33869271912095]
We provide a class of pseudometrics on the space of all reward functions that we call STARC metrics.
We show that STARC metrics induce both an upper and a lower bound on worst-case regret.
We also identify a number of issues with reward metrics proposed by earlier works.
arXiv Detail & Related papers (2023-09-26T20:31:19Z)
- Basis for Intentions: Efficient Inverse Reinforcement Learning using Past Experience [89.30876995059168]
This paper addresses the problem of inverse reinforcement learning (IRL): inferring the reward function of an agent from observations of its behavior.
arXiv Detail & Related papers (2022-08-09T17:29:49Z)
- Preprocessing Reward Functions for Interpretability [2.538209532048867]
We propose exploiting the intrinsic structure of reward functions by first preprocessing them into simpler but equivalent reward functions.
Our empirical evaluation shows that preprocessed rewards are often significantly easier to understand than the original reward.
arXiv Detail & Related papers (2022-03-25T10:19:35Z)
- Invariance in Policy Optimisation and Partial Identifiability in Reward Learning [67.4640841144101]
We characterise the partial identifiability of the reward function given popular reward learning data sources.
We also analyse the impact of this partial identifiability for several downstream tasks, such as policy optimisation.
arXiv Detail & Related papers (2022-03-14T20:19:15Z)
- Potential-based Reward Shaping in Sokoban [5.563631490799427]
We study whether we can use a search algorithm (A*) to automatically generate a potential function for reward shaping in Sokoban.
Results showed that learning with a shaped reward function is faster than learning from scratch.
Results indicate that distance functions could be suitable potential functions for Sokoban.
arXiv Detail & Related papers (2021-09-10T06:28:09Z)
- MURAL: Meta-Learning Uncertainty-Aware Rewards for Outcome-Driven Reinforcement Learning [65.52675802289775]
We show that an uncertainty-aware classifier can solve challenging reinforcement learning problems.
We propose a novel method for computing the normalized maximum likelihood (NML) distribution.
We show that the resulting algorithm has a number of intriguing connections to both count-based exploration methods and prior algorithms for learning reward functions.
arXiv Detail & Related papers (2021-07-15T08:19:57Z)
- Reward Shaping with Dynamic Trajectory Aggregation [7.6146285961466]
Potential-based reward shaping is a basic method for enriching rewards (a minimal shaping sketch follows this list).
SARSA-RS learns the potential function during training.
We propose a trajectory aggregation that uses subgoal series.
arXiv Detail & Related papers (2021-04-13T13:07:48Z)
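For the two shaping papers above, the shared mechanism is potential-based reward shaping. The sketch below is my own illustration of that mechanism, not code from either paper: the environment reward is augmented with gamma * Phi(s') - Phi(s) for some potential function Phi, a transformation known to preserve the set of optimal policies (Ng, Harada, and Russell, 1999). The choice of Phi as an A*-derived distance or an aggregated-trajectory potential is an assumption used only in the comments.

```python
# My own illustration of potential-based reward shaping (not code from either
# paper above): add gamma * Phi(s') - Phi(s) to the environment reward. The
# `potential` callable is an assumption; e.g. a negated A*/distance-to-goal
# heuristic for Sokoban, or an aggregated-trajectory potential as in SARSA-RS.
GAMMA = 0.99  # discount factor (assumed)

def shaped_reward(reward: float, potential, s, s_next, done: bool) -> float:
    """Potential-based shaping preserves the set of optimal policies
    (Ng, Harada, and Russell, 1999)."""
    phi_next = 0.0 if done else potential(s_next)  # zero potential at terminal states
    return reward + GAMMA * phi_next - potential(s)
```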