Learning Guidance Rewards with Trajectory-space Smoothing
- URL: http://arxiv.org/abs/2010.12718v1
- Date: Fri, 23 Oct 2020 23:55:06 GMT
- Title: Learning Guidance Rewards with Trajectory-space Smoothing
- Authors: Tanmay Gangwani, Yuan Zhou, Jian Peng
- Abstract summary: Long-term temporal credit assignment is an important challenge in deep reinforcement learning.
Existing policy-gradient and Q-learning algorithms rely on dense environmental rewards that provide rich short-term supervision.
Recent works have proposed algorithms to learn dense "guidance" rewards that could be used in place of the sparse or delayed environmental rewards.
- Score: 22.456737935789103
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long-term temporal credit assignment is an important challenge in deep
reinforcement learning (RL). It refers to the ability of the agent to attribute
actions to consequences that may occur after a long time interval. Existing
policy-gradient and Q-learning algorithms typically rely on dense environmental
rewards that provide rich short-term supervision and help with credit
assignment. However, they struggle to solve tasks with delays between an action
and the corresponding rewarding feedback. To make credit assignment easier,
recent works have proposed algorithms to learn dense "guidance" rewards that
could be used in place of the sparse or delayed environmental rewards. This
paper is in the same vein -- starting with a surrogate RL objective that
involves smoothing in the trajectory-space, we arrive at a new algorithm for
learning guidance rewards. We show that the guidance rewards have an intuitive
interpretation, and can be obtained without training any additional neural
networks. Due to the ease of integration, we use the guidance rewards in a few
popular algorithms (Q-learning, Actor-Critic, Distributional-RL) and present
results in single-agent and multi-agent tasks that elucidate the benefit of our
approach when the environmental rewards are sparse or delayed.
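To make the abstract concrete, the sketch below illustrates the kind of guidance-reward relabeling described above, under stated assumptions: every transition of a completed trajectory is assigned a dense reward derived from the (min-max normalized) trajectory return, and the relabeled transitions are then fed to an ordinary tabular Q-learning update. The `GuidanceRelabeler` class, the normalization constants, and the toy update loop are illustrative choices, not the paper's reference implementation.

```python
import numpy as np


class GuidanceRelabeler:
    """Replace sparse/delayed environmental rewards with dense guidance rewards.

    Each transition of a completed trajectory is relabeled with the trajectory's
    total return, min-max normalized over the returns seen so far (an assumed
    normalization scheme; the paper defines its own smoothing/normalization).
    """

    def __init__(self, eps=1e-8):
        self.ret_min = float("inf")
        self.ret_max = float("-inf")
        self.eps = eps

    def relabel(self, trajectory):
        # trajectory: list of (state, action, env_reward, next_state, done)
        episodic_return = sum(r for _, _, r, _, _ in trajectory)
        self.ret_min = min(self.ret_min, episodic_return)
        self.ret_max = max(self.ret_max, episodic_return)
        guidance = (episodic_return - self.ret_min) / (
            self.ret_max - self.ret_min + self.eps
        )
        # Every transition in the trajectory receives the same dense reward.
        return [(s, a, guidance, s2, d) for (s, a, _, s2, d) in trajectory]


def q_learning_update(Q, transitions, alpha=0.1, gamma=0.99):
    """Standard tabular Q-learning step, applied to relabeled transitions."""
    for s, a, r, s2, done in transitions:
        target = r if done else r + gamma * np.max(Q[s2])
        Q[s, a] += alpha * (target - Q[s, a])
    return Q
```

Because the guidance rewards here are computed directly from observed returns, no additional neural network is trained, which is consistent with the ease-of-integration claim in the abstract; the precise trajectory-space smoothing objective is defined in the paper itself.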
Related papers
- Accelerating Exploration with Unlabeled Prior Data [66.43995032226466]
We study how prior data without reward labels may be used to guide and accelerate exploration for an agent solving a new sparse reward task.
We propose a simple approach that learns a reward model from online experience, labels the unlabeled prior data with optimistic rewards, and then uses it concurrently alongside the online data for downstream policy and critic optimization.
arXiv Detail & Related papers (2023-11-09T00:05:17Z) - ForkMerge: Mitigating Negative Transfer in Auxiliary-Task Learning [59.08197876733052]
Auxiliary-Task Learning (ATL) aims to improve the performance of the target task by leveraging the knowledge obtained from related tasks.
Sometimes, learning multiple tasks simultaneously results in lower accuracy than learning only the target task, a phenomenon known as negative transfer.
ForkMerge is a novel approach that periodically forks the model into multiple branches and automatically searches over the varying task weights.
arXiv Detail & Related papers (2023-01-30T02:27:02Z) - Actively Learning Costly Reward Functions for Reinforcement Learning [56.34005280792013]
We show that it is possible to train agents in complex real-world environments orders of magnitude faster.
By enabling the application of reinforcement learning methods to new domains, we show that we can find interesting and non-trivial solutions.
arXiv Detail & Related papers (2022-11-23T19:17:20Z) - Basis for Intentions: Efficient Inverse Reinforcement Learning using
Past Experience [89.30876995059168]
This paper addresses the problem of inverse reinforcement learning (IRL) -- inferring the reward function of an agent from observations of its behavior.
arXiv Detail & Related papers (2022-08-09T17:29:49Z) - Learning Long-Term Reward Redistribution via Randomized Return
Decomposition [18.47810850195995]
We consider the problem formulation of episodic reinforcement learning with trajectory feedback.
This setting involves an extreme delay of reward signals, in which the agent obtains only a single reward signal at the end of each trajectory.
We propose a novel reward redistribution algorithm, randomized return decomposition (RRD), to learn a proxy reward function for episodic reinforcement learning (a rough sketch of this return-decomposition idea appears after this list).
arXiv Detail & Related papers (2021-11-26T13:23:36Z) - MURAL: Meta-Learning Uncertainty-Aware Rewards for Outcome-Driven
Reinforcement Learning [65.52675802289775]
We show that an uncertainty-aware classifier can solve challenging reinforcement learning problems.
We propose a novel method for computing the normalized maximum likelihood (NML) distribution.
We show that the resulting algorithm has a number of intriguing connections to both count-based exploration methods and prior algorithms for learning reward functions.
arXiv Detail & Related papers (2021-07-15T08:19:57Z) - Computational Benefits of Intermediate Rewards for Hierarchical Planning [42.579256546135866]
We show that using intermediate rewards reduces the computational complexity of finding a successful policy but does not guarantee finding the shortest path.
We also corroborate our theoretical results with extensive experiments on the MiniGrid environments using Q-learning and other popular deep RL algorithms.
arXiv Detail & Related papers (2021-07-08T16:39:13Z) - Off-Policy Reinforcement Learning with Delayed Rewards [16.914712720033524]
In many real-world tasks, instant rewards are not readily accessible or defined immediately after the agent performs actions.
In this work, we first formally define the environment with delayed rewards and discuss the challenges that arise from the non-Markovian nature of such environments.
We introduce a general off-policy RL framework with a new Q-function formulation that can handle the delayed rewards with theoretical convergence guarantees.
arXiv Detail & Related papers (2021-06-22T15:19:48Z) - Semi-supervised reward learning for offline reinforcement learning [71.6909757718301]
Training agents usually requires reward functions, but rewards are seldom available in practice and their engineering is challenging and laborious.
We propose semi-supervised learning algorithms that learn from limited annotations and incorporate unlabelled data.
In our experiments with a simulated robotic arm, we greatly improve upon behavioural cloning and closely approach the performance achieved with ground truth rewards.
arXiv Detail & Related papers (2020-12-12T20:06:15Z) - Learning Intrinsic Symbolic Rewards in Reinforcement Learning [7.101885582663675]
We present a method that discovers dense rewards in the form of low-dimensional symbolic trees.
We show that the discovered dense rewards are an effective signal for an RL policy to solve the benchmark tasks.
arXiv Detail & Related papers (2020-10-08T00:02:46Z)
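For contrast with the trajectory-space smoothing approach above, the sketch below illustrates the return-decomposition idea behind the randomized return decomposition (RRD) entry in this list, under stated assumptions: a small reward network is trained so that a scaled sum of its per-step predictions over a randomly sampled subset of timesteps regresses toward the observed episodic return. The `RewardModel` architecture, subset size, and scaling factor are illustrative choices rather than the exact formulation from that paper.

```python
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Predicts a dense per-step proxy reward from (state, action) features."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def rrd_loss(model, obs, act, episodic_return, subset_size=32):
    """Return-decomposition loss on a random subset of timesteps (sketch).

    Regresses the scaled sum of predicted per-step rewards toward the
    observed episodic return.
    """
    T = obs.shape[0]
    idx = torch.randperm(T)[: min(subset_size, T)]
    pred = model(obs[idx], act[idx])             # per-step proxy rewards
    scaled_sum = pred.sum() * (T / idx.numel())  # estimate of the full-trajectory sum
    return (scaled_sum - episodic_return) ** 2


# Example usage with random tensors (shapes only; not a real environment):
obs, act = torch.randn(100, 8), torch.randn(100, 2)
model = RewardModel(obs_dim=8, act_dim=2)
loss = rrd_loss(model, obs, act, episodic_return=torch.tensor(1.0))
loss.backward()
```

Scaling the subset sum by T / |subset| keeps it an unbiased estimate of the sum over the whole trajectory, which is what lets random subsampling stand in for decomposing the entire episodic return.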
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.