Learning Guidance Rewards with Trajectory-space Smoothing
- URL: http://arxiv.org/abs/2010.12718v1
- Date: Fri, 23 Oct 2020 23:55:06 GMT
- Title: Learning Guidance Rewards with Trajectory-space Smoothing
- Authors: Tanmay Gangwani, Yuan Zhou, Jian Peng
- Abstract summary: Long-term temporal credit assignment is an important challenge in deep reinforcement learning.
Existing policy-gradient and Q-learning algorithms rely on dense environmental rewards that provide rich short-term supervision.
Recent works have proposed algorithms to learn dense "guidance" rewards that could be used in place of the sparse or delayed environmental rewards.
- Score: 22.456737935789103
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long-term temporal credit assignment is an important challenge in deep
reinforcement learning (RL). It refers to the ability of the agent to attribute
actions to consequences that may occur after a long time interval. Existing
policy-gradient and Q-learning algorithms typically rely on dense environmental
rewards that provide rich short-term supervision and help with credit
assignment. However, they struggle to solve tasks with delays between an action
and the corresponding rewarding feedback. To make credit assignment easier,
recent works have proposed algorithms to learn dense "guidance" rewards that
could be used in place of the sparse or delayed environmental rewards. This
paper is in the same vein -- starting with a surrogate RL objective that
involves smoothing in the trajectory-space, we arrive at a new algorithm for
learning guidance rewards. We show that the guidance rewards have an intuitive
interpretation, and can be obtained without training any additional neural
networks. Due to the ease of integration, we use the guidance rewards in a few
popular algorithms (Q-learning, Actor-Critic, Distributional-RL) and present
results in single-agent and multi-agent tasks that elucidate the benefit of our
approach when the environmental rewards are sparse or delayed.
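To make the abstract concrete, the sketch below illustrates the kind of guidance-reward relabeling described above, under stated assumptions: every transition of a completed trajectory is assigned a dense reward derived from the (min-max normalized) trajectory return, and the relabeled transitions are then fed to an ordinary tabular Q-learning update. The `GuidanceRelabeler` class, the normalization constants, and the toy update loop are illustrative choices, not the paper's reference implementation.

```python
import numpy as np


class GuidanceRelabeler:
    """Replace sparse/delayed environmental rewards with dense guidance rewards.

    Each transition of a completed trajectory is relabeled with the trajectory's
    total return, min-max normalized over the returns seen so far (an assumed
    normalization scheme; the paper defines its own smoothing/normalization).
    """

    def __init__(self, eps=1e-8):
        self.ret_min = float("inf")
        self.ret_max = float("-inf")
        self.eps = eps

    def relabel(self, trajectory):
        # trajectory: list of (state, action, env_reward, next_state, done)
        episodic_return = sum(r for _, _, r, _, _ in trajectory)
        self.ret_min = min(self.ret_min, episodic_return)
        self.ret_max = max(self.ret_max, episodic_return)
        guidance = (episodic_return - self.ret_min) / (
            self.ret_max - self.ret_min + self.eps
        )
        # Every transition in the trajectory receives the same dense reward.
        return [(s, a, guidance, s2, d) for (s, a, _, s2, d) in trajectory]


def q_learning_update(Q, transitions, alpha=0.1, gamma=0.99):
    """Standard tabular Q-learning step, applied to relabeled transitions."""
    for s, a, r, s2, done in transitions:
        target = r if done else r + gamma * np.max(Q[s2])
        Q[s, a] += alpha * (target - Q[s, a])
    return Q
```

Because the guidance rewards here are computed directly from observed returns, no additional neural network is trained, which is consistent with the ease-of-integration claim in the abstract; the precise trajectory-space smoothing objective is defined in the paper itself.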
Related papers
- Accelerating Exploration with Unlabeled Prior Data [66.43995032226466]
We study how prior data without reward labels may be used to guide and accelerate exploration for an agent solving a new sparse reward task.
We propose a simple approach that learns a reward model from online experience, labels the unlabeled prior data with optimistic rewards, and then uses it concurrently alongside the online data for downstream policy and critic optimization.
arXiv Detail & Related papers (2023-11-09T00:05:17Z) - ForkMerge: Mitigating Negative Transfer in Auxiliary-Task Learning [59.08197876733052]
Auxiliary-Task Learning (ATL) aims to improve the performance of the target task by leveraging the knowledge obtained from related tasks.
Sometimes, learning multiple tasks simultaneously results in lower accuracy than learning only the target task, a phenomenon known as negative transfer.
ForkMerge is a novel approach that periodically forks the model into multiple branches and automatically searches over the varying task weights.
arXiv Detail & Related papers (2023-01-30T02:27:02Z) - Actively Learning Costly Reward Functions for Reinforcement Learning [56.34005280792013]
We show that it is possible to train agents in complex real-world environments orders of magnitude faster.
By enabling the application of reinforcement learning methods to new domains, we show that we can find interesting and non-trivial solutions.
arXiv Detail & Related papers (2022-11-23T19:17:20Z) - Basis for Intentions: Efficient Inverse Reinforcement Learning using
Past Experience [89.30876995059168]
This paper addresses the problem of inverse reinforcement learning (IRL) -- inferring the reward function of an agent from observations of its behavior.
arXiv Detail & Related papers (2022-08-09T17:29:49Z) - Learning Long-Term Reward Redistribution via Randomized Return
Decomposition [18.47810850195995]
We consider the problem formulation of episodic reinforcement learning with trajectory feedback.
This setting involves an extreme delay of reward signals, in which the agent obtains only a single reward signal at the end of each trajectory.
We propose a novel reward redistribution algorithm, randomized return decomposition (RRD), to learn a proxy reward function for episodic reinforcement learning (a rough sketch of this return-decomposition idea appears after this list).
arXiv Detail & Related papers (2021-11-26T13:23:36Z) - MURAL: Meta-Learning Uncertainty-Aware Rewards for Outcome-Driven
Reinforcement Learning [65.52675802289775]
We show that an uncertainty-aware classifier can solve challenging reinforcement learning problems.
We propose a novel method for computing the normalized maximum likelihood (NML) distribution.
We show that the resulting algorithm has a number of intriguing connections to both count-based exploration methods and prior algorithms for learning reward functions.
arXiv Detail & Related papers (2021-07-15T08:19:57Z) - Computational Benefits of Intermediate Rewards for Hierarchical Planning [42.579256546135866]
We show that using intermediate rewards reduces the computational complexity of finding a successful policy but does not guarantee finding the shortest path.
We also corroborate our theoretical results with extensive experiments on the MiniGrid environments using Q-learning and other popular deep RL algorithms.
arXiv Detail & Related papers (2021-07-08T16:39:13Z) - Off-Policy Reinforcement Learning with Delayed Rewards [16.914712720033524]
In many real-world tasks, instant rewards are not readily accessible or defined immediately after the agent performs actions.
In this work, we first formally define the environment with delayed rewards and discuss the challenges that arise from the non-Markovian nature of such environments.
We introduce a general off-policy RL framework with a new Q-function formulation that can handle the delayed rewards with theoretical convergence guarantees.
arXiv Detail & Related papers (2021-06-22T15:19:48Z) - Semi-supervised reward learning for offline reinforcement learning [71.6909757718301]
Training agents usually requires reward functions, but rewards are seldom available in practice and their engineering is challenging and laborious.
We propose semi-supervised learning algorithms that learn from limited annotations and incorporate unlabelled data.
In our experiments with a simulated robotic arm, we greatly improve upon behavioural cloning and closely approach the performance achieved with ground truth rewards.
arXiv Detail & Related papers (2020-12-12T20:06:15Z) - Learning Intrinsic Symbolic Rewards in Reinforcement Learning [7.101885582663675]
We present a method that discovers dense rewards in the form of low-dimensional symbolic trees.
We show that the discovered dense rewards are an effective signal for an RL policy to solve the benchmark tasks.
arXiv Detail & Related papers (2020-10-08T00:02:46Z)
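For contrast with the trajectory-space smoothing approach above, the sketch below illustrates the return-decomposition idea behind the randomized return decomposition (RRD) entry in this list, under stated assumptions: a small reward network is trained so that a scaled sum of its per-step predictions over a randomly sampled subset of timesteps regresses toward the observed episodic return. The `RewardModel` architecture, subset size, and scaling factor are illustrative choices rather than the exact formulation from that paper.

```python
import torch
import torch.nn as nn


class RewardModel(nn.Module):
    """Predicts a dense per-step proxy reward from (state, action) features."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def rrd_loss(model, obs, act, episodic_return, subset_size=32):
    """Return-decomposition loss on a random subset of timesteps (sketch).

    Regresses the scaled sum of predicted per-step rewards toward the
    observed episodic return.
    """
    T = obs.shape[0]
    idx = torch.randperm(T)[: min(subset_size, T)]
    pred = model(obs[idx], act[idx])             # per-step proxy rewards
    scaled_sum = pred.sum() * (T / idx.numel())  # estimate of the full-trajectory sum
    return (scaled_sum - episodic_return) ** 2


# Example usage with random tensors (shapes only; not a real environment):
obs, act = torch.randn(100, 8), torch.randn(100, 2)
model = RewardModel(obs_dim=8, act_dim=2)
loss = rrd_loss(model, obs, act, episodic_return=torch.tensor(1.0))
loss.backward()
```

Scaling the subset sum by T / |subset| keeps it an unbiased estimate of the sum over the whole trajectory, which is what lets random subsampling stand in for decomposing the entire episodic return.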
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.