Off-Policy Reinforcement Learning with Delayed Rewards
- URL: http://arxiv.org/abs/2106.11854v1
- Date: Tue, 22 Jun 2021 15:19:48 GMT
- Title: Off-Policy Reinforcement Learning with Delayed Rewards
- Authors: Beining Han, Zhizhou Ren, Zuofan Wu, Yuan Zhou, Jian Peng
- Abstract summary: In many real-world tasks, instant rewards are not readily accessible or defined immediately after the agent performs actions.
In this work, we first formally define the environment with delayed rewards and discuss the challenges that arise from the non-Markovian nature of such environments.
We introduce a general off-policy RL framework with a new Q-function formulation that can handle the delayed rewards with theoretical convergence guarantees.
- Score: 16.914712720033524
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study deep reinforcement learning (RL) algorithms with delayed rewards. In
many real-world tasks, instant rewards are often not readily accessible or even
defined immediately after the agent performs actions. In this work, we first
formally define the environment with delayed rewards and discuss the challenges
that arise from the non-Markovian nature of such environments. Then, we introduce
a general off-policy RL framework with a new Q-function formulation that can
handle the delayed rewards with theoretical convergence guarantees. For
practical tasks with high-dimensional state spaces, we further introduce the
HC-decomposition rule of the Q-function in our framework, which naturally leads
to an approximation scheme that boosts training efficiency and stability.
Finally, we conduct extensive experiments to demonstrate the superior
performance of our algorithms over existing methods and their variants.
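As a rough illustration of the delayed-reward setting defined above (a minimal sketch, not the authors' implementation), one can wrap a standard Gym-style environment so that instant rewards are withheld and only their accumulated sum is observed every few steps; the wrapper name, the `delay` parameter, and the gymnasium interface are assumptions made for this example.

```python
import gymnasium as gym


class DelayedRewardWrapper(gym.Wrapper):
    """Illustrative only: withholds instant rewards and emits their sum every
    `delay` steps (and at episode end), so the observed reward is non-Markovian."""

    def __init__(self, env, delay=10):
        super().__init__(env)
        self.delay = delay        # assumed delay length, not taken from the paper
        self._pending = 0.0       # accumulated but not-yet-observed instant rewards
        self._steps = 0

    def reset(self, **kwargs):
        self._pending, self._steps = 0.0, 0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self._pending += reward
        self._steps += 1
        if self._steps % self.delay == 0 or terminated or truncated:
            delayed, self._pending = self._pending, 0.0   # release the accumulated reward
        else:
            delayed = 0.0                                  # nothing observed at this step
        return obs, delayed, terminated, truncated, info


# Example usage (environment name is arbitrary):
# env = DelayedRewardWrapper(gym.make("HalfCheetah-v4"), delay=10)
```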
Related papers
- Rethinking Inverse Reinforcement Learning: from Data Alignment to Task Alignment [7.477559660351106]
Imitation learning (IL) algorithms use inverse reinforcement learning (IRL) to infer a reward function that aligns with a demonstration.
We propose a novel framework for IRL-based IL that prioritizes task alignment over conventional data alignment.
arXiv Detail & Related papers (2024-10-31T07:08:14Z)
- Beyond Simple Sum of Delayed Rewards: Non-Markovian Reward Modeling for Reinforcement Learning [44.770495418026734]
Reinforcement Learning (RL) empowers agents to acquire various skills by learning from reward signals.
Traditional methods assume the existence of underlying Markovian rewards and that the observed delayed reward is simply the sum of instance-level rewards.
We propose Composite Delayed Reward Transformer (CoDeTr), which incorporates a specialized in-sequence attention mechanism.
arXiv Detail & Related papers (2024-10-26T13:12:27Z)
- Sample-Efficient Curriculum Reinforcement Learning for Complex Reward Functions [5.78463306498655]
Reinforcement learning (RL) shows promise in control problems, but its practical application is often hindered by the complexity arising from intricate reward functions with constraints.
We propose a novel two-stage reward curriculum combined with a flexible replay buffer that adaptively samples experiences.
Our approach first learns on a subset of rewards before transitioning to the full reward, allowing the agent to learn trade-offs between objectives and constraints.
arXiv Detail & Related papers (2024-10-22T08:07:44Z)
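Purely as a sketch of the two-stage idea in the entry above (the reward terms, weights, and switching step are invented for illustration and are not taken from the paper), the scalar reward can start from a subset of objective terms and switch to the full constrained objective later in training:

```python
def curriculum_reward(components: dict, step: int, switch_step: int = 200_000) -> float:
    """Illustrative two-stage reward curriculum.

    components: assumed keys 'task', 'constraint_a', 'constraint_b' holding scalar terms.
    Stage 1 (step < switch_step): only the main task term.
    Stage 2 (step >= switch_step): full reward including constraint penalties.
    """
    reward = components["task"]
    if step >= switch_step:  # assumed switching point between curriculum stages
        reward += components["constraint_a"] + components["constraint_b"]
    return reward
```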
- Hierarchical Decomposition of Prompt-Based Continual Learning: Rethinking Obscured Sub-optimality [55.88910947643436]
Self-supervised pre-training is essential for handling vast quantities of unlabeled data in practice.
HiDe-Prompt is an innovative approach that explicitly optimizes the hierarchical components with an ensemble of task-specific prompts and statistics.
Our experiments demonstrate the superior performance of HiDe-Prompt and its robustness to pre-training paradigms in continual learning.
arXiv Detail & Related papers (2023-10-11T06:51:46Z)
- Dealing with Sparse Rewards in Continuous Control Robotics via Heavy-Tailed Policies [64.2210390071609]
We present a novel Heavy-Tailed Policy Gradient (HT-PSG) algorithm to deal with the challenges of sparse rewards in continuous control problems.
We show consistent performance improvement across all tasks in terms of average cumulative reward.
arXiv Detail & Related papers (2022-06-12T04:09:39Z)
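The summary above does not specify how the heavy-tailed policy is parameterized; as one illustrative possibility (not necessarily the HT-PSG construction), a Cauchy distribution can replace the usual Gaussian policy head so that large exploratory actions are sampled more often under sparse rewards:

```python
import torch
import torch.nn as nn
from torch.distributions import Cauchy


class HeavyTailedPolicy(nn.Module):
    """Illustrative heavy-tailed policy head: Cauchy instead of Gaussian."""

    def __init__(self, obs_dim=17, act_dim=6):   # placeholder dimensions
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 256), nn.Tanh())
        self.loc = nn.Linear(256, act_dim)
        self.log_scale = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        h = self.body(obs)
        dist = Cauchy(self.loc(h), self.log_scale.exp())
        action = dist.sample()
        # log-probability for a standard policy-gradient update
        return action, dist.log_prob(action).sum(-1)
```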
- Learning Dynamics and Generalization in Reinforcement Learning [59.530058000689884]
We show theoretically that temporal difference learning encourages agents to fit non-smooth components of the value function early in training.
We show that neural networks trained using temporal difference algorithms on dense reward tasks exhibit weaker generalization between states than randomly initialized networks and networks trained with policy gradient methods.
arXiv Detail & Related papers (2022-06-05T08:49:16Z)
- A State-Distribution Matching Approach to Non-Episodic Reinforcement Learning [61.406020873047794]
A major hurdle to real-world application arises from the development of algorithms in an episodic setting.
We propose a new method, MEDAL, that trains the backward policy to match the state distribution in the provided demonstrations.
Our experiments show that MEDAL matches or outperforms prior methods on three sparse-reward continuous control tasks.
arXiv Detail & Related papers (2022-05-11T00:06:29Z)
- Learning Long-Term Reward Redistribution via Randomized Return Decomposition [18.47810850195995]
We consider the problem formulation of episodic reinforcement learning with trajectory feedback.
It refers to an extreme delay of reward signals, in which the agent can only obtain one reward signal at the end of each trajectory.
We propose a novel reward redistribution algorithm, randomized return decomposition (RRD), to learn a proxy reward function for episodic reinforcement learning.
arXiv Detail & Related papers (2021-11-26T13:23:36Z)
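As a rough sketch of the reward-redistribution idea in the entry above (network size, subset size, and the rescaling factor are assumptions, not details from the paper), a per-step proxy reward can be regressed so that its scaled sum over a random subset of trajectory steps matches the single end-of-trajectory return:

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2                      # placeholder dimensions
proxy_reward = nn.Sequential(                # assumed proxy-reward model r_hat(s, a)
    nn.Linear(obs_dim + act_dim, 256), nn.ReLU(), nn.Linear(256, 1)
)
optimizer = torch.optim.Adam(proxy_reward.parameters(), lr=3e-4)


def redistribute_update(states, actions, episodic_return, subset_size=32):
    """One regression step on a trajectory that only carries a terminal return.

    states: (T, obs_dim) tensor; actions: (T, act_dim) tensor; episodic_return: float.
    """
    T = states.shape[0]
    idx = torch.randperm(T)[:subset_size]                       # random subset of steps
    pred = proxy_reward(torch.cat([states[idx], actions[idx]], dim=-1)).sum()
    pred = pred * (T / min(subset_size, T))                     # rescale to a full-trajectory sum
    loss = (pred - episodic_return) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```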
- Learning Guidance Rewards with Trajectory-space Smoothing [22.456737935789103]
Long-term temporal credit assignment is an important challenge in deep reinforcement learning.
Existing policy-gradient and Q-learning algorithms rely on dense environmental rewards that provide rich short-term supervision.
Recent works have proposed algorithms to learn dense "guidance" rewards that could be used in place of the sparse or delayed environmental rewards.
arXiv Detail & Related papers (2020-10-23T23:55:06Z)
- SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep Reinforcement Learning [102.78958681141577]
We present SUNRISE, a simple unified ensemble method, which is compatible with various off-policy deep reinforcement learning algorithms.
SUNRISE integrates two key ingredients: (a) ensemble-based weighted Bellman backups, which re-weight target Q-values based on uncertainty estimates from a Q-ensemble, and (b) an inference method that selects actions using the highest upper-confidence bounds for efficient exploration.
arXiv Detail & Related papers (2020-07-09T17:08:44Z)
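The UCB-style action selection described in the entry above can be sketched roughly as follows; the discrete candidate-action interface, the ensemble format, and `beta` are assumptions for the example rather than the SUNRISE implementation:

```python
import torch


def ucb_action(q_ensemble, state, candidate_actions, beta=1.0):
    """Pick the candidate whose ensemble mean plus beta * ensemble std is largest,
    using disagreement across the Q-ensemble as the uncertainty estimate."""
    # q_ensemble: list of networks mapping (state, action) -> scalar Q-value tensor
    qs = torch.stack([
        torch.cat([q(state, a).reshape(1) for a in candidate_actions])
        for q in q_ensemble
    ])                                          # shape: (n_ensemble, n_candidates)
    ucb = qs.mean(dim=0) + beta * qs.std(dim=0)
    return candidate_actions[ucb.argmax().item()]
```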
- Corruption-robust exploration in episodic reinforcement learning [76.19192549843727]
We study multi-stage episodic reinforcement learning under adversarial corruptions in both the rewards and the transition probabilities of the underlying system.
Our framework yields efficient algorithms which attain near-optimal regret in the absence of corruptions.
Notably, our work provides the first sublinear regret guarantee that accommodates any deviation from purely i.i.d. transitions in the bandit-feedback model for episodic reinforcement learning.
arXiv Detail & Related papers (2019-11-20T03:49:13Z)