Reinforcement Learning with Random Delays
- URL: http://arxiv.org/abs/2010.02966v3
- Date: Tue, 4 May 2021 20:27:33 GMT
- Title: Reinforcement Learning with Random Delays
- Authors: Simon Ramstedt, Yann Bouteiller, Giovanni Beltrame, Christopher Pal,
Jonathan Binas
- Abstract summary: We show that partially resampling trajectory fragments in hindsight allows for off-policy multi-step value estimation.
We apply this principle to derive Delay-Correcting Actor-Critic (DCAC), an algorithm based on Soft Actor-Critic with significantly better performance in environments with delays.
- Score: 14.707955337702943
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Action and observation delays commonly occur in many Reinforcement Learning
applications, such as remote control scenarios. We study the anatomy of
randomly delayed environments, and show that partially resampling trajectory
fragments in hindsight allows for off-policy multi-step value estimation. We
apply this principle to derive Delay-Correcting Actor-Critic (DCAC), an
algorithm based on Soft Actor-Critic with significantly better performance in
environments with delays. This is shown theoretically and also demonstrated
practically on a delay-augmented version of the MuJoCo continuous control
benchmark.
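As a rough illustration of the setting (not the paper's implementation), the sketch below wraps a hypothetical `base_env` with a plain reset()/step() interface and injects random action and observation delays, so the agent always acts on stale information:

```python
import random
from collections import deque

class RandomDelayEnv:
    """Minimal sketch of a randomly delayed environment (not the paper's code).

    Actions reach the underlying env after a random delay, and the agent sees
    observations a random number of steps late. `base_env` is a hypothetical
    object whose step(action) returns (obs, reward, done) and accepts
    None as a no-op.
    """

    def __init__(self, base_env, max_delay=3, seed=0):
        self.env = base_env
        self.max_delay = max_delay
        self.rng = random.Random(seed)

    def reset(self):
        first_obs = self.env.reset()
        # Pad the history so delayed reads are always defined.
        self.obs_history = deque([first_obs] * (self.max_delay + 1),
                                 maxlen=self.max_delay + 1)
        self.act_queue = deque()
        return first_obs

    def step(self, action):
        self.act_queue.append(action)
        # Apply the oldest queued action only once its random delay elapses.
        act_delay = self.rng.randint(0, self.max_delay)
        applied = self.act_queue.popleft() if len(self.act_queue) > act_delay else None
        obs, reward, done = self.env.step(applied)
        self.obs_history.append(obs)
        # The agent receives an observation that is obs_delay steps old.
        obs_delay = self.rng.randint(0, self.max_delay)
        return self.obs_history[-(obs_delay + 1)], reward, done
```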
Related papers
- DEER: A Delay-Resilient Framework for Reinforcement Learning with Variable Delays [26.032139258562708]
We propose $\textbf{DEER}$ (Delay-resilient Enhanced RL), a framework designed to enhance interpretability and effectively address random delays.
In a variety of delayed scenarios, the trained encoder can seamlessly integrate with standard RL algorithms without requiring additional modifications.
The results confirm that DEER is superior to state-of-the-art RL algorithms in both constant and random delay settings.
arXiv Detail & Related papers (2024-06-05T09:45:26Z) - Growing Q-Networks: Solving Continuous Control Tasks with Adaptive Control Resolution [51.83951489847344]
In robotics applications, smooth control signals are commonly preferred to reduce system wear and energy consumption.
In this work, we aim to bridge this performance gap by growing discrete action spaces from coarse to fine control resolution.
Our work indicates that adaptive control resolution combined with value decomposition yields simple critic-only algorithms with surprisingly strong performance on continuous control tasks.
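A toy sketch of the coarse-to-fine idea as described in the abstract (the growth schedule and action parameterization here are assumptions, not the paper's):

```python
import numpy as np

def action_grid(low, high, level):
    """Discrete action set at a given resolution level.

    Level 0 is bang-bang control (endpoints only); each level roughly doubles
    the number of bins, refining toward continuous control.
    """
    return np.linspace(low, high, num=2 ** level + 1)

# Grow the resolution over training, e.g. when the critic's progress plateaus.
for level in range(4):
    print(f"level {level}: {action_grid(-1.0, 1.0, level)}")
# level 0: [-1.  1.]  ...  level 3: 9 actions spanning [-1, 1]
```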
arXiv Detail & Related papers (2024-04-05T17:58:37Z) - Stochastic Approximation with Delayed Updates: Finite-Time Rates under Markovian Sampling [73.5602474095954]
We study the non-asymptotic performance of stochastic approximation schemes with delayed updates under Markovian sampling.
Our theoretical findings shed light on the finite-time effects of delays for a broad class of algorithms.
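To make the delayed-update scheme concrete, here is a toy iteration on a quadratic objective where each step uses a stale iterate (i.i.d. noise stands in for the Markovian sampling analyzed in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.array([1.0, -2.0])
theta, alpha, max_delay = np.zeros(2), 0.05, 5
history = [theta.copy()]

for t in range(2000):
    # The update uses the stale iterate theta_{t - tau_t}, tau_t random.
    tau = int(rng.integers(0, min(max_delay, t) + 1))
    stale = history[-(tau + 1)]
    # Noisy gradient of 0.5*||theta - target||^2 evaluated at the stale point.
    grad = (stale - target) + 0.1 * rng.standard_normal(2)
    theta = theta - alpha * grad
    history.append(theta.copy())

print(theta)  # approaches target despite bounded random delays
```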
arXiv Detail & Related papers (2024-02-19T03:08:02Z) - Boosting Reinforcement Learning with Strongly Delayed Feedback Through Auxiliary Short Delays [41.52768902667611]
Reinforcement learning (RL) is challenging in the common case of delays between events and their sensory perceptions.
We present a novel Auxiliary-Delayed Reinforcement Learning (AD-RL) method that leverages auxiliary tasks involving short delays to accelerate RL with long delays.
Specifically, AD-RL learns a value function for short delays and uses bootstrapping and policy improvement techniques to adjust it for long delays.
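As we read the abstract, the adjustment amounts to rolling the extra pending actions forward and bootstrapping from the short-delay value. A toy sketch under that reading, where `model_step` and `v_short` are hypothetical stand-ins, not the paper's interface:

```python
def long_delay_value(v_short, model_step, state, pending_actions, k):
    """Bootstrap a long-delay value from a short-delay value function.

    Roll the first k pending actions through a (learned or simulated) model,
    then evaluate the short-delay value on the remaining pending actions.
    """
    for a in pending_actions[:k]:
        state = model_step(state, a)
    return v_short(state, pending_actions[k:])

# Toy usage: deterministic 1-D integrator, quadratic value stand-in.
model_step = lambda s, a: s + a
v_short = lambda s, pending: -(s + sum(pending)) ** 2
print(long_delay_value(v_short, model_step, 0.0, [0.1, -0.2, 0.3], k=2))
```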
arXiv Detail & Related papers (2024-02-05T16:11:03Z) - Posterior Sampling with Delayed Feedback for Reinforcement Learning with
Linear Function Approximation [62.969796245827006]
Delayed-PSVI is an optimistic value-based algorithm that explores the value function space via noise perturbation with posterior sampling.
We show our algorithm achieves $\widetilde{O}(\sqrt{d^3H^3T} + d^2H^2\mathbb{E}[\tau])$ worst-case regret in the presence of unknown delays.
We incorporate a gradient-based approximate sampling scheme via Langevin dynamics for Delayed-LPSVI.
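For reference, gradient-based approximate sampling via Langevin dynamics typically means the standard unadjusted Langevin iteration below; this is a generic sketch, not Delayed-LPSVI's exact scheme:

```python
import numpy as np

def langevin_sample(grad_log_post, w0, step=1e-2, n_steps=500, seed=0):
    """Unadjusted Langevin algorithm for approximate posterior sampling:
    w <- w + step * grad_log_post(w) + sqrt(2 * step) * gaussian_noise."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    for _ in range(n_steps):
        w = w + step * grad_log_post(w) + np.sqrt(2 * step) * rng.standard_normal(w.shape)
    return w

# Toy usage: sample from N(mu, I) via its log-density gradient mu - w.
mu = np.array([1.0, -1.0])
print(langevin_sample(lambda w: mu - w, np.zeros(2)))
```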
arXiv Detail & Related papers (2023-10-29T06:12:43Z) - Neural Laplace Control for Continuous-time Delayed Systems [76.81202657759222]
We propose a continuous-time model-based offline RL method that combines a Neural Laplace dynamics model with a model predictive control (MPC) planner.
We show experimentally that it achieves near-expert policy performance on continuous-time delayed environments.
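A minimal random-shooting MPC planner over a learned dynamics model is sketched below; `predict(state, action) -> next_state` stands in for the Neural Laplace model (an assumption, the paper's interface is not reproduced here):

```python
import numpy as np

def random_shooting_mpc(predict, reward, state, horizon=10, n_cand=256, seed=0):
    """Return the first action of the best sampled action sequence under a model."""
    rng = np.random.default_rng(seed)
    best_ret, best_a0 = -np.inf, None
    for _ in range(n_cand):
        s, ret = state, 0.0
        seq = rng.uniform(-1.0, 1.0, size=horizon)  # candidate action sequence
        for a in seq:
            s = predict(s, a)
            ret += reward(s, a)
        if ret > best_ret:
            best_ret, best_a0 = ret, seq[0]
    return best_a0

# Toy usage: 1-D integrator dynamics, reward for staying near the origin.
print(random_shooting_mpc(lambda s, a: s + 0.1 * a, lambda s, a: -s ** 2, state=1.0))
```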
arXiv Detail & Related papers (2023-02-24T12:40:28Z) - Revisiting State Augmentation methods for Reinforcement Learning with
Stochastic Delays [10.484851004093919]
This paper formally describes the notion of Markov Decision Processes (MDPs) with delays.
We show that delayed MDPs can be transformed into equivalent standard MDPs (without delays) with a significantly simplified cost structure.
We employ this equivalence to derive a model-free Delay-Resolved RL framework and show that even a simple RL algorithm built upon this framework achieves near-optimal rewards in environments with delays in actions and observations.
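The classic constant-delay version of this augmentation pairs the last observed state with the actions taken since it was observed; on the augmented state the process is Markov again, so a standard RL algorithm applies. A sketch of that construction (not the paper's code):

```python
from collections import deque

class DelayResolvedAgentState:
    """Augmented state for a delay-d MDP: (last observed state, d pending actions)."""

    def __init__(self, delay, initial_obs, noop):
        self.pending = deque([noop] * delay, maxlen=delay)
        self.obs = initial_obs

    def augmented(self):
        return (self.obs, tuple(self.pending))

    def update(self, new_action, delayed_obs):
        self.pending.append(new_action)  # the oldest pending action drops out
        self.obs = delayed_obs           # the new obs reflects that old action
        return self.augmented()
```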
arXiv Detail & Related papers (2021-08-17T10:45:55Z) - Critical Parameters for Scalable Distributed Learning with Large Batches
and Asynchronous Updates [67.19481956584465]
It has been experimentally observed that the efficiency of distributed training with stochastic gradient descent (SGD) depends decisively on the batch size and, in asynchronous implementations, on the gradient staleness.
We show that our results are tight and illustrate key findings in numerical experiments.
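A toy simulation of asynchronous SGD with stale gradients follows; the staleness-damped step size is a common heuristic, not necessarily the rule analyzed in the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
target = np.array([1.0, 2.0, 3.0])
theta, base_lr = np.zeros(3), 0.1
iterates = [theta.copy()]

for t in range(1000):
    # An async worker returns a gradient computed at a stale iterate.
    staleness = int(rng.integers(0, min(8, t) + 1))
    grad = iterates[-(staleness + 1)] - target  # grad of 0.5*||theta - target||^2
    # Common staleness-aware heuristic: damp the step for staler gradients.
    theta = theta - (base_lr / (1 + staleness)) * grad
    iterates.append(theta.copy())

print(theta)  # converges to target; larger batches would reduce gradient noise
```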
arXiv Detail & Related papers (2021-03-03T12:08:23Z) - Intrinsic Temporal Regularization for High-resolution Human Video
Synthesis [59.54483950973432]
Temporal consistency is crucial for extending image processing pipelines to the video domain.
We propose an effective intrinsic temporal regularization scheme, where an intrinsic confidence map is estimated via the frame generator to regulate motion estimation.
We apply our intrinsic temporal regularization to a single-image generator, leading to a powerful "INTERnet" capable of generating $512\times512$ resolution human action videos.
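A generic confidence-weighted temporal-consistency loss is sketched below; `warp` (motion compensation) and `confidence` are placeholders for the paper's motion estimation and intrinsic confidence map:

```python
import numpy as np

def temporal_consistency_loss(frame_t, frame_t1, warp, confidence):
    """Penalize differences between the motion-warped previous frame and the
    current frame, downweighted where the confidence map is low."""
    residual = warp(frame_t) - frame_t1
    return float(np.mean(confidence * residual ** 2))

# Toy usage: identity warp, uniform confidence.
f0, f1 = np.zeros((4, 4)), 0.1 * np.ones((4, 4))
print(temporal_consistency_loss(f0, f1, lambda x: x, np.ones((4, 4))))
```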
arXiv Detail & Related papers (2020-12-11T05:29:45Z) - Non-Stationary Delayed Bandits with Intermediate Observations [10.538264213183076]
Online recommender systems often face long delays in receiving feedback, especially when optimizing for some long-term metrics.
We introduce the problem of non-stationary, delayed bandits with intermediate observations.
We develop an efficient algorithm based on UCRL, and prove sublinear regret guarantees for its performance.
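The paper's algorithm is UCRL-based for the non-stationary setting; to make the delayed-feedback mechanic concrete, here is only a plain UCB1 agent whose rewards arrive a fixed number of rounds late:

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(0)
means, delay = np.array([0.3, 0.5, 0.7]), 5
counts, sums = np.ones(len(means)), np.zeros(len(means))  # counts=1 avoids /0
inbox = deque()  # (arrival_time, arm, reward) pending feedback

for t in range(1, 3001):
    # Incorporate feedback whose delay has elapsed.
    while inbox and inbox[0][0] <= t:
        _, a, r = inbox.popleft()
        counts[a] += 1
        sums[a] += r
    # Standard UCB1 index on the observed (delayed) statistics.
    ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
    arm = int(np.argmax(ucb))
    inbox.append((t + delay, arm, rng.binomial(1, means[arm])))

print(counts / counts.sum())  # the best arm (index 2) should dominate despite delay
```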
arXiv Detail & Related papers (2020-06-03T09:27:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.