AWD3: Dynamic Reduction of the Estimation Bias
- URL: http://arxiv.org/abs/2111.06780v1
- Date: Fri, 12 Nov 2021 15:46:19 GMT
- Title: AWD3: Dynamic Reduction of the Estimation Bias
- Authors: Dogan C. Cicek, Enes Duran, Baturay Saglam, Kagan Kaya, Furkan B.
Mutlu, Suleyman S. Kozat
- Abstract summary: We introduce a technique that eliminates the estimation bias in off-policy continuous control algorithms using the experience replay mechanism.
We show through continuous control environments of OpenAI Gym that our algorithm matches or outperforms the state-of-the-art off-policy policy gradient learning algorithms.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Value-based deep Reinforcement Learning (RL) algorithms suffer from the
estimation bias primarily caused by function approximation and temporal
difference (TD) learning. This problem induces faulty state-action value
estimates and therefore harms the performance and robustness of the learning
algorithms. Although several techniques have been proposed to tackle this
problem, learning algorithms still suffer from this bias. Here, we introduce a
technique that
eliminates the estimation bias in off-policy continuous control algorithms
using the experience replay mechanism. We adaptively learn the weighting
hyper-parameter beta in the Weighted Twin Delayed Deep Deterministic Policy
Gradient algorithm. Our method is named Adaptive-WD3 (AWD3). We show through
continuous control environments of OpenAI Gym that our algorithm matches or
outperforms the state-of-the-art off-policy policy gradient learning
algorithms.
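
For context, the sketch below illustrates the beta-weighted TD target used by the Weighted TD3 (WD3) family that the abstract refers to; it is a minimal, PyTorch-style example, the function and variable names are illustrative assumptions rather than the authors' implementation, and the rule AWD3 uses to adapt beta is not shown.

import torch

# Minimal sketch of a beta-weighted TD target built from twin critic estimates.
# Names (q1_target, q2_target, not_done, ...) are illustrative assumptions.
def weighted_td_target(reward, not_done, q1_target, q2_target, beta, gamma=0.99):
    """Compute a beta-weighted TD target from twin critic estimates."""
    q_min = torch.min(q1_target, q2_target)   # pessimistic estimate (as in TD3)
    q_max = torch.max(q1_target, q2_target)   # optimistic estimate
    # beta close to 1 behaves like TD3's clipped double-Q target (underestimation);
    # beta close to 0 behaves like a single, overestimating critic.
    q_weighted = beta * q_min + (1.0 - beta) * q_max
    return reward + not_done * gamma * q_weighted

With beta = 1 this reduces to TD3's clipped double-Q target; WD3 keeps beta as a fixed hyper-parameter, while AWD3, as stated in the abstract, learns it adaptively during training.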
Related papers
- Backstepping Temporal Difference Learning [3.5823366350053325]
We propose a new convergent algorithm for off-policy TD-learning.
Our method relies on the backstepping technique, which is widely used in nonlinear control theory.
Convergence of the proposed algorithm is experimentally verified in environments where standard TD-learning is known to be unstable.
arXiv Detail & Related papers (2023-02-20T10:06:49Z)
- Off-Policy Deep Reinforcement Learning Algorithms for Handling Various Robotic Manipulator Tasks [0.0]
In this study, three reinforcement learning algorithms (DDPG, TD3, and SAC) have been used to train the Fetch robotic manipulator on four different tasks.
All of these algorithms are off-policy and able to achieve their desired target by optimizing both policy and value functions.
arXiv Detail & Related papers (2022-12-11T18:25:24Z)
- Adaptively Calibrated Critic Estimates for Deep Reinforcement Learning [36.643572071860554]
We propose a general method called Adaptively Calibrated Critics (ACC), which uses the most recent high-variance but unbiased on-policy rollouts to alleviate the bias of the low-variance temporal difference targets.
We show that ACC is quite general by further applying it to TD3 and showing an improved performance also in this setting.
arXiv Detail & Related papers (2021-11-24T18:07:33Z)
- Emphatic Algorithms for Deep Reinforcement Learning [43.17171330951343]
Temporal difference learning algorithms can become unstable when combined with function approximation and off-policy sampling.
The emphatic temporal difference (ETD($\lambda$)) algorithm ensures convergence in the linear case by appropriately weighting the TD($\lambda$) updates.
We show that naively adapting ETD($\lambda$) to popular deep reinforcement learning algorithms, which use forward view multi-step returns, results in poor performance.
arXiv Detail & Related papers (2021-06-21T12:11:39Z)
- An Empirical Comparison of Off-policy Prediction Learning Algorithms on the Collision Task [9.207173776826403]
Off-policy prediction -- learning the value function for one policy from data generated while following another policy -- is one of the most challenging subproblems in reinforcement learning.
This paper presents empirical results with eleven prominent off-policy learning algorithms that use linear function approximation (see the illustrative sketch after this list).
arXiv Detail & Related papers (2021-06-02T03:45:43Z)
- Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement learning based zeroth-order algorithm (ZO-RL) that learns the sampling policy for generating the perturbations in ZO optimization instead of using random sampling.
Our results show that ZO-RL can effectively reduce the variance of the ZO gradient estimate by learning a sampling policy, and converges faster than existing ZO algorithms in different scenarios.
arXiv Detail & Related papers (2021-04-09T14:50:59Z)
- Evolving Reinforcement Learning Algorithms [186.62294652057062]
We propose a method for meta-learning reinforcement learning algorithms.
The learned algorithms are domain-agnostic and can generalize to new environments not seen during training.
We highlight two learned algorithms which obtain good generalization performance over other classical control tasks, gridworld type tasks, and Atari games.
arXiv Detail & Related papers (2021-01-08T18:55:07Z)
- Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
Bootstrapping is necessary and, along with off-policy learning and function approximation, results in the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2021-01-08T00:43:04Z)
- Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Off-policy algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z)
- Discovering Reinforcement Learning Algorithms [53.72358280495428]
Reinforcement learning algorithms update an agent's parameters according to one of several possible rules.
This paper introduces a new meta-learning approach that discovers an entire update rule.
It includes both 'what to predict' (e.g. value functions) and 'how to learn from it' by interacting with a set of environments.
arXiv Detail & Related papers (2020-07-17T07:38:39Z)
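
As a side note on the off-policy prediction entry above, the sketch below shows the simplest textbook instance of that setting: linear semi-gradient TD(0) with per-step importance sampling. It is illustrative only, not taken from the cited comparison, and all names (phi_s, pi_prob, mu_prob, ...) are assumptions for this example.

import numpy as np

# Illustrative sketch: one off-policy TD(0) update with a linear value
# function v(s) = w . phi(s). Not the algorithms evaluated in the cited paper.
def off_policy_td0_step(w, phi_s, phi_s_next, reward, pi_prob, mu_prob,
                        alpha=0.01, gamma=0.99):
    # Per-step importance-sampling ratio: corrects for the action having been
    # chosen by the behaviour policy mu rather than the target policy pi.
    rho = pi_prob / mu_prob
    td_error = reward + gamma * np.dot(w, phi_s_next) - np.dot(w, phi_s)
    return w + alpha * rho * td_error * phi_s   # semi-gradient update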
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.