Parameter-Free Deterministic Reduction of the Estimation Bias in
Continuous Control
- URL: http://arxiv.org/abs/2109.11788v1
- Date: Fri, 24 Sep 2021 07:41:07 GMT
- Title: Parameter-Free Deterministic Reduction of the Estimation Bias in
Continuous Control
- Authors: Baturay Saglam, Enes Duran, Dogan C. Cicek, Furkan B. Mutlu, Suleyman
S. Kozat
- Abstract summary: We introduce a parameter-free, novel deep Q-learning variant that reduces the underestimation bias in continuous control.
We test the performance of our improvement on a set of MuJoCo and Box2D continuous control tasks.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Approximation of the value functions in value-based deep reinforcement
learning systems induces overestimation bias, resulting in suboptimal policies.
We show that when the reinforcement signals received by the agents have a high
variance, deep actor-critic approaches that overcome the overestimation bias
lead to a substantial underestimation bias. We introduce a parameter-free,
novel deep Q-learning variant to reduce this underestimation bias for
continuous control. By computing the critic objective as a fixed-weight linear
combination of the approximate critic functions, our Q-value update rule
integrates the concepts of Clipped Double Q-learning and Maxmin Q-learning. We
test our improvement on a set of MuJoCo and Box2D continuous control tasks and
find that it improves the state of the art and outperforms the baseline
algorithms in the majority of the environments.
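The exact update rule is not spelled out in this summary, so the following is only a minimal sketch, assuming a TD3-style setting with two critics: the bootstrap value is a fixed-weight linear combination of the pessimistic clipped-min estimate and a less pessimistic ensemble average. The weight `w`, the function name, and the precise form of the combination are illustrative assumptions, not the paper's published rule.

```python
# Minimal sketch (PyTorch) of a critic target built as a fixed-weight linear
# combination of two approximate critics. The weights and the clipped-min /
# average split below are illustrative assumptions, not the exact rule from
# the paper.
import torch

def critic_target(q1_next, q2_next, reward, not_done, gamma=0.99, w=0.75):
    """q1_next, q2_next: target-critic estimates of Q_i(s', pi(s')), shape (B, 1)."""
    clipped = torch.min(q1_next, q2_next)        # Clipped Double Q term (pessimistic)
    average = 0.5 * (q1_next + q2_next)          # less pessimistic ensemble mean
    q_next = w * clipped + (1.0 - w) * average   # fixed-weight linear combination
    return reward + not_done * gamma * q_next    # one-step TD target
```

Both critics would then regress toward this shared target, as in standard TD3-style updates.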
Related papers
- A Perspective of Q-value Estimation on Offline-to-Online Reinforcement
Learning [54.48409201256968]
Offline-to-online Reinforcement Learning (O2O RL) aims to improve the performance of an offline pretrained policy using only a few online samples.
Most O2O methods focus on the balance between the RL objective and pessimism, or on the utilization of offline and online samples.
arXiv Detail & Related papers (2023-12-12T19:24:35Z)
- Simultaneous Double Q-learning with Conservative Advantage Learning for Actor-Critic Methods [133.85604983925282]
We propose Simultaneous Double Q-learning with Conservative Advantage Learning (SDQ-CAL).
Our algorithm realizes less biased value estimation and achieves state-of-the-art performance in a range of continuous control benchmark tasks.
arXiv Detail & Related papers (2022-05-08T09:17:16Z)
- Temporal-Difference Value Estimation via Uncertainty-Guided Soft Updates [110.92598350897192]
Q-Learning has proven effective at learning a policy to perform control tasks.
However, estimation noise becomes a bias after the max operator in the policy improvement step.
We present Unbiased Soft Q-Learning (UQL), which extends the work of EQL from two-action, finite-state spaces to multi-action, infinite-state Markov Decision Processes.
arXiv Detail & Related papers (2021-10-28T00:07:19Z)
- Automating Control of Overestimation Bias for Continuous Reinforcement Learning [65.63607016094305]
We present a data-driven approach for guiding bias correction.
We demonstrate its effectiveness on Truncated Quantile Critics, a state-of-the-art continuous control algorithm.
arXiv Detail & Related papers (2021-10-26T09:27:12Z) - On the Estimation Bias in Double Q-Learning [20.856485777692594]
Double Q-learning is not fully unbiased and suffers from underestimation bias.
We show that such underestimation bias may lead to multiple non-optimal fixed points under an approximated Bellman operator.
We propose a simple but effective approach as a partial fix for the underestimation bias in double Q-learning; a brief numerical illustration of this bias follows this entry.
arXiv Detail & Related papers (2021-09-29T13:41:24Z)
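The entry above, like the UQL entry further up, rests on a simple statistical effect: a max over noisy value estimates is biased upward, while the element-wise min of two independent estimates, as used in Clipped Double Q-learning, is biased downward. The following is a small, self-contained numerical illustration of that effect; it is not taken from any of the papers listed here.

```python
# Illustrative check (NumPy) of estimation bias under zero-mean noise: the max
# over noisy Q-estimates overestimates, and the element-wise min of two
# independent noisy estimates underestimates, even though each individual
# estimate is unbiased.
import numpy as np

rng = np.random.default_rng(0)
n_actions, noise_std, samples = 5, 1.0, 100_000
true_q = np.zeros(n_actions)                 # all true Q-values are 0

q1 = true_q + rng.normal(0.0, noise_std, size=(samples, n_actions))
q2 = true_q + rng.normal(0.0, noise_std, size=(samples, n_actions))

print("E[max_a Q1(s, a)]    =", q1.max(axis=1).mean())        # about +1.16 (overestimation)
print("E[min(Q1, Q2)(s, a)] =", np.minimum(q1, q2).mean())    # about -0.56 (underestimation)
```

Both gaps grow with the noise standard deviation, which is consistent with the abstract's observation that high-variance reinforcement signals make the underestimation of clipped approaches substantial.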
- Estimation Error Correction in Deep Reinforcement Learning for Deterministic Actor-Critic Methods [0.0]
In value-based deep reinforcement learning methods, approximation of value functions induces overestimation bias and leads to suboptimal policies.
We show that in deep actor-critic methods that aim to overcome the overestimation bias, if the reinforcement signals received by the agent have a high variance, a significant underestimation bias arises.
To minimize the underestimation, we introduce a parameter-free, novel deep Q-learning variant.
arXiv Detail & Related papers (2021-09-22T13:49:35Z)
- Cross Learning in Deep Q-Networks [82.20059754270302]
We propose a novel cross Q-learning algorithm aimed at alleviating the well-known overestimation problem in value-based reinforcement learning methods.
Our algorithm builds on double Q-learning by maintaining a set of parallel models and estimating the Q-value based on a randomly selected network; a minimal sketch of this scheme follows this entry.
arXiv Detail & Related papers (2020-09-29T04:58:17Z)
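A minimal, tabular sketch of the random-selection idea described in the entry above, assuming K parallel Q-estimates that bootstrap from a randomly chosen member of the ensemble; the variable names and the tabular setting are illustrative, not the authors' implementation.

```python
# Keep K parallel Q-estimates and bootstrap each update from a randomly chosen
# one. Tabular for brevity; network details are omitted and everything here is
# illustrative rather than the published implementation.
import numpy as np

rng = np.random.default_rng(0)
K, n_states, n_actions = 4, 10, 3
Q = np.zeros((K, n_states, n_actions))        # K parallel Q-tables

def cross_q_update(k, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """Update table k using a bootstrap value from a randomly selected table."""
    j = rng.integers(K)                       # randomly selected estimator for the target
    target = r + (0.0 if done else gamma * Q[j, s_next].max())
    Q[k, s, a] += alpha * (target - Q[k, s, a])

cross_q_update(k=0, s=0, a=1, r=1.0, s_next=2, done=False)   # example transition
```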
- Decorrelated Double Q-learning [4.982806898121435]
We introduce decorrelated double Q-learning (D2Q) to reduce the correlation between value function approximators.
Experimental results on a suite of MuJoCo continuous control tasks demonstrate that our decorrelated double Q-learning effectively improves performance.
arXiv Detail & Related papers (2020-06-12T05:59:05Z)
- Controlling Overestimation Bias with Truncated Mixture of Continuous Distributional Quantile Critics [65.51757376525798]
Overestimation bias is one of the major impediments to accurate off-policy learning.
This paper investigates a novel way to alleviate the overestimation bias in a continuous control setting.
Our method, Truncated Quantile Critics (TQC), blends three ideas: a distributional representation of the critic, truncation of the critics' predictions, and ensembling of multiple critics; a brief sketch of the truncation step follows this entry.
arXiv Detail & Related papers (2020-05-08T19:52:26Z)
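A brief sketch of the truncation step described in the entry above: pool the quantile predictions of all critics, sort them, and drop the largest atoms. In the full method the kept atoms typically serve as a target distribution for quantile regression; here only a scalar mean is computed for brevity, and the shapes and the number of dropped atoms per critic are illustrative assumptions.

```python
# Truncation plus ensembling over distributional critics: pool the quantile
# atoms of all critics, sort them, drop the largest ones, and use the mean of
# the kept atoms as a (simplified) target value. Illustrative only; shapes and
# drop counts are assumptions.
import torch

def truncated_value(quantiles: torch.Tensor, drop_per_critic: int = 2) -> torch.Tensor:
    """quantiles: (batch, n_critics, n_quantiles) atoms for the next state-action."""
    batch, n_critics, n_quantiles = quantiles.shape
    pooled, _ = quantiles.reshape(batch, -1).sort(dim=1)   # pool and sort all atoms
    keep = n_critics * (n_quantiles - drop_per_critic)     # truncate the top atoms
    return pooled[:, :keep].mean(dim=1, keepdim=True)      # mean of the kept atoms

target = truncated_value(torch.randn(32, 5, 25))           # example: 5 critics, 25 atoms each
```

Dropping the top atoms makes the target more conservative in a controlled way, which is the mechanism the entry credits for controlling overestimation bias.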
- Distributional Soft Actor-Critic: Off-Policy Reinforcement Learning for Addressing Value Estimation Errors [13.534873779043478]
We present a distributional soft actor-critic (DSAC) algorithm to improve the policy performance by mitigating Q-value overestimations.
We evaluate DSAC on the suite of MuJoCo continuous control tasks, achieving state-of-the-art performance.
arXiv Detail & Related papers (2020-01-09T02:27:18Z)