WD3: Taming the Estimation Bias in Deep Reinforcement Learning
- URL: http://arxiv.org/abs/2006.12622v2
- Date: Sat, 4 Nov 2023 12:58:32 GMT
- Title: WD3: Taming the Estimation Bias in Deep Reinforcement Learning
- Authors: Qiang He, Xinwen Hou
- Abstract summary: We show that the TD3 algorithm introduces underestimation bias under mild assumptions.
We propose a novel algorithm, Weighted Delayed Deep Deterministic Policy Gradient (WD3), which can eliminate the estimation bias.
- Score: 7.29018671106362
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The overestimation phenomenon caused by function approximation is a well-known issue in value-based reinforcement learning algorithms such as deep Q-networks and DDPG, and it can lead to suboptimal policies. To address this issue, TD3 takes the minimum value between a pair of critics. In this paper, we show that the TD3 algorithm introduces underestimation bias under mild assumptions. To obtain a more precise estimate of the value function, we unify these two opposites and propose a novel algorithm, Weighted Delayed Deep Deterministic Policy Gradient (WD3), which eliminates the estimation bias and further improves performance by weighting a pair of critics. To demonstrate the effectiveness of WD3, we compare the learning process of the value function across DDPG, TD3, and WD3. The results verify that our algorithm does eliminate the estimation error of the value function. Furthermore, we evaluate our algorithm on continuous control tasks. We observe that on each test task, WD3 consistently outperforms, or at the very least matches, the state-of-the-art algorithms. Our code is available at https://sites.google.com/view/ictai20-wd3/.
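To make the weighting idea concrete, below is a minimal PyTorch-style sketch of a target computation that blends a pair of target critics. The blend of the pessimistic minimum and the plain average shown here is one plausible instantiation of the weighting described in the abstract; the exact scheme and hyperparameter values are defined in the paper, and the function and network names used here (e.g. `critic1_target`) are illustrative assumptions rather than the authors' code.

```python
import torch

def weighted_critic_target(reward, not_done, next_state,
                           actor_target, critic1_target, critic2_target,
                           gamma=0.99, beta=0.75,
                           policy_noise=0.2, noise_clip=0.5, max_action=1.0):
    """Sketch of a weighted target between two critics (WD3-style).

    beta = 1 recovers TD3's minimum over the two critics (prone to
    underestimation); beta = 0 uses their plain average (prone to
    overestimation); intermediate values trade the two biases off.
    """
    with torch.no_grad():
        # Target policy smoothing, as in TD3: perturb the target action with clipped noise.
        next_action = actor_target(next_state)
        noise = (torch.randn_like(next_action) * policy_noise).clamp(-noise_clip, noise_clip)
        next_action = (next_action + noise).clamp(-max_action, max_action)

        q1 = critic1_target(next_state, next_action)
        q2 = critic2_target(next_state, next_action)

        # Weighted combination of the pessimistic (min) and average estimates.
        q_next = beta * torch.min(q1, q2) + (1.0 - beta) * 0.5 * (q1 + q2)

        # Standard bootstrapped TD target.
        return reward + not_done * gamma * q_next
```

Both critics are then regressed toward this single target, as in TD3; only the way the two target estimates are combined changes.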
Related papers
- Nearly Optimal Algorithms for Contextual Dueling Bandits from Adversarial Feedback [58.66941279460248]
Learning from human feedback plays an important role in aligning generative models such as large language models (LLMs).
We study a model within this problem domain--contextual dueling bandits with adversarial feedback, where the true preference label can be flipped by an adversary.
We propose an algorithm, robust contextual dueling bandit (algo), based on uncertainty-weighted maximum likelihood estimation.
arXiv Detail & Related papers (2024-04-16T17:59:55Z)
- Scalable 3D Registration via Truncated Entry-wise Absolute Residuals [65.04922801371363]
A 3D registration approach can process more than ten million (10^7) point pairs with over 99% random outliers.
We call our method TEAR, as it involves minimizing an outlier-robust loss that computes Truncated Entry-wise Absolute Residuals.
arXiv Detail & Related papers (2024-04-01T04:43:39Z)
- NeRF-Det++: Incorporating Semantic Cues and Perspective-aware Depth Supervision for Indoor Multi-View 3D Detection [72.0098999512727]
NeRF-Det has achieved impressive performance in indoor multi-view 3D detection by utilizing NeRF to enhance representation learning.
We present three corresponding solutions, including semantic enhancement, perspective-aware sampling, and ordinal depth supervision.
The resulting algorithm, NeRF-Det++, has exhibited appealing performance on the ScanNetV2 and ARKitScenes datasets.
arXiv Detail & Related papers (2024-02-22T11:48:06Z)
- Rethinking PGD Attack: Is Sign Function Necessary? [131.6894310945647]
We present a theoretical analysis of how such a sign-based update algorithm influences step-wise attack performance.
We propose a new raw gradient descent (RGD) algorithm that eliminates the use of the sign function.
The effectiveness of the proposed RGD algorithm has been demonstrated extensively in experiments.
arXiv Detail & Related papers (2023-12-03T02:26:58Z)
- Asynchronous Training Schemes in Distributed Learning with Time Delay [17.259708772713164]
In the context of distributed deep learning, the issue of stale weights or gradients could result in poor algorithmic performance.
In this paper, we propose a different approach to tackle the issue of stale weights or gradients.
One practical variant of PC-ASGD is also proposed, adopting a condition that helps determine the tradeoff parameter.
arXiv Detail & Related papers (2022-08-28T07:14:59Z)
- Value Activation for Bias Alleviation: Generalized-activated Deep Double Deterministic Policy Gradients [11.545991873249564]
It is vital to accurately estimate the value function in Deep Reinforcement Learning (DRL).
Existing actor-critic methods suffer more or less from underestimation bias or overestimation bias.
We propose a generalized-activated weighting operator that uses any non-decreasing function, termed an activation function, as weights for better value estimation.
arXiv Detail & Related papers (2021-12-21T13:45:40Z)
- AWD3: Dynamic Reduction of the Estimation Bias [0.0]
We introduce a technique that eliminates the estimation bias in off-policy continuous control algorithms using the experience replay mechanism.
We show through continuous control environments of OpenAI Gym that our algorithm matches or outperforms the state-of-the-art off-policy policy gradient learning algorithms.
arXiv Detail & Related papers (2021-11-12T15:46:19Z)
- An Empirical Comparison of Off-policy Prediction Learning Algorithms on the Collision Task [9.207173776826403]
Off-policy prediction -- learning the value function for one policy from data generated while following another policy -- is one of the most challenging subproblems in reinforcement learning.
This paper presents empirical results with eleven prominent off-policy learning algorithms that use linear function approximation.
arXiv Detail & Related papers (2021-06-02T03:45:43Z)
- Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
Bootstrapping is necessary and, along with off-policy learning and function approximation, results in the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2021-01-08T00:43:04Z)
- Softmax Deep Double Deterministic Policy Gradients [37.23518654230526]
We propose to use the Boltzmann softmax operator for value function estimation in continuous control (a minimal sketch of this operator is given after the list below).
We also design two new algorithms, Softmax Deep Deterministic Policy Gradients (SD2) and Softmax Deep Double Deterministic Policy Gradients (SD3), by building the softmax operator upon single and double estimators.
arXiv Detail & Related papers (2020-10-19T02:52:00Z)
- Wasserstein Distances for Stereo Disparity Estimation [62.09272563885437]
Existing approaches to depth or disparity estimation output a distribution over a set of pre-defined discrete values.
This leads to inaccurate results when the true depth or disparity does not match any of these values.
We address these issues using a new neural network architecture that is capable of outputting arbitrary depth values.
arXiv Detail & Related papers (2020-07-06T21:37:50Z)
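As referenced in the Softmax Deep Double Deterministic Policy Gradients entry above, here is a minimal sketch of the Boltzmann softmax operator applied to a batch of sampled action values. The full SD3 estimator also samples actions around the target policy and applies importance weighting, which is omitted here; the function name and tensor shape are illustrative assumptions.

```python
import torch

def boltzmann_softmax_value(q_values, beta=1.0):
    """Boltzmann softmax operator: sum_a softmax(beta * Q)(a) * Q(a).

    q_values: tensor of shape (batch, num_sampled_actions) holding Q-estimates
    for a set of candidate actions per state. As beta grows the operator
    approaches the max over actions; as beta -> 0 it approaches the mean,
    so beta controls how optimistic the resulting value estimate is.
    """
    weights = torch.softmax(beta * q_values, dim=-1)  # normalised exp(beta * Q) over actions
    return (weights * q_values).sum(dim=-1)           # softmax-weighted value per state
```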