Efficient Continuous Control with Double Actors and Regularized Critics
- URL: http://arxiv.org/abs/2106.03050v1
- Date: Sun, 6 Jun 2021 07:04:48 GMT
- Title: Efficient Continuous Control with Double Actors and Regularized Critics
- Authors: Jiafei Lyu, Xiaoteng Ma, Jiangpeng Yan, Xiu Li
- Abstract summary: We explore the potential of double actors, which has been neglected for a long time, for better value function estimation in the continuous setting.
We build double actors upon a single critic and upon double critics to handle the overestimation bias in DDPG and the underestimation bias in TD3, respectively.
To mitigate the uncertainty of value estimates from double critics, we propose to regularize the critic networks under the double-actor architecture.
- Score: 7.072664211491016
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How to obtain good value estimation is one of the key problems in
Reinforcement Learning (RL). Current value estimation methods, such as DDPG and
TD3, suffer from unnecessary over- or underestimation bias. In this paper, we
explore the potential of double actors, which has been neglected for a long
time, for better value function estimation in the continuous setting. First, we
uncover and demonstrate the bias alleviation property of double actors by
building double actors upon single critic and double critics to handle
overestimation bias in DDPG and underestimation bias in TD3 respectively. Next,
we interestingly find that double actors help improve the exploration ability
of the agent. Finally, to mitigate the uncertainty of value estimates from
double critics, we further propose to regularize the critic networks under
the double-actor architecture, which gives rise to the Double Actors Regularized
Critics (DARC) algorithm. Extensive experimental results on challenging
continuous control tasks show that DARC significantly outperforms
state-of-the-art methods with higher sample efficiency.
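To make the construction concrete, below is a minimal PyTorch-style sketch of a double-actor, double-critic target with a critic-regularization term, in the spirit of the abstract. It is not the authors' reference implementation: the particular combination rule (max over actors of the per-actor min over target critics), the regularization weight `nu`, and all network and variable names are illustrative assumptions.

```python
# Hypothetical sketch of a double-actor, double-critic target with critic
# regularization.  Network classes, hyper-parameters (gamma, nu) and the exact
# combination rule are assumptions made for illustration only.
import copy
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Tiny MLP used for both actors (output: action) and critics (output: Q-value)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))
    def forward(self, x):
        return self.net(x)

obs_dim, act_dim, gamma, nu = 8, 2, 0.99, 0.1   # nu: critic-regularization weight (assumed)

actors    = [MLP(obs_dim, act_dim) for _ in range(2)]        # double actors (training omitted)
critics   = [MLP(obs_dim + act_dim, 1) for _ in range(2)]    # double critics
t_actors  = [copy.deepcopy(a) for a in actors]               # target networks
t_critics = [copy.deepcopy(c) for c in critics]

def q(critic, s, a):
    return critic(torch.cat([s, a], dim=-1))

def critic_loss(s, a, r, s2, done):
    with torch.no_grad():
        # Each actor proposes its own next action.
        next_actions = [ta(s2) for ta in t_actors]
        # For each proposed action, take the pessimistic (min) estimate of the two
        # target critics, then combine across actors.  Taking the max over actors
        # counteracts the underestimation of the plain min; this exact rule is an
        # assumption for the sketch, not necessarily the paper's formula.
        per_actor = [torch.min(q(t_critics[0], s2, a2), q(t_critics[1], s2, a2))
                     for a2 in next_actions]
        target = r + gamma * (1.0 - done) * torch.max(per_actor[0], per_actor[1])
    q1, q2 = q(critics[0], s, a), q(critics[1], s, a)
    td = ((q1 - target) ** 2 + (q2 - target) ** 2).mean()
    # Critic regularization: discourage the two critics from drifting apart,
    # reducing the uncertainty of the combined value estimate.
    reg = ((q1 - q2) ** 2).mean()
    return td + nu * reg

# Toy usage on a random batch.
B = 32
s, a = torch.randn(B, obs_dim), torch.randn(B, act_dim)
r, s2, done = torch.randn(B, 1), torch.randn(B, obs_dim), torch.zeros(B, 1)
print(critic_loss(s, a, r, s2, done))
```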
Related papers
- Perturbation-Invariant Adversarial Training for Neural Ranking Models: Improving the Effectiveness-Robustness Trade-Off [107.35833747750446]
Adversarial examples can be crafted by adding imperceptible perturbations to legitimate documents.
This vulnerability raises significant concerns about the reliability of neural ranking models (NRMs) and hinders their widespread deployment.
In this study, we establish theoretical guarantees regarding the effectiveness-robustness trade-off in NRMs.
arXiv Detail & Related papers (2023-12-16T05:38:39Z)
- Mitigating Estimation Errors by Twin TD-Regularized Actor and Critic for Deep Reinforcement Learning [10.577516871906816]
We introduce a new, twin TD-regularized actor-critic (TDR) method to address the issue of estimation bias in deep reinforcement learning (DRL).
We show that our new actor-critic learning has enabled DRL methods to outperform their respective baselines in challenging environments in the DeepMind Control Suite.
arXiv Detail & Related papers (2023-11-07T04:30:51Z)
- Ladder-of-Thought: Using Knowledge as Steps to Elevate Stance Detection [73.31406286956535]
We introduce the Ladder-of-Thought (LoT) for the stance detection task.
LoT directs the small LMs to assimilate high-quality external knowledge, refining the intermediate rationales produced.
Our empirical evaluations underscore LoT's efficacy, marking a 16% improvement over GPT-3.5 and a 10% enhancement compared to GPT-3.5 with CoT on the stance detection task.
arXiv Detail & Related papers (2023-08-31T14:31:48Z)
- Actor-Director-Critic: A Novel Deep Reinforcement Learning Framework [2.6477113498726244]
We propose actor-director-critic, a new framework for deep reinforcement learning.
For each of the two critic networks used, we design two target critic networks instead of one.
To verify the performance of the actor-director-critic framework and the improved double estimator method, we apply them to the TD3 algorithm.
arXiv Detail & Related papers (2023-01-10T10:21:32Z)
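The summary states that each critic is given two target networks, but not how the two targets are differentiated or combined. The sketch below only illustrates hypothetical bookkeeping for that idea: the distinct Polyak rates and the min-of-averages target are placeholder assumptions, and all names are invented.

```python
# Hypothetical sketch of keeping two target networks per critic (four target
# critics in total).  The differentiation (distinct Polyak rates) and the
# combination rule (average per critic, then min across critics) are
# placeholder assumptions, not the paper's improved double estimator.
import copy
import torch
import torch.nn as nn

def make_critic(obs_dim=8, act_dim=2):
    return nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))

critics = [make_critic(), make_critic()]
targets = [[copy.deepcopy(c) for _ in range(2)] for c in critics]  # two targets per critic
taus = (0.005, 0.05)  # assumed: each target copy tracks its critic at a different speed

@torch.no_grad()
def soft_update():
    for c, pair in zip(critics, targets):
        for tgt, tau in zip(pair, taus):
            for p, tp in zip(c.parameters(), tgt.parameters()):
                tp.mul_(1 - tau).add_(tau * p)

@torch.no_grad()
def target_value(s2, a2, r, gamma=0.99):
    # Placeholder combination: average each critic's two targets, then take the
    # min across critics (TD3-style pessimism).
    est = [0.5 * (pair[0](torch.cat([s2, a2], -1)) + pair[1](torch.cat([s2, a2], -1)))
           for pair in targets]
    return r + gamma * torch.min(est[0], est[1])

s2, a2, r = torch.randn(4, 8), torch.randn(4, 2), torch.randn(4, 1)
soft_update()
print(target_value(s2, a2, r).shape)   # torch.Size([4, 1])
```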
- Simultaneous Double Q-learning with Conservative Advantage Learning for Actor-Critic Methods [133.85604983925282]
We propose Simultaneous Double Q-learning with Conservative Advantage Learning (SDQ-CAL).
Our algorithm realizes less biased value estimation and achieves state-of-the-art performance in a range of continuous control benchmark tasks.
arXiv Detail & Related papers (2022-05-08T09:17:16Z)
- Doubly Robust Distributionally Robust Off-Policy Evaluation and Learning [59.02006924867438]
Off-policy evaluation and learning (OPE/L) use offline observational data to make better decisions.
Recent work proposed distributionally robust OPE/L (DROPE/L) to remedy this, but the proposal relies on inverse-propensity weighting.
We propose the first DR algorithms for DROPE/L with KL-divergence uncertainty sets.
arXiv Detail & Related papers (2022-02-19T20:00:44Z)
- Value Activation for Bias Alleviation: Generalized-activated Deep Double Deterministic Policy Gradients [11.545991873249564]
It is vital to accurately estimate the value function in Deep Reinforcement Learning (DRL).
Existing actor-critic methods suffer, to varying degrees, from underestimation or overestimation bias.
We propose a generalized-activated weighting operator that uses any non-decreasing function, namely an activation function, as weights for better value estimation.
arXiv Detail & Related papers (2021-12-21T13:45:40Z)
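As a rough illustration of weighting value estimates with a non-decreasing activation function, here is a small hedged sketch. Exactly which estimates the paper weights and which activation it recommends are not given in this summary, so the candidate-estimate framing and the choice of `g` below are assumptions.

```python
# Hypothetical sketch of generalized-activated weighting: weights come from a
# non-decreasing function g applied to the value estimates themselves, then
# normalized into a convex combination.  What exactly gets weighted (here: a
# set of candidate Q-estimates per state) is an assumption for illustration.
import torch

def generalized_activated_value(q_candidates: torch.Tensor, g=torch.exp) -> torch.Tensor:
    """q_candidates: [batch, n_candidates] Q-estimates; g: any non-decreasing function."""
    w = g(q_candidates)                      # larger estimates receive larger weight
    w = w / w.sum(dim=-1, keepdim=True)      # normalize to a convex combination
    return (w * q_candidates).sum(dim=-1, keepdim=True)

q = torch.tensor([[1.0, 2.0, 3.0]])
print(generalized_activated_value(q))                      # lies between the mean and the max
print(generalized_activated_value(q, g=torch.ones_like))   # a constant g recovers the plain mean
```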
- Parameter-Free Deterministic Reduction of the Estimation Bias in Continuous Control [0.0]
We introduce a novel, parameter-free deep Q-learning variant to reduce this underestimation bias for continuous control.
We test the performance of our improvement on a set of MuJoCo and Box2D continuous control tasks.
arXiv Detail & Related papers (2021-09-24T07:41:07Z)
- Doubly Robust Off-Policy Actor-Critic: Convergence and Optimality [131.45028999325797]
We develop a doubly robust off-policy AC (DR-Off-PAC) for discounted MDP.
DR-Off-PAC adopts a single timescale structure, in which both actor and critics are updated simultaneously with constant stepsize.
We study the finite-time convergence rate and characterize the sample complexity for DR-Off-PAC to attain an $\epsilon$-accurate optimal policy.
arXiv Detail & Related papers (2021-02-23T18:56:13Z)
- WD3: Taming the Estimation Bias in Deep Reinforcement Learning [7.29018671106362]
We show that the TD3 algorithm introduces underestimation bias under mild assumptions.
We propose a novel algorithm, Weighted Delayed Deep Deterministic Policy Gradient (WD3), which can eliminate the estimation bias.
arXiv Detail & Related papers (2020-06-18T01:28:07Z)
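The summary does not give the exact weighting, so the following sketch only guesses at the general shape of such a correction: a convex combination of the pessimistic (min) and optimistic (max) twin-critic estimates with a hypothetical coefficient `beta`. It should not be read as WD3's actual update rule.

```python
# Hypothetical sketch of a weighted twin-critic target that interpolates between
# the pessimistic min (TD3-style) and the optimistic max (DDPG-like) estimate,
# controlled by a coefficient beta.  Both beta and this combination are
# assumptions; the abstract above does not specify WD3's exact form.
import torch

def weighted_twin_target(q1: torch.Tensor, q2: torch.Tensor, beta: float = 0.75) -> torch.Tensor:
    pessimistic = torch.min(q1, q2)   # counters overestimation, may underestimate
    optimistic  = torch.max(q1, q2)   # counters underestimation, may overestimate
    return beta * pessimistic + (1.0 - beta) * optimistic

q1 = torch.tensor([[1.0], [2.0]])
q2 = torch.tensor([[1.5], [1.0]])
print(weighted_twin_target(q1, q2))   # tensor([[1.1250], [1.2500]])
```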
- Controlling Overestimation Bias with Truncated Mixture of Continuous Distributional Quantile Critics [65.51757376525798]
Overestimation bias is one of the major impediments to accurate off-policy learning.
This paper investigates a novel way to alleviate the overestimation bias in a continuous control setting.
Our method, Truncated Quantile Critics (TQC), blends three ideas: distributional representation of a critic, truncation of critics' predictions, and ensembling of multiple critics (a toy sketch of the truncation step follows below).
arXiv Detail & Related papers (2020-05-08T19:52:26Z)
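A toy sketch of the truncation idea mentioned above: each distributional critic predicts a set of quantile "atoms", all atoms are pooled and sorted, and the largest ones are dropped before forming the target. The ensemble size, atom count, and number of dropped atoms below are illustrative choices, not the paper's settings.

```python
# Toy sketch of truncated quantile critics: pool the quantile "atoms" predicted
# by an ensemble of distributional critics, sort them, and drop the largest ones
# (the part most responsible for overestimation) before forming the target.
import torch

def truncated_target(atoms: torch.Tensor, drop_per_critic: int = 2) -> torch.Tensor:
    """atoms: [batch, n_critics, n_atoms] quantile estimates from the ensemble."""
    batch, n_critics, n_atoms = atoms.shape
    pooled = atoms.reshape(batch, n_critics * n_atoms)
    pooled, _ = torch.sort(pooled, dim=-1)            # ascending order
    keep = n_critics * (n_atoms - drop_per_critic)    # drop the largest atoms
    truncated = pooled[:, :keep]
    # Averaged here only to give one scalar value estimate; the paper instead
    # keeps the truncated atoms as a target distribution for a quantile loss.
    return truncated.mean(dim=-1, keepdim=True)

atoms = torch.randn(4, 3, 25)          # batch of 4, 3 critics, 25 atoms each
print(truncated_target(atoms).shape)   # torch.Size([4, 1])
```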