Softmax Deep Double Deterministic Policy Gradients
- URL: http://arxiv.org/abs/2010.09177v1
- Date: Mon, 19 Oct 2020 02:52:00 GMT
- Title: Softmax Deep Double Deterministic Policy Gradients
- Authors: Ling Pan, Qingpeng Cai, Longbo Huang
- Abstract summary: We propose to use the Boltzmann softmax operator for value function estimation in continuous control.
We also design two new algorithms, Softmax Deep Deterministic Policy Gradients (SD2) and Softmax Deep Double Deterministic Policy Gradients (SD3), by building the softmax operator upon single and double estimators.
- Score: 37.23518654230526
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A widely-used actor-critic reinforcement learning algorithm for continuous
control, Deep Deterministic Policy Gradients (DDPG), suffers from the
overestimation problem, which can negatively affect the performance. Although
the state-of-the-art Twin Delayed Deep Deterministic Policy Gradient (TD3)
algorithm mitigates the overestimation issue, it can lead to a large
underestimation bias. In this paper, we propose to use the Boltzmann softmax
operator for value function estimation in continuous control. We first
theoretically analyze the softmax operator in continuous action space. Then, we
uncover an important property of the softmax operator in actor-critic
algorithms, i.e., it helps to smooth the optimization landscape, which sheds
new light on the benefits of the operator. We also design two new algorithms,
Softmax Deep Deterministic Policy Gradients (SD2) and Softmax Deep Double
Deterministic Policy Gradients (SD3), by building the softmax operator upon
single and double estimators, which can effectively mitigate the overestimation
and underestimation bias, respectively. We conduct extensive experiments on challenging
continuous control tasks, and results show that SD3 outperforms
state-of-the-art methods.
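To make the core idea concrete, below is a minimal NumPy sketch of the Boltzmann softmax value estimate in a continuous action space, softmax_beta(Q)(s) = E[exp(beta*Q(s,a)) * Q(s,a)] / E[exp(beta*Q(s,a))], approximated by Monte-Carlo sampling over the action box. The toy critic, uniform sampling scheme, and hyperparameters are illustrative assumptions, not the authors' SD2/SD3 implementation (which builds the operator on learned critics and corrects for non-uniform action proposals with importance weights).

```python
import numpy as np

def boltzmann_softmax_value(q_fn, state, action_low, action_high,
                            beta=1.0, num_samples=64, rng=None):
    """Monte-Carlo estimate of softmax_beta(Q)(s): sample actions uniformly
    from the action box (so importance weights cancel), then take the
    softmax(beta * Q)-weighted average of the sampled Q-values."""
    rng = np.random.default_rng() if rng is None else rng
    actions = rng.uniform(action_low, action_high,
                          size=(num_samples, len(action_low)))
    q_values = np.array([q_fn(state, a) for a in actions])
    logits = beta * q_values
    weights = np.exp(logits - logits.max())   # numerically stable softmax
    weights /= weights.sum()
    return float(np.dot(weights, q_values))

# Toy critic peaked at a = 0.3 in a 1-D action space.
toy_q = lambda s, a: -float(np.sum((a - 0.3) ** 2))
value = boltzmann_softmax_value(toy_q, state=None,
                                action_low=np.array([-1.0]),
                                action_high=np.array([1.0]),
                                beta=10.0)
print(value)   # approaches max_a Q(s, a) = 0 as beta increases
```

As beta grows the estimate approaches the greedy maximum over actions (the source of overestimation), and as beta shrinks it approaches a plain average; SD2 and SD3 place this operator on top of single and double critics, respectively.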
Related papers
- Bridging Discrete and Backpropagation: Straight-Through and Beyond [62.46558842476455]
We propose a novel approach to approximate the gradient of parameters involved in generating discrete latent variables.
We propose ReinMax, which achieves second-order accuracy by integrating Heun's method, a second-order numerical method for solving ODEs.
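For context, a minimal PyTorch sketch of the baseline straight-through estimator that ReinMax improves on is shown below: the forward pass uses a hard one-hot sample, while the backward pass routes gradients through the softmax probabilities. This is the generic first-order baseline, not ReinMax itself, which refines the backward approximation to second-order accuracy via Heun's method; the function name and toy loss are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def straight_through_sample(logits):
    """Baseline straight-through estimator: hard one-hot sample in the
    forward pass, gradient of the softmax probabilities in the backward pass."""
    probs = F.softmax(logits, dim=-1)
    index = torch.multinomial(probs, num_samples=1).squeeze(-1)
    hard = F.one_hot(index, num_classes=logits.shape[-1]).float()
    # Numerically equal to `hard`; gradients flow through `probs`.
    return hard + probs - probs.detach()

logits = torch.randn(4, 10, requires_grad=True)
class_values = torch.linspace(0.0, 1.0, 10)           # arbitrary downstream computation
loss = (straight_through_sample(logits) * class_values).sum()
loss.backward()
print(logits.grad.abs().sum() > 0)                    # tensor(True): gradients reach the logits
```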
arXiv Detail & Related papers (2023-04-17T20:59:49Z)
- Value Activation for Bias Alleviation: Generalized-activated Deep Double Deterministic Policy Gradients [11.545991873249564]
It is vital to accurately estimate the value function in Deep Reinforcement Learning (DRL).
Existing actor-critic methods suffer more or less from underestimation bias or overestimation bias.
We propose a generalized-activated weighting operator that uses any non-decreasing function, namely an activation function, as weights for better value estimation.
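Read literally, this operator replaces the exponential in a Boltzmann softmax with an arbitrary non-decreasing activation g, so the value estimate becomes a g(Q)-weighted average of the candidate Q-values (softmax is recovered when g is the exponential, and steeper g pushes the estimate toward the maximum). The NumPy sketch below is only one plausible reading; the function name and the example activations are assumptions, not the paper's implementation.

```python
import numpy as np

def generalized_activated_value(q_values, activation):
    """Weighted value estimate: weights come from a non-decreasing,
    non-negative 'activation' of the Q-values, normalized to sum to one."""
    q = np.asarray(q_values, dtype=float)
    w = activation(q)
    return float(np.dot(w / w.sum(), q))

q = [0.5, 1.0, 2.0, 1.5]
print(generalized_activated_value(q, np.exp))                         # Boltzmann softmax weighting
print(generalized_activated_value(q, lambda x: np.exp(5 * x)))        # steeper activation -> closer to max(q)
print(generalized_activated_value(q, lambda x: x - x.min() + 1e-8))   # gentle, nearly linear weighting
```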
arXiv Detail & Related papers (2021-12-21T13:45:40Z)
- AWD3: Dynamic Reduction of the Estimation Bias [0.0]
We introduce a technique that eliminates the estimation bias in off-policy continuous control algorithms using the experience replay mechanism.
We show, on continuous control environments from OpenAI Gym, that our algorithm matches or outperforms state-of-the-art off-policy policy gradient learning algorithms.
arXiv Detail & Related papers (2021-11-12T15:46:19Z)
- Minimax Optimization with Smooth Algorithmic Adversaries [59.47122537182611]
We propose a new algorithm for the min-player to play against smooth algorithms deployed by an adversary.
Our algorithm is guaranteed to make monotonic progress, avoiding limit cycles, and to find an appropriate stationary point in a polynomial number of iterations.
arXiv Detail & Related papers (2021-06-02T22:03:36Z)
- Correcting Momentum with Second-order Information [50.992629498861724]
We develop a new algorithm for non-convex stochastic optimization that finds an $\epsilon$-critical point using an optimal number of stochastic gradient and Hessian-vector product computations.
We validate our results on a variety of large-scale deep learning benchmarks and architectures.
arXiv Detail & Related papers (2021-03-04T19:01:20Z)
- Stabilizing Q Learning Via Soft Mellowmax Operator [12.208344427928466]
Mellowmax is a recently proposed differentiable and non-expansive softmax operator that allows convergent behavior in learning and planning.
We show that our Soft Mellowmax (SM2) operator can be applied to challenging multi-agent reinforcement learning scenarios, leading to stable value function approximation and state-of-the-art performance.
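For reference, the standard mellowmax operator is mm_omega(x) = (1/omega) * log((1/n) * sum_i exp(omega * x_i)), which interpolates between the mean (omega -> 0) and the maximum (omega -> inf) while remaining a non-expansion. The NumPy sketch below implements that standard definition with a numerically stable log-sum-exp; the Soft Mellowmax (SM2) variant proposed in the paper is not reproduced here.

```python
import numpy as np

def mellowmax(x, omega=5.0):
    """mm_omega(x) = (1/omega) * log((1/n) * sum_i exp(omega * x_i)),
    computed with a max-shifted log-sum-exp for numerical stability."""
    x = np.asarray(x, dtype=float)
    z = omega * x
    m = z.max()
    return (m + np.log(np.exp(z - m).mean())) / omega

x = np.array([0.1, 0.5, 2.0])
print(mellowmax(x, omega=0.01))   # close to the mean of x
print(mellowmax(x, omega=50.0))   # close to max(x)
```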
arXiv Detail & Related papers (2020-12-17T09:11:13Z)
- Large-Scale Methods for Distributionally Robust Optimization [53.98643772533416]
We prove that our algorithms require a number of gradient evaluations independent of the training set size and number of parameters.
Experiments on MNIST and ImageNet confirm the theoretical scaling of our algorithms, which are 9--36 times more efficient than full-batch methods.
arXiv Detail & Related papers (2020-10-12T17:41:44Z)
- Taming GANs with Lookahead-Minmax [63.90038365274479]
Experimental results on MNIST, SVHN, CIFAR-10, and ImageNet demonstrate a clear advantage of combining Lookahead-minmax with Adam or extragradient.
Using 30-fold fewer parameters and 16-fold smaller minibatches, we outperform the reported performance of the class-dependent BigGAN on CIFAR-10, obtaining an FID of 12.19 without using class labels.
arXiv Detail & Related papers (2020-06-25T17:13:23Z)
- WD3: Taming the Estimation Bias in Deep Reinforcement Learning [7.29018671106362]
We show that the TD3 algorithm introduces underestimation bias under mild assumptions.
We propose a novel algorithm, Weighted Delayed Deep Deterministic Policy Gradient (WD3), which can eliminate the estimation bias.
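One way to read the WD3 idea is as softening TD3's hard minimum over the two target critics. The NumPy sketch below shows a plausible weighted target, a convex combination of the minimum and the average of the two critic estimates controlled by a coefficient beta; the exact weighting used by WD3 may differ, so treat this purely as an illustration of interpolating between a pessimistic (min) and a less pessimistic (average) target.

```python
import numpy as np

def weighted_double_q_target(q1, q2, reward, discount, beta=0.75):
    """Illustrative weighted double-critic target (assumed form, not
    necessarily WD3's exact rule):
    beta * min(q1, q2) + (1 - beta) * mean(q1, q2),
    so beta = 1 recovers TD3's pessimistic minimum and beta = 0 the plain average."""
    q_weighted = beta * np.minimum(q1, q2) + (1.0 - beta) * 0.5 * (q1 + q2)
    return reward + discount * q_weighted

# Toy next-state estimates from two critics.
print(weighted_double_q_target(q1=1.2, q2=0.8, reward=0.1, discount=0.99))
```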
arXiv Detail & Related papers (2020-06-18T01:28:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.