Adaptively Calibrated Critic Estimates for Deep Reinforcement Learning
- URL: http://arxiv.org/abs/2111.12673v1
- Date: Wed, 24 Nov 2021 18:07:33 GMT
- Title: Adaptively Calibrated Critic Estimates for Deep Reinforcement Learning
- Authors: Nicolai Dorka, Joschka Boedecker, Wolfram Burgard
- Abstract summary: We propose a general method called Adaptively Calibrated Critics (ACC).
ACC uses the most recent high variance but unbiased on-policy rollouts to alleviate the bias of the low variance temporal difference targets.
We show that ACC is quite general by further applying it to TD3, where it also improves performance.
- Score: 36.643572071860554
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate value estimates are important for off-policy reinforcement learning.
Algorithms based on temporal difference learning are typically prone to an
over- or underestimation bias that builds up over time. In this paper, we propose
a general method called Adaptively Calibrated Critics (ACC) that uses the most
recent high variance but unbiased on-policy rollouts to alleviate the bias of
the low variance temporal difference targets. We apply ACC to Truncated
Quantile Critics, which is an algorithm for continuous control that allows
regulation of the bias with a hyperparameter tuned per environment. The
resulting algorithm adaptively adjusts the parameter during training, rendering
hyperparameter search unnecessary, and sets a new state of the art on the OpenAI
gym continuous control benchmark among all algorithms that do not tune
hyperparameters for each environment. Additionally, we demonstrate that ACC is
quite general by further applying it to TD3, where it also improves performance.
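To make the calibration idea concrete, below is a minimal sketch of how a bias-control parameter could be nudged from on-policy rollouts. This is an illustration under assumptions, not the authors' implementation: the class name, the parameter `beta` (standing in for something like TQC's number of dropped target quantiles), and all default values are hypothetical.

```python
import numpy as np

class AdaptiveBiasCalibrator:
    """Illustrative sketch of an ACC-style calibration step (hypothetical names and values)."""

    def __init__(self, beta_init=5.0, beta_min=0.0, beta_max=10.0, step_size=0.1):
        # beta stands in for a bias-control knob such as the number of dropped
        # target quantiles in TQC; all defaults here are assumptions.
        self.beta = beta_init
        self.beta_min = beta_min
        self.beta_max = beta_max
        self.step_size = step_size

    def update(self, mc_returns, critic_estimates):
        """Adjust beta using unbiased but high-variance on-policy returns.

        mc_returns: Monte Carlo returns from recent on-policy rollouts.
        critic_estimates: critic values for the same state-action pairs.
        """
        error = float(np.mean(critic_estimates) - np.mean(mc_returns))
        # Overestimation (positive error) -> more pessimism (larger beta);
        # underestimation -> less pessimism (smaller beta).
        self.beta = float(np.clip(self.beta + self.step_size * np.sign(error),
                                  self.beta_min, self.beta_max))
        return self.beta
```

In a training loop, such an update would presumably be invoked only when fresh on-policy rollouts are available, so the cheap, low-variance temporal difference targets still drive the bulk of the critic updates.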
Related papers
- ReLU to the Rescue: Improve Your On-Policy Actor-Critic with Positive Advantages [37.12048108122337]
This paper proposes a step toward approximate Bayesian inference in on-policy actor-critic deep reinforcement learning.
It is implemented through three changes to the Asynchronous Advantage Actor-Critic (A3C) algorithm.
arXiv Detail & Related papers (2023-06-02T11:37:22Z)
- Actor-Critic based Improper Reinforcement Learning [61.430513757337486]
We consider an improper reinforcement learning setting where a learner is given $M$ base controllers for an unknown Markov decision process.
We propose two algorithms: (1) a Policy Gradient-based approach; and (2) an algorithm that can switch between a simple Actor-Critic scheme and a Natural Actor-Critic scheme.
arXiv Detail & Related papers (2022-07-19T05:55:02Z)
- Efficient and Differentiable Conformal Prediction with General Function Classes [96.74055810115456]
We propose a generalization of conformal prediction to multiple learnable parameters.
We show that it achieves approximately valid population coverage and near-optimal efficiency within the given function class.
Experiments show that our algorithm is able to learn valid prediction sets and significantly improve efficiency.
arXiv Detail & Related papers (2022-02-22T18:37:23Z)
- AWD3: Dynamic Reduction of the Estimation Bias [0.0]
We introduce a technique that eliminates the estimation bias in off-policy continuous control algorithms using the experience replay mechanism.
We show on continuous control environments from OpenAI gym that our algorithm matches or outperforms state-of-the-art off-policy policy gradient learning algorithms.
arXiv Detail & Related papers (2021-11-12T15:46:19Z)
- Automating Control of Overestimation Bias for Continuous Reinforcement Learning [65.63607016094305]
We present a data-driven approach for guiding bias correction.
We demonstrate its effectiveness on Truncated Quantile Critics, a state-of-the-art continuous control algorithm.
arXiv Detail & Related papers (2021-10-26T09:27:12Z)
- Doubly Robust Off-Policy Actor-Critic: Convergence and Optimality [131.45028999325797]
We develop a doubly robust off-policy actor-critic algorithm (DR-Off-PAC) for discounted MDPs.
DR-Off-PAC adopts a single-timescale structure, in which both the actor and the critics are updated simultaneously with a constant step size.
We study the finite-time convergence rate and characterize the sample complexity for DR-Off-PAC to attain an $\epsilon$-accurate optimal policy.
arXiv Detail & Related papers (2021-02-23T18:56:13Z)
- Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem).
AdaRem adjusts the parameter-wise learning rate according to whether the direction of a parameter's past changes is aligned with the direction of the current gradient.
Our method outperforms previous adaptive learning-rate algorithms in terms of training speed and test error.
arXiv Detail & Related papers (2020-10-21T14:49:00Z)
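As a rough illustration of that last idea (not the AdaRem paper's actual update rule), a per-parameter learning rate can be modulated by the sign agreement between an average of past updates and the current gradient; every name and constant in the sketch below is an assumption.

```python
import numpy as np

def sign_agreement_step(param, grad, avg_update, lr=1e-3, momentum=0.9, scale=0.5):
    """Hypothetical sketch: scale each parameter's learning rate by how well
    the current gradient direction agrees with an average of past updates."""
    agreement = np.sign(avg_update) * np.sign(grad)   # +1 aligned, -1 opposed, elementwise
    per_param_lr = lr * (1.0 + scale * agreement)     # assumed modulation of the base rate
    update = -per_param_lr * grad                     # plain gradient-descent step
    new_avg = momentum * avg_update + (1.0 - momentum) * update
    return param + update, new_avg
```

With `avg_update` initialized to zeros, the sign agreement is zero at the start, so the step reduces to ordinary gradient descent at the base rate until a history of updates has accumulated.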
This list is automatically generated from the titles and abstracts of the papers on this site.