Ensemble Bootstrapping for Q-Learning
- URL: http://arxiv.org/abs/2103.00445v1
- Date: Sun, 28 Feb 2021 10:19:47 GMT
- Title: Ensemble Bootstrapping for Q-Learning
- Authors: Oren Peer, Chen Tessler, Nadav Merlis, Ron Meir
- Abstract summary: We introduce a new bias-reduced algorithm called Ensemble Bootstrapped Q-Learning (EBQL).
EBQL-like updates yield lower MSE when estimating the maximal mean of a set of independent random variables.
We show that there exist domains where both over- and under-estimation result in sub-optimal performance.
- Score: 15.07549655582389
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Q-learning (QL), a common reinforcement learning algorithm, suffers from
over-estimation bias due to the maximization term in the optimal Bellman
operator. This bias may lead to sub-optimal behavior. Double-Q-learning tackles
this issue by utilizing two estimators, yet results in an under-estimation
bias. Similar to over-estimation in Q-learning, in certain scenarios, the
under-estimation bias may degrade performance. In this work, we introduce a new
bias-reduced algorithm called Ensemble Bootstrapped Q-Learning (EBQL), a
natural extension of Double-Q-learning to ensembles. We analyze our method both
theoretically and empirically. Theoretically, we prove that EBQL-like updates
yield lower MSE when estimating the maximal mean of a set of independent random
variables. Empirically, we show that there exist domains where both over- and
under-estimation result in sub-optimal performance. Finally, we demonstrate the
superior performance of a deep RL variant of EBQL over other deep QL algorithms
for a suite of ATARI games.
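The bias argument in the abstract can be illustrated with a small simulation of estimating the maximal mean of several independent random variables. The sketch below is not taken from the paper: the function names, the fold-splitting scheme, and the Gaussian test setup are all assumptions made for illustration. It assumes the ensemble estimator picks the best index using one subset of the samples and evaluates that index on the remaining subsets, so that two subsets give a Double-Q-style estimator.

```python
import random
from statistics import fmean


def single_estimate(samples):
    """Max of the per-variable sample means (Q-learning style selection)."""
    return max(fmean(s) for s in samples)


def split_estimate(samples, k):
    """Hypothetical ensemble-style estimator (assumed form, not the paper's code).

    Each variable's samples are split into k folds. Fold 0 selects the
    index with the largest mean; the remaining k-1 folds evaluate it.
    With k=2 this is a Double-Q-style estimator.
    """
    folds = [[s[i::k] for i in range(k)] for s in samples]
    best = max(range(len(samples)), key=lambda a: fmean(folds[a][0]))
    held_out = [x for fold in folds[best][1:] for x in fold]
    return fmean(held_out)


def average_bias(estimator, means, n_samples=20, trials=2000, seed=0):
    """Average of (estimate - true maximal mean) over repeated draws."""
    rng = random.Random(seed)
    true_max = max(means)
    total = 0.0
    for _ in range(trials):
        # Unit-variance Gaussian samples around each true mean.
        samples = [[rng.gauss(m, 1.0) for _ in range(n_samples)] for m in means]
        total += estimator(samples) - true_max
    return total / trials


if __name__ == "__main__":
    means = [0.0] * 10
    print("single   :", average_bias(single_estimate, means))
    print("split k=2:", average_bias(lambda s: split_estimate(s, 2), means))
    print("split k=5:", average_bias(lambda s: split_estimate(s, 5), means))
```

With equal true means the single estimator comes out above the true maximum on average, while the split estimators evaluate the selected index on held-out samples and so cannot over-estimate in expectation. When the means differ, selecting a non-maximal index pulls the split estimators below the true maximum, which is the under-estimation the abstract refers to. Using more folds leaves more samples for the evaluation step.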
Related papers
- Regularized Q-learning through Robust Averaging [3.4354636842203026]
We propose a new Q-learning variant, called 2RA Q-learning, that addresses some weaknesses of existing Q-learning methods in a principled manner.
One such weakness is an underlying estimation bias which cannot be controlled and often results in poor performance.
We show that 2RA Q-learning converges to the optimal policy and analyze its theoretical mean-squared error.
arXiv Detail & Related papers (2024-05-03T15:57:26Z)
- Unifying (Quantum) Statistical and Parametrized (Quantum) Algorithms [65.268245109828]
We take inspiration from Kearns' SQ oracle and Valiant's weak evaluation oracle.
We introduce an extensive yet intuitive framework that yields unconditional lower bounds for learning from evaluation queries.
arXiv Detail & Related papers (2023-10-26T18:23:21Z)
- Simultaneous Double Q-learning with Conservative Advantage Learning for Actor-Critic Methods [133.85604983925282]
We propose Simultaneous Double Q-learning with Conservative Advantage Learning (SDQ-CAL).
Our algorithm realizes less biased value estimation and achieves state-of-the-art performance in a range of continuous control benchmark tasks.
arXiv Detail & Related papers (2022-05-08T09:17:16Z)
- Balanced Q-learning: Combining the Influence of Optimistic and Pessimistic Targets [74.04426767769785]
We show that specific types of biases may be preferable, depending on the scenario.
We design a novel reinforcement learning algorithm, Balanced Q-learning, in which the target is modified to be a convex combination of a pessimistic and an optimistic term.
arXiv Detail & Related papers (2021-11-03T07:30:19Z)
- Online Target Q-learning with Reverse Experience Replay: Efficiently finding the Optimal Policy for Linear MDPs [50.75812033462294]
We bridge the gap between the practical success of Q-learning and pessimistic theoretical results.
We present two novel methods, Q-Rex and Q-RexDaRe.
We show that Q-Rex efficiently finds the optimal policy for linear MDPs.
arXiv Detail & Related papers (2021-10-16T01:47:41Z)
- On the Estimation Bias in Double Q-Learning [20.856485777692594]
Double Q-learning is not fully unbiased and suffers from underestimation bias.
We show that such underestimation bias may lead to multiple non-optimal fixed points under an approximated Bellman operator.
We propose a simple but effective approach as a partial fix for the underestimation bias in double Q-learning.
arXiv Detail & Related papers (2021-09-29T13:41:24Z)
- Self-correcting Q-Learning [14.178899938667161]
We introduce a new way to address the bias in the form of a "self-correcting algorithm".
Applying this strategy to Q-learning results in Self-correcting Q-learning.
We show theoretically that this new algorithm enjoys the same convergence guarantees as Q-learning while being more accurate.
arXiv Detail & Related papers (2020-12-02T11:36:24Z)
- Finite-Time Analysis for Double Q-learning [50.50058000948908]
We provide the first non-asymptotic, finite-time analysis for double Q-learning.
We show that both synchronous and asynchronous double Q-learning are guaranteed to converge to an $\epsilon$-accurate neighborhood of the global optimum.
arXiv Detail & Related papers (2020-09-29T18:48:21Z)
- Cross Learning in Deep Q-Networks [82.20059754270302]
We propose a novel cross Q-learning algorithm, aimed at alleviating the well-known overestimation problem in value-based reinforcement learning methods.
Our algorithm builds on double Q-learning by maintaining a set of parallel models and estimating the Q-value based on a randomly selected network.
arXiv Detail & Related papers (2020-09-29T04:58:17Z)
- Maxmin Q-learning: Controlling the Estimation Bias of Q-learning [31.742397178618624]
Overestimation bias affects Q-learning because it approximates the maximum action value using the maximum estimated action value.
We propose a generalization of Q-learning, called Maxmin Q-learning, which provides a parameter to flexibly control bias.
We empirically verify that our algorithm better controls estimation bias in toy environments, and that it achieves superior performance on several benchmark problems.
arXiv Detail & Related papers (2020-02-16T02:02:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.