Related papers: Ensemble Bootstrapping for Q-Learning

Ensemble Bootstrapping for Q-Learning

URL: http://arxiv.org/abs/2103.00445v1
Date: Sun, 28 Feb 2021 10:19:47 GMT
Title: Ensemble Bootstrapping for Q-Learning
Authors: Oren Peer, Chen Tessler, Nadav Merlis, Ron Meir
Abstract summary: We introduce a new bias-reduced algorithm called Ensemble Bootstrapped Q-Learning (EBQL) EBQL-like updates yield lower MSE when estimating the maximal mean of a set of independent random variables. We show that there exist domains where both over and under-estimation result in sub-optimal performance.
Score: 15.07549655582389
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Q-learning (QL), a common reinforcement learning algorithm, suffers from over-estimation bias due to the maximization term in the optimal Bellman operator. This bias may lead to sub-optimal behavior. Double-Q-learning tackles this issue by utilizing two estimators, yet results in an under-estimation bias. Similar to over-estimation in Q-learning, in certain scenarios, the under-estimation bias may degrade performance. In this work, we introduce a new bias-reduced algorithm called Ensemble Bootstrapped Q-Learning (EBQL), a natural extension of Double-Q-learning to ensembles. We analyze our method both theoretically and empirically. Theoretically, we prove that EBQL-like updates yield lower MSE when estimating the maximal mean of a set of independent random variables. Empirically, we show that there exist domains where both over and under-estimation result in sub-optimal performance. Finally, We demonstrate the superior performance of a deep RL variant of EBQL over other deep QL algorithms for a suite of ATARI games.

Related papers

HAVER: Instance-Dependent Error Bounds for Maximum Mean Estimation and Applications to Q-Learning and Monte Carlo Tree Search [9.811007096988755]
We study the problem of estimating the emphvalue of the largest mean among K distributions via samples from them. We propose a novel algorithm called HAVER and analyze its mean squared error.
arXiv Detail & Related papers (2024-11-01T07:05:11Z)
Regularized Q-learning through Robust Averaging [3.4354636842203026]
We propose a new Q-learning variant, called 2RA Q-learning, that addresses some weaknesses of existing Q-learning methods in a principled manner. One such weakness is an underlying estimation bias which cannot be controlled and often results in poor performance. We show that 2RA Q-learning converges to the optimal policy and analyze its theoretical mean-squared error.
arXiv Detail & Related papers (2024-05-03T15:57:26Z)
Unifying (Quantum) Statistical and Parametrized (Quantum) Algorithms [65.268245109828]
We take inspiration from Kearns' SQ oracle and Valiant's weak evaluation oracle. We introduce an extensive yet intuitive framework that yields unconditional lower bounds for learning from evaluation queries.
arXiv Detail & Related papers (2023-10-26T18:23:21Z)
Simultaneous Double Q-learning with Conservative Advantage Learning for Actor-Critic Methods [133.85604983925282]
We propose Simultaneous Double Q-learning with Conservative Advantage Learning (SDQ-CAL) Our algorithm realizes less biased value estimation and achieves state-of-the-art performance in a range of continuous control benchmark tasks.
arXiv Detail & Related papers (2022-05-08T09:17:16Z)
Balanced Q-learning: Combining the Influence of Optimistic and Pessimistic Targets [74.04426767769785]
We show that specific types of biases may be preferable, depending on the scenario. We design a novel reinforcement learning algorithm, Balanced Q-learning, in which the target is modified to be a convex combination of a pessimistic and an optimistic term.
arXiv Detail & Related papers (2021-11-03T07:30:19Z)
Online Target Q-learning with Reverse Experience Replay: Efficiently finding the Optimal Policy for Linear MDPs [50.75812033462294]
We bridge the gap between practical success of Q-learning and pessimistic theoretical results. We present novel methods Q-Rex and Q-RexDaRe. We show that Q-Rex efficiently finds the optimal policy for linear MDPs.
arXiv Detail & Related papers (2021-10-16T01:47:41Z)
On the Estimation Bias in Double Q-Learning [20.856485777692594]
Double Q-learning is not fully unbiased and suffers from underestimation bias. We show that such underestimation bias may lead to multiple non-optimal fixed points under an approximated Bellman operator. We propose a simple but effective approach as a partial fix for the underestimation bias in double Q-learning.
arXiv Detail & Related papers (2021-09-29T13:41:24Z)
Self-correcting Q-Learning [14.178899938667161]
We introduce a new way to address the bias in the form of a "self-correcting algorithm" Applying this strategy to Q-learning results in Self-correcting Q-learning. We show theoretically that this new algorithm enjoys the same convergence guarantees as Q-learning while being more accurate.
arXiv Detail & Related papers (2020-12-02T11:36:24Z)
Finite-Time Analysis for Double Q-learning [50.50058000948908]
We provide the first non-asymptotic, finite-time analysis for double Q-learning. We show that both synchronous and asynchronous double Q-learning are guaranteed to converge to an $epsilon$-accurate neighborhood of the global optimum.
arXiv Detail & Related papers (2020-09-29T18:48:21Z)
Cross Learning in Deep Q-Networks [82.20059754270302]
We propose a novel cross Q-learning algorithm, aim at alleviating the well-known overestimation problem in value-based reinforcement learning methods. Our algorithm builds on double Q-learning, by maintaining a set of parallel models and estimate the Q-value based on a randomly selected network.
arXiv Detail & Related papers (2020-09-29T04:58:17Z)
Maxmin Q-learning: Controlling the Estimation Bias of Q-learning [31.742397178618624]
Overestimation bias affects Q-learning because it approximates the maximum action value using the maximum estimated action value. We propose a generalization of Q-learning, called emphMaxmin Q-learning, which provides a parameter to flexibly control bias. We empirically verify that our algorithm better controls estimation bias in toy environments, and that it achieves superior performance on several benchmark problems.
arXiv Detail & Related papers (2020-02-16T02:02:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.