Finite-Time Analysis for Double Q-learning
- URL: http://arxiv.org/abs/2009.14257v2
- Date: Mon, 12 Oct 2020 14:13:15 GMT
- Title: Finite-Time Analysis for Double Q-learning
- Authors: Huaqing Xiong, Lin Zhao, Yingbin Liang, Wei Zhang
- Abstract summary: We provide the first non-asymptotic, finite-time analysis for double Q-learning.
We show that both synchronous and asynchronous double Q-learning are guaranteed to converge to an $\epsilon$-accurate neighborhood of the global optimum.
- Score: 50.50058000948908
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although Q-learning is one of the most successful algorithms for finding the
best action-value function (and thus the optimal policy) in reinforcement
learning, its implementation often suffers from large overestimation of
Q-function values incurred by random sampling. The double Q-learning algorithm
proposed by van Hasselt (2010) overcomes such an overestimation issue by
randomly switching the update between two Q-estimators, and has thus gained
significant popularity in practice. However, the theoretical understanding of
double Q-learning is rather limited. So far only the asymptotic convergence has
been established, which does not characterize how fast the algorithm converges.
In this paper, we provide the first non-asymptotic (i.e., finite-time) analysis
for double Q-learning. We show that both synchronous and asynchronous double
Q-learning are guaranteed to converge to an $\epsilon$-accurate neighborhood of
the global optimum by taking $\tilde{\Omega}\left(\left(
\frac{1}{(1-\gamma)^6\epsilon^2}\right)^{\frac{1}{\omega}}
+\left(\frac{1}{1-\gamma}\right)^{\frac{1}{1-\omega}}\right)$ iterations, where
$\omega\in(0,1)$ is the decay parameter of the learning rate, and $\gamma$ is
the discount factor. Our analysis develops novel techniques to derive
finite-time bounds on the difference between two inter-connected stochastic
processes, which is new to the literature of stochastic approximation.
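To make the update rule concrete, below is a minimal sketch of tabular double Q-learning with a polynomially decaying learning rate $\alpha_t = 1/t^\omega$, $\omega\in(0,1)$, matching the decay parameter in the bound above. The environment interface (env.reset/env.step), the epsilon-greedy behavior policy, and all hyperparameter values are illustrative assumptions, not the exact protocol analyzed in the paper.

```python
# A minimal sketch of tabular double Q-learning (van Hasselt, 2010), written to
# illustrate the "random switching between two Q-estimators" described above.
# The env interface and hyperparameters below are assumptions for illustration.
import numpy as np

def double_q_learning(env, num_states, num_actions,
                      gamma=0.99, omega=0.8, exploration=0.1,
                      num_iterations=100_000, seed=0):
    rng = np.random.default_rng(seed)
    q_a = np.zeros((num_states, num_actions))  # first Q-estimator
    q_b = np.zeros((num_states, num_actions))  # second Q-estimator

    state = env.reset()
    for t in range(1, num_iterations + 1):
        # Behavior policy: epsilon-greedy w.r.t. the sum of the two estimators.
        if rng.random() < exploration:
            action = int(rng.integers(num_actions))
        else:
            action = int(np.argmax(q_a[state] + q_b[state]))
        next_state, reward, done = env.step(action)

        # Polynomially decaying learning rate alpha_t = 1 / t^omega, omega in (0, 1).
        alpha = 1.0 / (t ** omega)

        # Randomly switch which estimator is updated; the greedy action is chosen
        # by one estimator and evaluated by the other, which is what mitigates
        # the overestimation bias of standard Q-learning.
        if rng.random() < 0.5:
            greedy = int(np.argmax(q_a[next_state]))
            target = reward + (0.0 if done else gamma * q_b[next_state, greedy])
            q_a[state, action] += alpha * (target - q_a[state, action])
        else:
            greedy = int(np.argmax(q_b[next_state]))
            target = reward + (0.0 if done else gamma * q_a[next_state, greedy])
            q_b[state, action] += alpha * (target - q_b[state, action])

        state = env.reset() if done else next_state
    return q_a, q_b
```

The fair-coin switch mirrors the random update choice in the original double Q-learning algorithm, and the step-size $1/t^\omega$ corresponds to the decay parameter $\omega$ appearing in the iteration bound quoted in the abstract.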
Related papers
- Two-Step Q-Learning [0.0]
The paper proposes a novel off-policy two-step Q-learning algorithm, without importance sampling.
Numerical experiments demonstrate the superior performance of both two-step Q-learning and its smooth variants.
arXiv Detail & Related papers (2024-07-02T15:39:00Z) - Finite-Time Error Bounds for Greedy-GQ [20.51105692499517]
We establish finite-time error bounds showing how fast the Greedy-GQ algorithm converges.
Our analysis also yields insight into choosing step-sizes for faster convergence.
arXiv Detail & Related papers (2022-09-06T15:04:57Z) - Sharper Convergence Guarantees for Asynchronous SGD for Distributed and
Federated Learning [77.22019100456595]
We study a training algorithm for distributed computation workers with varying communication frequency.
In this work, we obtain a tighter convergence rate of $\mathcal{O}\!\left(\sigma^2\epsilon^{-2} + \tau_{avg}\epsilon^{-1}\right)$.
We also show that the heterogeneity term in the rate is affected by the average delay within each worker.
arXiv Detail & Related papers (2022-06-16T17:10:57Z) - Tightening the Dependence on Horizon in the Sample Complexity of
Q-Learning [59.71676469100807]
This work sharpens the sample complexity of synchronous Q-learning to an order of $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^4\varepsilon^2}$ for any $0<\varepsilon<1$.
Our finding unveils the effectiveness of vanilla Q-learning, which matches that of speedy Q-learning without requiring extra computation and storage.
arXiv Detail & Related papers (2021-02-12T14:22:05Z) - Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth
Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only one data point each step.
Our results are expressed in the form of simultaneous primal and dual side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z) - Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and
Variance Reduction [63.41789556777387]
Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP).
We show that the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function is at most on the order of $\frac{1}{\mu_{\min}(1-\gamma)^5\varepsilon^2} + \frac{t_{mix}}{\mu_{\min}(1-\gamma)}$ up to some logarithmic factor.
arXiv Detail & Related papers (2020-06-04T17:51:00Z) - Finite-Time Last-Iterate Convergence for Multi-Agent Learning in Games [116.0771177871705]
We characterize the finite-time last-iterate convergence rate for joint OGD learning on $\lambda$-cocoercive games.
We show, via a novel double-stopping-time technique, that this adaptive algorithm achieves the same finite-time last-iterate convergence rate as its non-adaptive counterpart.
arXiv Detail & Related papers (2020-02-23T01:46:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.