Polyak-Ruppert Averaged Q-Learning is Statistically Efficient
- URL: http://arxiv.org/abs/2112.14582v1
- Date: Wed, 29 Dec 2021 14:47:56 GMT
- Title: Polyak-Ruppert Averaged Q-Learning is Statistically Efficient
- Authors: Xiang Li, Wenhao Yang, Zhihua Zhang, Michael I. Jordan
- Abstract summary: We study synchronous Q-learning with Polyak-Ruppert averaging (a.k.a., averaged Q-learning) in a $\gamma$-discounted MDP.
We establish asymptotic normality for the averaged iteration $\bar{\boldsymbol{Q}}_T$.
In short, our theoretical analysis shows averaged Q-learning is statistically efficient.
- Score: 90.14768299744792
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study synchronous Q-learning with Polyak-Ruppert averaging (a.k.a.,
averaged Q-learning) in a $\gamma$-discounted MDP. We establish asymptotic
normality for the averaged iteration $\bar{\boldsymbol{Q}}_T$. Furthermore, we
show that $\bar{\boldsymbol{Q}}_T$ is actually a regular asymptotically linear
(RAL) estimator for the optimal Q-value function $\boldsymbol{Q}^*$ with the
most efficient influence function. It implies the averaged Q-learning iteration
has the smallest asymptotic variance among all RAL estimators. In addition, we
present a non-asymptotic analysis for the $\ell_{\infty}$ error
$\mathbb{E}\|\bar{\boldsymbol{Q}}_T-\boldsymbol{Q}^*\|_{\infty}$, showing it
matches the instance-dependent lower bound as well as the optimal minimax
complexity lower bound. As a byproduct, we find the Bellman noise has
sub-Gaussian coordinates with variance $\mathcal{O}((1-\gamma)^{-1})$ instead
of the prevailing $\mathcal{O}((1-\gamma)^{-2})$ under the standard bounded
reward assumption. The sub-Gaussian result has the potential to improve the sample
complexity of many RL algorithms. In short, our theoretical analysis shows that
averaged Q-learning is statistically efficient.
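Since the abstract centers on a concrete recursion, a short sketch may help fix ideas. Below is a minimal Python sketch of synchronous Q-learning with Polyak-Ruppert (iterate) averaging on a tabular MDP with generative-model sampling; the function name, the rescaled linear step size $\eta_t = (1 + (1-\gamma)t)^{-1}$, and the toy MDP are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def averaged_q_learning(P, r, gamma, T, seed=0):
    """Synchronous Q-learning with Polyak-Ruppert (iterate) averaging.

    P: (S, A, S) array, where P[s, a] is the next-state distribution.
    r: (S, A) array of deterministic rewards.
    Returns the averaged iterate Q_bar_T = (1/T) * sum_t Q_t.
    """
    rng = np.random.default_rng(seed)
    S, A, _ = P.shape
    Q = np.zeros((S, A))      # current iterate Q_t
    Q_bar = np.zeros((S, A))  # running Polyak-Ruppert average

    for t in range(1, T + 1):
        # Rescaled linear step size (an illustrative choice).
        eta = 1.0 / (1.0 + (1.0 - gamma) * t)
        # Synchronous setting: draw one next state for every (s, a) pair.
        next_states = np.array(
            [[rng.choice(S, p=P[s, a]) for a in range(A)] for s in range(S)]
        )
        # Empirical Bellman update: r(s, a) + gamma * max_a' Q_t(s', a').
        target = r + gamma * Q.max(axis=1)[next_states]
        Q = (1.0 - eta) * Q + eta * target
        # Incremental form of Q_bar_t = ((t - 1) * Q_bar_{t-1} + Q_t) / t.
        Q_bar += (Q - Q_bar) / t

    return Q_bar

# Example usage on a tiny random MDP (illustrative only).
rng = np.random.default_rng(1)
S, A = 4, 2
P = rng.dirichlet(np.ones(S), size=(S, A))  # P[s, a] sums to 1 over s'
r = rng.uniform(0.0, 1.0, size=(S, A))      # bounded rewards in [0, 1]
Q_bar = averaged_q_learning(P, r, gamma=0.9, T=2000)
```

Averaging the iterates, rather than returning the last one, is what yields the optimal asymptotic variance discussed in the abstract; the last iterate alone does not enjoy the same efficiency guarantee.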
Related papers
- Multiple-policy Evaluation via Density Estimation [30.914344538340412]
We propose an algorithm named $\mathrm{CAESAR}$ for this problem.
Up to low-order and logarithmic terms, $\mathrm{CAESAR}$ achieves a sample complexity of $\tilde{O}\left(\frac{H^4}{\epsilon^2}\sum_{h=1}^{H}\max_{k\in[K]}\sum_{s,a}\frac{(d_h^{\pi^k}(s,a))^2}{\mu^*_h(s,a)}\right)$, where $d^{\pi}$...
arXiv Detail & Related papers (2024-03-29T23:55:25Z)
- Optimal score estimation via empirical Bayes smoothing [13.685846094715364]
We study the problem of estimating the score function of an unknown probability distribution $\rho^*$ from $n$ independent and identically distributed observations in $d$ dimensions.
We show that a regularized score estimator based on a Gaussian kernel attains this rate, which is shown to be optimal by a matching minimax lower bound.
arXiv Detail & Related papers (2024-02-12T16:17:40Z)
- Finite-time High-probability Bounds for Polyak-Ruppert Averaged Iterates of Linear Stochastic Approximation [22.51165277694864]
This paper provides a finite-time analysis of linear stochastic approximation (LSA) algorithms with fixed step size.
LSA is used to compute approximate solutions of a $d$-dimensional linear system.
arXiv Detail & Related papers (2022-07-10T14:36:04Z)
- Statistical Query Lower Bounds for List-Decodable Linear Regression [55.06171096484622]
We study the problem of list-decodable linear regression, where an adversary can corrupt a majority of the examples.
Our main result is a Statistical Query (SQ) lower bound of $d^{\mathrm{poly}(1/\alpha)}$ for this problem.
arXiv Detail & Related papers (2021-06-17T17:45:21Z)
- Tightening the Dependence on Horizon in the Sample Complexity of Q-Learning [59.71676469100807]
This work sharpens the sample complexity of synchronous Q-learning to an order of $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^4\varepsilon^2}$ for any $0 < \varepsilon < 1$.
Our finding unveils the effectiveness of vanilla Q-learning, which matches that of speedy Q-learning without requiring extra computation and storage.
arXiv Detail & Related papers (2021-02-12T14:22:05Z)
- Accelerating Optimization and Reinforcement Learning with Quasi-Stochastic Approximation [2.294014185517203]
This paper sets out to extend convergence theory to quasi-stochastic approximations.
It is illustrated with applications to gradient-free optimization and policy gradient algorithms for reinforcement learning.
arXiv Detail & Related papers (2020-09-30T04:44:45Z)
- Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
We study the problem of high-dimensional robust linear regression where a learner is given access to $n$ samples from the generative model $Y = \langle X, w^* \rangle + \epsilon$.
We propose estimators for this problem under two settings: (i) $X$ is L4-L2 hypercontractive, $\mathbb{E}[XX^\top]$ has bounded condition number, and $\epsilon$ has bounded variance; and (ii) $X$ is sub-Gaussian with identity second moment and $\epsilon$ is...
arXiv Detail & Related papers (2020-07-16T06:44:44Z)
- Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction [63.41789556777387]
Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP).
We show that the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function is at most on the order of $\frac{1}{\mu_{\min}(1-\gamma)^5\varepsilon^2} + \frac{t_{\mathrm{mix}}}{\mu_{\min}(1-\gamma)}$ up to some logarithmic factor.
arXiv Detail & Related papers (2020-06-04T17:51:00Z)
- Differentially Quantized Gradient Methods [53.3186247068836]
We show that Differentially Quantized Gradient Descent (DQ-GD) attains a linear contraction factor of $\max\{\sigma_{\mathrm{GD}}, \rho_n 2^{-R}\}$.
No algorithm within a certain class can converge faster than $\max\{\sigma_{\mathrm{GD}}, 2^{-R}\}$.
arXiv Detail & Related papers (2020-02-06T20:40:53Z)