Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and
Variance Reduction
- URL: http://arxiv.org/abs/2006.03041v3
- Date: Sun, 8 Aug 2021 03:42:12 GMT
- Title: Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and
Variance Reduction
- Authors: Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, Yuxin Chen
- Abstract summary: Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP).
We show that the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function is at most on the order of $\frac{1}{\mu_{\min}(1-\gamma)^5\varepsilon^2}+ \frac{t_{mix}}{\mu_{\min}(1-\gamma)}$ up to some logarithmic factor.
- Score: 63.41789556777387
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Asynchronous Q-learning aims to learn the optimal action-value function (or
Q-function) of a Markov decision process (MDP), based on a single trajectory of
Markovian samples induced by a behavior policy. Focusing on a
$\gamma$-discounted MDP with state space $\mathcal{S}$ and action space
$\mathcal{A}$, we demonstrate that the $\ell_{\infty}$-based sample complexity
of classical asynchronous Q-learning --- namely, the number of samples needed
to yield an entrywise $\varepsilon$-accurate estimate of the Q-function --- is
at most on the order of $\frac{1}{\mu_{\min}(1-\gamma)^5\varepsilon^2}+
\frac{t_{mix}}{\mu_{\min}(1-\gamma)}$ up to some logarithmic factor, provided
that a proper constant learning rate is adopted. Here, $t_{mix}$ and
$\mu_{\min}$ denote respectively the mixing time and the minimum state-action
occupancy probability of the sample trajectory. The first term of this bound
matches the sample complexity in the synchronous case with independent samples
drawn from the stationary distribution of the trajectory. The second term
reflects the cost taken for the empirical distribution of the Markovian
trajectory to reach a steady state, which is incurred at the very beginning and
becomes amortized as the algorithm runs. Encouragingly, the above bound
improves upon the state-of-the-art result \cite{qu2020finite} by a factor of at
least $|\mathcal{S}||\mathcal{A}|$ for all scenarios, and by a factor of at
least $t_{mix}|\mathcal{S}||\mathcal{A}|$ for any sufficiently small accuracy
level $\varepsilon$. Further, we demonstrate that the scaling on the effective
horizon $\frac{1}{1-\gamma}$ can be improved by means of variance reduction.
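For concreteness, below is a minimal sketch of the update rule that the bound above concerns: tabular asynchronous Q-learning run along a single Markovian trajectory with a constant learning rate. The environment interface (`env.reset()` returning a state index, `env.step(a)` returning `(next_state, reward)`) and the `behavior_policy` callable are hypothetical placeholders, and the second function is only a rough SVRG-style illustration of the variance-reduction idea (recentering the stochastic Bellman target around a fixed reference Q-function), not the paper's exact procedure.

```python
import numpy as np

def async_q_learning(env, behavior_policy, num_steps, eta, gamma,
                     num_states, num_actions):
    """Vanilla asynchronous Q-learning along a single Markovian trajectory.

    Only the visited (state, action) entry is updated at each step,
        Q(s_t, a_t) <- (1 - eta) * Q(s_t, a_t)
                       + eta * (r_t + gamma * max_a Q(s_{t+1}, a)),
    with a constant learning rate eta, as in the setting analyzed above.
    """
    Q = np.zeros((num_states, num_actions))
    s = env.reset()
    for _ in range(num_steps):
        a = behavior_policy(s)                # draw an action from the behavior policy
        s_next, r = env.step(a)               # one transition of the Markovian trajectory
        target = r + gamma * Q[s_next].max()  # empirical Bellman target
        Q[s, a] = (1 - eta) * Q[s, a] + eta * target
        s = s_next                            # keep walking the same trajectory
    return Q

def variance_reduced_update(Q, Q_ref, TQ_ref_hat, s, a, r, s_next, eta, gamma):
    """One SVRG-style variance-reduced update (a rough sketch, not the paper's
    exact algorithm): the stochastic Bellman target is recentered around a
    fixed reference Q_ref whose Bellman image TQ_ref_hat was pre-estimated
    from a separate batch of samples."""
    target = (r + gamma * Q[s_next].max()          # stochastic target for Q
              - (r + gamma * Q_ref[s_next].max())  # minus stochastic target for Q_ref
              + TQ_ref_hat[s, a])                  # plus its batch estimate
    Q[s, a] = (1 - eta) * Q[s, a] + eta * target
    return Q
```

With a suitably chosen constant learning rate, the first routine is the classical asynchronous Q-learning iteration whose $\ell_{\infty}$-based sample complexity is bounded above; the recentering in the second is the mechanism by which the dependence on the effective horizon $\frac{1}{1-\gamma}$ can be improved.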
Related papers
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Multiple-policy Evaluation via Density Estimation [30.914344538340412]
We propose an algorithm named $\mathrm{CAESAR}$ for this problem.
Up to low-order and logarithmic terms, $\mathrm{CAESAR}$ achieves a sample complexity of $\tilde{O}\left(\frac{H^4}{\epsilon^2}\sum_{h=1}^{H}\max_{k\in[K]}\sum_{s,a}\frac{(d^{\pi^k}_h(s,a))^2}{\mu^*_h(s,a)}\right)$, where $d^{\pi^k}_h$ denotes the visitation distribution of policy $\pi^k$ at step $h$.
arXiv Detail & Related papers (2024-03-29T23:55:25Z) - Optimal Sample Complexity for Average Reward Markov Decision Processes [1.0445957451908694]
We develop an estimator for the optimal policy of average reward MDPs with a sample complexity of $\widetilde{O}(|S||A|t_{\text{mix}}^2\epsilon^{-2})$.
This marks the first algorithm and analysis to reach the literature's lower bound.
arXiv Detail & Related papers (2023-10-13T03:08:59Z) - Settling the Sample Complexity of Model-Based Offline Reinforcement
Learning [50.5790774201146]
Offline reinforcement learning (RL) learns from pre-collected data without further exploration.
Prior algorithms or analyses either suffer from suboptimal sample complexities or incur high burn-in cost to reach sample optimality.
We demonstrate that the model-based (or "plug-in") approach achieves minimax-optimal sample complexity without burn-in cost.
arXiv Detail & Related papers (2022-04-11T17:26:19Z) - Sample-Efficient Reinforcement Learning for Linearly-Parameterized MDPs
with a Generative Model [3.749193647980305]
This paper considers a Markov decision process (MDP) that admits a set of state-action features.
We show that a model-based approach (resp. Q-learning) provably learns an $\varepsilon$-optimal policy with high probability.
arXiv Detail & Related papers (2021-05-28T17:49:39Z) - Tightening the Dependence on Horizon in the Sample Complexity of
Q-Learning [59.71676469100807]
This work sharpens the sample complexity of synchronous Q-learning to an order of $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^4\varepsilon^2}$ for any $0<\varepsilon<1$.
Our finding unveils the effectiveness of vanilla Q-learning, which matches that of speedy Q-learning without requiring extra computation and storage.
arXiv Detail & Related papers (2021-02-12T14:22:05Z) - Finite-Time Analysis for Double Q-learning [50.50058000948908]
We provide the first non-asymptotic, finite-time analysis for double Q-learning.
We show that both synchronous and asynchronous double Q-learning are guaranteed to converge to an $\epsilon$-accurate neighborhood of the global optimum.
arXiv Detail & Related papers (2020-09-29T18:48:21Z) - Breaking the Sample Size Barrier in Model-Based Reinforcement Learning
with a Generative Model [50.38446482252857]
This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator).
We first consider $\gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $\mathcal{S}$ and action space $\mathcal{A}$.
We prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level.
arXiv Detail & Related papers (2020-05-26T17:53:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.