$Q$-learning with Logarithmic Regret
- URL: http://arxiv.org/abs/2006.09118v2
- Date: Tue, 23 Feb 2021 11:44:44 GMT
- Title: $Q$-learning with Logarithmic Regret
- Authors: Kunhe Yang, Lin F. Yang, Simon S. Du
- Abstract summary: We prove that an optimistic $Q$-learning algorithm enjoys a $\mathcal{O}\left(\frac{SA\cdot \mathrm{poly}\left(H\right)}{\Delta_{\min}}\log\left(SAT\right)\right)$ cumulative regret bound, where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\Delta_{\min}$ is the minimum sub-optimality gap.
- Score: 60.24952657636464
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents the first non-asymptotic result showing that a model-free
algorithm can achieve a logarithmic cumulative regret for episodic tabular
reinforcement learning if there exists a strictly positive sub-optimality gap
in the optimal $Q$-function. We prove that the optimistic $Q$-learning studied
in [Jin et al. 2018] enjoys a ${\mathcal{O}}\left(\frac{SA\cdot
\mathrm{poly}\left(H\right)}{\Delta_{\min}}\log\left(SAT\right)\right)$
cumulative regret bound, where $S$ is the number of states, $A$ is the number
of actions, $H$ is the planning horizon, $T$ is the total number of steps, and
$\Delta_{\min}$ is the minimum sub-optimality gap. This bound matches the
information theoretical lower bound in terms of $S,A,T$ up to a
$\log\left(SA\right)$ factor. We further extend our analysis to the discounted
setting and obtain a similar logarithmic cumulative regret bound.
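For reference, the quantities in the bound follow the standard episodic-MDP definitions; the notation below (episode index $k$, per-step gap $\Delta_h(s,a)$) is a standard reconstruction and is not quoted verbatim from the paper:

```latex
% Cumulative regret over K episodes of horizon H (T = KH total steps):
\mathrm{Regret}(T) \;=\; \sum_{k=1}^{K} \Bigl( V^{*}_{1}(s^{k}_{1}) - V^{\pi_k}_{1}(s^{k}_{1}) \Bigr),
\qquad
% Minimum sub-optimality gap over all triples with a positive gap:
\Delta_{\min} \;=\; \min_{(s,a,h)\,:\,\Delta_{h}(s,a)>0} \Delta_{h}(s,a),
\qquad
\Delta_{h}(s,a) \;=\; V^{*}_{h}(s) - Q^{*}_{h}(s,a).
```

The algorithm to which the bound applies is the optimistic $Q$-learning of [Jin et al. 2018]: a $Q$-table initialized optimistically at $H$, updated with a decaying learning rate and a Hoeffding-style exploration bonus. The Python sketch below is a minimal illustration of that update rule, not the authors' implementation; the environment interface (`env.reset`, `env.step`), the bonus constant `c`, and the failure probability `p` are illustrative assumptions.

```python
import numpy as np

def optimistic_q_learning(env, S, A, H, K, c=1.0, p=0.01):
    """UCB-Hoeffding-style optimistic Q-learning (sketch, after Jin et al. 2018).

    Assumed environment interface (hypothetical):
        env.reset()       -> initial state in {0, ..., S-1}
        env.step(h, s, a) -> (next_state, reward in [0, 1])
    """
    T = K * H
    iota = np.log(S * A * T / p)                # log factor inside the bonus
    Q = np.full((H, S, A), float(H))            # optimistic initialization at H
    V = np.zeros((H + 1, S))
    V[:H] = H                                   # V_{H+1} = 0; earlier steps start at H
    N = np.zeros((H, S, A), dtype=int)          # visit counts

    for k in range(K):
        s = env.reset()
        for h in range(H):
            a = int(np.argmax(Q[h, s]))         # act greedily w.r.t. optimistic Q
            s_next, r = env.step(h, s, a)
            N[h, s, a] += 1
            t = N[h, s, a]
            alpha = (H + 1) / (H + t)           # learning rate alpha_t = (H+1)/(H+t)
            bonus = c * np.sqrt(H ** 3 * iota / t)  # Hoeffding-style exploration bonus
            target = r + V[h + 1, s_next] + bonus
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * target
            V[h, s] = min(float(H), Q[h, s].max())
            s = s_next
    return Q, V
```

Under a strictly positive gap $\Delta_{\min}$, the paper's analysis shows that this style of algorithm attains cumulative regret growing only logarithmically in $T$, rather than at the worst-case $\sqrt{T}$ rate.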
Related papers
- Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning [54.806166861456035]
We study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with a constraint on the number of batches.
We design a computationally efficient algorithm that achieves a near-optimal regret of $\tilde{O}\left(\sqrt{SAH^3K\ln(1/\delta)}\right)$ in $K$ episodes, where $\tilde{O}(\cdot)$ hides logarithmic terms of $(S,A,H,K)$.
Our technical contributions are two-fold: 1) a near-optimal design scheme to explore
arXiv Detail & Related papers (2022-10-15T09:22:22Z) - Optimal SQ Lower Bounds for Learning Halfspaces with Massart Noise [9.378684220920562]
We give the tightest statistical query (SQ) lower bounds for learning halfspaces in the presence of Massart noise.
We show that for arbitrary $\eta \in [0,1/2]$, every SQ algorithm achieving misclassification error better than $\eta$ requires queries of superpolynomial accuracy.
arXiv Detail & Related papers (2022-01-24T17:33:19Z) - Improved No-Regret Algorithms for Stochastic Shortest Path with Linear
MDP [31.62899359543925]
We introduce two new no-regret algorithms for the stochastic shortest path (SSP) problem with a linear MDP.
Our first algorithm is computationally efficient and achieves a regret bound of $\widetilde{O}\left(\sqrt{d^3 B_{\star}^2 T_{\star} K}\right)$.
Our second algorithm is computationally inefficient but achieves the first "horizon-free" regret bound $\widetilde{O}(d^{3.5} B_{\star}\sqrt{K})$ with no dependency on $T_{\star}$.
arXiv Detail & Related papers (2021-12-18T06:47:31Z) - Logarithmic Regret from Sublinear Hints [76.87432703516942]
We show that an algorithm can obtain $O(\log T)$ regret with just $O(\sqrt{T})$ hints under a natural query model.
We also show that $o(\sqrt{T})$ hints cannot guarantee better than $\Omega(\sqrt{T})$ regret.
arXiv Detail & Related papers (2021-11-09T16:50:18Z) - Nearly Horizon-Free Offline Reinforcement Learning [97.36751930393245]
We revisit offline reinforcement learning on episodic time-homogeneous Markov Decision Processes with $S$ states, $A$ actions and planning horizon $H$.
We obtain the first set of nearly $H$-free sample complexity bounds for evaluation and planning using the empirical MDPs.
arXiv Detail & Related papers (2021-03-25T18:52:17Z) - Logarithmic Regret for Reinforcement Learning with Linear Function
Approximation [99.59319332864129]
We show that logarithmic regret is attainable under two recently proposed linear MDP assumptions.
To the best of our knowledge, these are the first logarithmic regret bounds for RL with linear function approximation.
arXiv Detail & Related papers (2020-11-23T17:25:00Z) - Agnostic Q-learning with Function Approximation in Deterministic
Systems: Tight Bounds on Approximation Error and Sample Complexity [94.37110094442136]
We study the problem of agnostic $Q$-learning with function approximation in deterministic systems.
We show that if $\delta = O\left(\rho/\sqrt{\dim_E}\right)$, then one can find the optimal policy using $O\left(\dim_E\right)$.
arXiv Detail & Related papers (2020-02-17T18:41:49Z)