Regret and Sample Complexity of Online Q-Learning via Concentration of Stochastic Approximation with Time-Inhomogeneous Markov Chains
- URL: http://arxiv.org/abs/2602.16274v1
- Date: Wed, 18 Feb 2026 08:47:07 GMT
- Title: Regret and Sample Complexity of Online Q-Learning via Concentration of Stochastic Approximation with Time-Inhomogeneous Markov Chains
- Authors: Rahul Singh, Siddharth Chandak, Eric Moulines, Vivek S. Borkar, Nicholas Bambos
- Abstract summary: We present the first high-probability regret bound for classical online Q-learning in infinite-horizon discounted Markov decision processes. For sufficiently large gaps, the regret is sublinear, while for small gaps it deteriorates and can approach linear growth.
- Score: 23.565936864449636
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present the first high-probability regret bound for classical online Q-learning in infinite-horizon discounted Markov decision processes, without relying on optimism or bonus terms. We first analyze Boltzmann Q-learning with decaying temperature and show that its regret depends critically on the suboptimality gap of the MDP: for sufficiently large gaps, the regret is sublinear, while for small gaps it deteriorates and can approach linear growth. To address this limitation, we study a Smoothed $\epsilon_n$-Greedy exploration scheme that combines $\epsilon_n$-greedy and Boltzmann exploration, for which we prove a gap-robust regret bound of near-$\tilde{O}(N^{9/10})$. To analyze these algorithms, we develop a high-probability concentration bound for contractive Markovian stochastic approximation with iterate- and time-dependent transition dynamics. This bound may be of independent interest, as the contraction factor in our bound is governed by the mixing time and is allowed to converge to one asymptotically.
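The exploration scheme in the abstract mixes a uniform $\epsilon_n$ component with a Boltzmann (softmax) policy on top of a standard asynchronous Q-learning update. Below is a minimal tabular sketch of that combination; the schedules for `eps_n`, `tau_n`, and the step size are illustrative placeholders, not the paper's exact choices.

```python
import numpy as np

def smoothed_eps_greedy_q_learning(P, R, gamma, N, seed=0):
    """Tabular online Q-learning with a smoothed epsilon-greedy /
    Boltzmann exploration policy (illustrative sketch).

    P: transition tensor of shape (S, A, S); R: reward matrix (S, A).
    The schedules eps_n, tau_n and the step size are placeholders.
    """
    rng = np.random.default_rng(seed)
    S, A = R.shape
    Q = np.zeros((S, A))
    s = 0
    for n in range(1, N + 1):
        eps_n = n ** -0.1            # decaying exploration rate (placeholder)
        tau_n = 1.0 / np.log(n + 1)  # decaying Boltzmann temperature (placeholder)
        # Boltzmann (softmax) distribution over actions at the current state.
        logits = Q[s] / tau_n
        boltzmann = np.exp(logits - logits.max())
        boltzmann /= boltzmann.sum()
        # Smoothed epsilon-greedy: mix uniform exploration with Boltzmann.
        probs = eps_n / A + (1.0 - eps_n) * boltzmann
        a = rng.choice(A, p=probs)
        s_next = rng.choice(S, p=P[s, a])
        # Asynchronous Q-learning update along the single trajectory.
        alpha = 1.0 / (n ** 0.6)     # step size (placeholder)
        td_target = R[s, a] + gamma * Q[s_next].max()
        Q[s, a] += alpha * (td_target - Q[s, a])
        s = s_next
    return Q
```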
Related papers
- Achieving $\varepsilon^{-2}$ Dependence for Average-Reward Q-Learning with a New Contraction Principle [13.418969441591882]
We present the convergence rates of synchronous and asynchronous Q-learning for average-reward Markov decision processes. At the core of our analysis is the construction of an instance-dependent seminorm and showing that, after a lazy transformation of the Markov decision process, the Bellman operator becomes one-step contractive under this seminorm.
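The lazy transformation mentioned above replaces each transition kernel by a convex combination with the identity, which preserves optimal policies while scaling the gain, so every transition retains a fixed probability of staying put. A minimal sketch under that reading; the mixing weight `lam` is an illustrative placeholder.

```python
import numpy as np

def lazy_transform(P, R, lam=0.5):
    """Lazy transformation of an average-reward MDP (illustrative).

    Replaces P(.|s,a) by (1-lam)*I + lam*P and scales rewards by lam.
    If (g, h) solves the average-reward Bellman equation for (P, R),
    then (lam*g, h) solves it for the lazy MDP, so optimal policies
    are preserved while each state keeps probability >= 1-lam of
    staying put -- the structure that enables a contraction argument.
    """
    S, A, _ = P.shape
    I = np.eye(S)[:, None, :]           # identity kernel, broadcast over actions
    P_lazy = (1.0 - lam) * I + lam * P
    return P_lazy, lam * R
```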
arXiv Detail & Related papers (2026-01-29T05:54:31Z) - From stability of Langevin diffusion to convergence of proximal MCMC for non-log-concave sampling [21.958353508957988]
We consider the problem of sampling from non-log-concave distributions using the Unadjusted Langevin Algorithm (ULA). We prove the robustness of ULA under the assumption that the potential is convex at infinity.
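For reference, the Unadjusted Langevin Algorithm (ULA) discussed here is the Euler-Maruyama discretization of the Langevin diffusion $dX_t = -\nabla U(X_t)\,dt + \sqrt{2}\,dB_t$. A minimal generic sketch follows (the baseline update, not the paper's proximal variant).

```python
import numpy as np

def unadjusted_langevin(grad_U, x0, step, n_steps, seed=0):
    """Unadjusted Langevin Algorithm (ULA) for a target ~ exp(-U).

    grad_U: gradient of the potential U.  No Metropolis correction is
    applied, hence 'unadjusted'; step is the discretization step size.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - step * grad_U(x) + np.sqrt(2.0 * step) * noise
        samples.append(x.copy())
    return np.array(samples)
```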
arXiv Detail & Related papers (2025-05-20T10:29:57Z) - Mixing Times and Privacy Analysis for the Projected Langevin Algorithm under a Modulus of Continuity [8.475107963160779]
We study the mixing time of the projected Langevin algorithm (LA) and the privacy curve of noisy Stochastic Gradient Descent (SGD). We derive new mixing time bounds for the projected LA which are, in some important cases, dimension-free and poly-logarithmic. We also establish new upper bounds for the privacy curve of the subsampled noisy SGD algorithm.
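A projected Langevin step is a plain ULA step followed by a Euclidean projection back onto the constraint set. A minimal sketch, with projection onto a ball as a hypothetical example of the constraint set:

```python
import numpy as np

def projected_langevin(grad_U, project, x0, step, n_steps, seed=0):
    """Projected Langevin algorithm: one ULA step, then a Euclidean
    projection onto a convex set K (illustrative sketch).

    project: map x -> argmin_{y in K} ||y - x||.
    """
    rng = np.random.default_rng(seed)
    x = project(np.asarray(x0, dtype=float))
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = project(x - step * grad_U(x) + np.sqrt(2.0 * step) * noise)
    return x

# Example constraint set: the Euclidean ball of a given radius.
def project_ball(x, radius=1.0):
    norm = np.linalg.norm(x)
    return x if norm <= radius else (radius / norm) * x
```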
arXiv Detail & Related papers (2025-01-07T20:46:59Z) - Stochastic Approximation with Delayed Updates: Finite-Time Rates under Markovian Sampling [73.5602474095954]
We study the non-asymptotic performance of stochastic approximation schemes with delayed updates under Markovian sampling.
Our theoretical findings shed light on the finite-time effects of delays for a broad class of algorithms.
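A minimal sketch of stochastic approximation with delayed updates, where the direction at step $k$ is evaluated at a stale iterate from $\tau_k$ steps ago; the i.i.d. noise model via `grad_sample` is a simplification of the Markovian sampling analyzed in the paper.

```python
import numpy as np

def delayed_sa(grad_sample, theta0, step, delays, n_steps, seed=0):
    """Stochastic approximation with delayed updates (illustrative):
    the update at step k uses the stale iterate theta_{k - tau_k},
    mimicking asynchronous or delayed feedback.

    grad_sample(theta, rng): noisy update direction at theta.
    delays: iterable of integer delays tau_k >= 0.
    """
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    history = [theta.copy()]
    for k, tau in zip(range(n_steps), delays):
        stale = history[max(0, k - tau)]   # iterate from tau steps ago
        theta = theta - step * grad_sample(stale, rng)
        history.append(theta.copy())
    return theta
```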
arXiv Detail & Related papers (2024-02-19T03:08:02Z) - Differentiable Annealed Importance Sampling and the Perils of Gradient
Noise [68.44523807580438]
Annealed importance sampling (AIS) and related algorithms are highly effective tools for marginal likelihood estimation.
Differentiability is a desirable property as it would admit the possibility of optimizing marginal likelihood as an objective.
We propose a differentiable algorithm by abandoning Metropolis-Hastings steps, which further unlocks mini-batch computation.
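A minimal sketch of AIS with unadjusted Langevin transitions in place of Metropolis-Hastings, which is the modification that makes every step differentiable; the geometric interpolation path and the fixed step size are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def ais_no_mh(log_p0, log_p1, grad_log_pbeta, x0, betas, step, seed=0):
    """Annealed importance sampling with unadjusted Langevin transitions
    and no Metropolis-Hastings correction (illustrative sketch).

    Anneals along log pi_beta = (1-beta) log p0 + beta log p1 and
    accumulates the importance log-weight along the path.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    log_w = 0.0
    for b_prev, b_next in zip(betas[:-1], betas[1:]):
        # Weight increment from moving the target from b_prev to b_next.
        log_w += (b_next - b_prev) * (log_p1(x) - log_p0(x))
        # Unadjusted Langevin transition targeting pi_{b_next}.
        noise = rng.standard_normal(x.shape)
        x = x + step * grad_log_pbeta(x, b_next) + np.sqrt(2.0 * step) * noise
    # exp(log_w) estimates the ratio of normalizers (biased without MH).
    return x, log_w
```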
arXiv Detail & Related papers (2021-07-21T17:10:14Z) - Stochastic Gradient Descent-Ascent and Consensus Optimization for Smooth
Games: Convergence Analysis under Expected Co-coercivity [49.66890309455787]
We introduce the expected co-coercivity condition, explain its benefits, and provide the first last-iterate convergence guarantees of SGDA and SCO.
We prove linear convergence of both methods to a neighborhood of the solution when they use constant step-size.
Our convergence guarantees hold under the arbitrary sampling paradigm, and we give insights into the complexity of minibatching.
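For concreteness, a minimal sketch of simultaneous stochastic gradient descent-ascent (SGDA) with a constant step size; the gradient oracles `grad_x` and `grad_y` are assumed to be unbiased stochastic gradients of the min-max objective.

```python
import numpy as np

def sgda(grad_x, grad_y, x0, y0, step, n_steps, seed=0):
    """Stochastic gradient descent-ascent for min_x max_y f(x, y)
    with a constant step size (illustrative sketch).

    grad_x(x, y, rng), grad_y(x, y, rng): unbiased stochastic gradients.
    """
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x0, dtype=float), np.asarray(y0, dtype=float)
    for _ in range(n_steps):
        gx = grad_x(x, y, rng)                 # descent for the min player
        gy = grad_y(x, y, rng)                 # ascent for the max player
        x, y = x - step * gx, y + step * gy    # simultaneous update
    return x, y
```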
arXiv Detail & Related papers (2021-06-30T18:32:46Z) - High-probability Bounds for Non-Convex Stochastic Optimization with
Heavy Tails [55.561406656549686]
We consider non-convex stochastic optimization using first-order algorithms for which the gradient estimates may have heavy tails.
We show that a combination of gradient clipping, momentum, and normalized gradient descent converges to critical points with high probability at the best-known iteration complexity for smooth losses.
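A minimal sketch of the clipping + momentum + normalized-step combination described above; the clipping threshold and momentum parameter are illustrative, not the paper's tuned values.

```python
import numpy as np

def normalized_clipped_momentum(grad_sample, x0, step, clip, beta,
                                n_steps, seed=0):
    """Gradient clipping + momentum + normalized descent step, the kind
    of combination used under heavy-tailed noise (illustrative sketch).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    for _ in range(n_steps):
        g = grad_sample(x, rng)
        norm_g = np.linalg.norm(g)
        if norm_g > clip:                      # clip heavy-tailed gradients
            g = (clip / norm_g) * g
        m = beta * m + (1.0 - beta) * g        # momentum averaging
        x = x - step * m / (np.linalg.norm(m) + 1e-12)  # normalized step
    return x
```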
arXiv Detail & Related papers (2021-06-28T00:17:01Z) - Navigating to the Best Policy in Markov Decision Processes [68.8204255655161]
We investigate the active pure exploration problem in Markov Decision Processes.
The agent sequentially selects actions and, from the resulting system trajectory, aims at identifying the best policy as fast as possible.
arXiv Detail & Related papers (2021-06-05T09:16:28Z) - Nonlinear Two-Time-Scale Stochastic Approximation: Convergence and
Finite-Time Performance [1.52292571922932]
We study the convergence and finite-time performance of nonlinear two-time-scale stochastic approximation.
In particular, we show that the method achieves convergence in expectation at a rate $\mathcal{O}(1/k^{2/3})$, where $k$ is the number of iterations.
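A minimal sketch of a nonlinear two-time-scale scheme, in which the fast iterate is driven by a larger step size than the slow one; the step-size exponents below are illustrative placeholders, not the paper's choices.

```python
import numpy as np

def two_time_scale_sa(f, g, x0, y0, n_steps, seed=0):
    """Nonlinear two-time-scale stochastic approximation (sketch):
        x_{k+1} = x_k + a_k * (f(x_k, y_k) + noise)   (slow iterate)
        y_{k+1} = y_k + b_k * (g(x_k, y_k) + noise)   (fast iterate)
    with a_k / b_k -> 0, so y tracks a fixed point for each frozen x.
    """
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x0, dtype=float), np.asarray(y0, dtype=float)
    for k in range(1, n_steps + 1):
        a_k = 1.0 / k                 # slow time scale (placeholder)
        b_k = 1.0 / k ** (2 / 3)      # fast time scale, b_k >> a_k
        x = x + a_k * (f(x, y) + rng.standard_normal(x.shape))
        y = y + b_k * (g(x, y) + rng.standard_normal(y.shape))
    return x, y
```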
arXiv Detail & Related papers (2020-11-03T17:43:39Z) - Upper Confidence Primal-Dual Reinforcement Learning for CMDP with
Adversarial Loss [145.54544979467872]
We consider online learning for episodic constrained Markov decision processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
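The drift analysis concerns the Lagrange multiplier process; below is a minimal sketch of the projected dual step that primal-dual CMDP methods of this kind typically use. The bound `lmbda_max`, step size `eta`, and budget are hypothetical placeholders, not the paper's construction.

```python
import numpy as np

def dual_update(lmbda, episode_cost, budget, eta, lmbda_max):
    """Projected dual (Lagrange multiplier) step of a primal-dual CMDP
    method (illustrative sketch): increase lambda when the episode's
    constraint cost exceeds the budget, decrease it otherwise, and
    project onto [0, lmbda_max] so the multiplier process stays
    bounded -- the kind of drift property the regret analysis needs.
    """
    lmbda = lmbda + eta * (episode_cost - budget)
    return float(np.clip(lmbda, 0.0, lmbda_max))
```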
arXiv Detail & Related papers (2020-03-02T05:02:23Z)