Achieving $\varepsilon^{-2}$ Dependence for Average-Reward Q-Learning with a New Contraction Principle
- URL: http://arxiv.org/abs/2601.21301v1
- Date: Thu, 29 Jan 2026 05:54:31 GMT
- Title: Achieving $\varepsilon^{-2}$ Dependence for Average-Reward Q-Learning with a New Contraction Principle
- Authors: Zijun Chen, Zaiwei Chen, Nian Si, Shengbo Wang
- Abstract summary: We present the convergence rates of synchronous and asynchronous Q-learning for average-reward Markov decision processes. At the core of our analysis is the construction of an instance-dependent seminorm and showing that, after a lazy transformation of the Markov decision process, the Bellman operator becomes one-step contractive under this seminorm.
- Score: 13.418969441591882
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present the convergence rates of synchronous and asynchronous Q-learning for average-reward Markov decision processes, where the absence of contraction poses a fundamental challenge. Existing non-asymptotic results overcome this challenge by either imposing strong assumptions to enforce seminorm contraction or relying on discounted or episodic Markov decision processes as successive approximations, which either require unknown parameters or result in suboptimal sample complexity. In this work, under a reachability assumption, we establish optimal $\widetilde{O}(\varepsilon^{-2})$ sample complexity guarantees (up to logarithmic factors) for a simple variant of synchronous and asynchronous Q-learning that samples from the lazified dynamics, where the system remains in the current state with some fixed probability. At the core of our analysis is the construction of an instance-dependent seminorm and showing that, after a lazy transformation of the Markov decision process, the Bellman operator becomes one-step contractive under this seminorm.
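The abstract's central device is the lazy transformation: the system remains in its current state with some fixed probability, and under the resulting kernel the Bellman operator becomes one-step contractive under an instance-dependent seminorm. Below is a minimal, hypothetical sketch of synchronous Q-learning sampling from such lazified dynamics in a tabular MDP; the relative (RVI-style) anchoring at a reference state-action pair, the reward handling, and all parameter choices are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

def lazified_kernel(P, p_stay):
    """Mix the transition kernel with the identity: with probability
    p_stay the system remains in its current state."""
    S = P.shape[0]
    eye = np.eye(S)[:, None, :]          # shape (S, 1, S); broadcasts over the action axis
    return p_stay * eye + (1.0 - p_stay) * P

def sync_q_learning_lazy(P, r, p_stay=0.5, alpha=0.05, n_iters=20_000, seed=0):
    """Synchronous relative Q-learning on the lazified dynamics (illustrative sketch).

    P: (S, A, S) transition kernel, r: (S, A) reward table.
    A fixed reference state-action pair anchors the Q-values, since
    average-reward Q-values are only determined up to an additive constant.
    """
    rng = np.random.default_rng(seed)
    S, A = r.shape
    P_lazy = lazified_kernel(P, p_stay)
    Q = np.zeros((S, A))
    ref = (0, 0)                          # arbitrary reference pair
    for _ in range(n_iters):
        # Draw one next state for every (s, a) pair from the lazified kernel.
        next_states = np.array(
            [[rng.choice(S, p=P_lazy[s, a]) for a in range(A)] for s in range(S)]
        )
        target = r + Q[next_states].max(axis=-1) - Q[ref]
        Q += alpha * (target - Q)
    return Q
```

Note that the mixing step uses only a fixed probability `p_stay`, in line with the abstract's description of a variant that samples from lazified dynamics where the system remains in the current state with some fixed probability; no instance-dependent parameters are needed to run it.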
Related papers
- Regret and Sample Complexity of Online Q-Learning via Concentration of Stochastic Approximation with Time-Inhomogeneous Markov Chains [23.565936864449636]
We present the first high-probability regret bound for classical online Q-learning in infinite-horizon discounted Markov decision processes. For sufficiently large gaps, the regret is sublinear, while for small gaps it deteriorates and can approach linear growth.
arXiv Detail & Related papers (2026-02-18T08:47:07Z) - Average-reward reinforcement learning in semi-Markov decision processes via relative value iteration [7.465862205471524]
We show that the RVI Q-learning algorithm converges almost surely to a compact, connected subset of solutions to the average-reward optimality equation. To make full use of the SA framework, we introduce new monotonicity conditions for estimating the optimal reward rate.
arXiv Detail & Related papers (2025-12-05T23:49:07Z) - Finite-Sample Analysis of Policy Evaluation for Robust Average Reward Reinforcement Learning [50.81240969750462]
We present the first finite-sample analysis of policy evaluation in robust average-reward Markov Decision Processes (MDPs). We show that the robust Bellman operator is a contraction under a carefully constructed semi-norm, and develop a framework with controlled bias. Our method achieves the order-optimal sample complexity of $\widetilde{\mathcal{O}}(\varepsilon^{-2})$ for robust policy evaluation and robust average reward estimation.
arXiv Detail & Related papers (2025-02-24T03:55:09Z) - Sample Complexity of Linear Quadratic Regulator Without Initial Stability [7.245261469258501]
Inspired by REINFORCE, we introduce a novel receding-horizon algorithm for the Linear Quadratic Regulator (LQR) problem with unknown dynamics. Unlike prior methods, our algorithm avoids reliance on two-point gradient estimates while maintaining the same order of sample complexity.
arXiv Detail & Related papers (2025-02-20T02:44:25Z) - Sample Complexity of Variance-reduced Distributionally Robust Q-learning [17.96094201655567]
This paper presents two novel model-free algorithms, namely the distributionally robust Q-learning and its variance-reduced counterpart, that can effectively learn a robust policy despite distributional shifts.
A series of numerical experiments confirm the theoretical findings and the efficiency of the algorithms in handling distributional shifts.
arXiv Detail & Related papers (2023-05-28T19:40:46Z) - PAPAL: A Provable PArticle-based Primal-Dual ALgorithm for Mixed Nash Equilibrium [58.26573117273626]
We consider the problem of computing a mixed Nash equilibrium for a nonconvex-nonconcave objective function in two-player zero-sum continuous games.
We present novel insights into particle-based algorithms for continuous distribution strategies.
arXiv Detail & Related papers (2023-03-02T05:08:15Z) - Stochastic Gradient Descent-Ascent and Consensus Optimization for Smooth
Games: Convergence Analysis under Expected Co-coercivity [49.66890309455787]
We introduce the expected co-coercivity condition, explain its benefits, and provide the first last-iterate convergence guarantees of SGDA and SCO.
We prove linear convergence of both methods to a neighborhood of the solution when they use constant step-size.
Our convergence guarantees hold under the arbitrary sampling paradigm, and we give insights into the complexity of minibatching.
arXiv Detail & Related papers (2021-06-30T18:32:46Z) - On the Convergence of Stochastic Extragradient for Bilinear Games with
Restarted Iteration Averaging [96.13485146617322]
We present an analysis of the Stochastic ExtraGradient (SEG) method with constant step size, and propose variations of the method that yield favorable convergence.
We prove that, when augmented with iteration averaging, SEG converges to the Nash equilibrium, and that this rate is provably accelerated by incorporating a scheduled restarting procedure.
arXiv Detail & Related papers (2021-06-30T17:51:36Z) - Navigating to the Best Policy in Markov Decision Processes [68.8204255655161]
We investigate the active pure exploration problem in Markov Decision Processes.
The agent sequentially selects actions and, from the resulting system trajectory, aims at identifying the best policy as fast as possible.
arXiv Detail & Related papers (2021-06-05T09:16:28Z) - Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and
Variance Reduction [63.41789556777387]
Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP).
We show that the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function is at most on the order of $\frac{1}{\mu_{\min}(1-\gamma)^5\varepsilon^2} + \frac{t_{\mathrm{mix}}}{\mu_{\min}(1-\gamma)}$ up to some logarithmic factor.
arXiv Detail & Related papers (2020-06-04T17:51:00Z)