Average-reward reinforcement learning in semi-Markov decision processes via relative value iteration
- URL: http://arxiv.org/abs/2512.06218v1
- Date: Fri, 05 Dec 2025 23:49:07 GMT
- Title: Average-reward reinforcement learning in semi-Markov decision processes via relative value iteration
- Authors: Huizhen Yu, Yi Wan, Richard S. Sutton
- Abstract summary: We show that the RVI Q-learning algorithm converges almost surely to a compact, connected subset of solutions to the average-reward optimality equation. To make full use of the SA framework, we introduce new monotonicity conditions for estimating the optimal reward rate.
- Score: 7.465862205471524
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper applies the authors' recent results on asynchronous stochastic approximation (SA) in the Borkar-Meyn framework to reinforcement learning in average-reward semi-Markov decision processes (SMDPs). We establish the convergence of an asynchronous SA analogue of Schweitzer's classical relative value iteration algorithm, RVI Q-learning, for finite-space, weakly communicating SMDPs. In particular, we show that the algorithm converges almost surely to a compact, connected subset of solutions to the average-reward optimality equation, with convergence to a unique, sample path-dependent solution under additional stepsize and asynchrony conditions. Moreover, to make full use of the SA framework, we introduce new monotonicity conditions for estimating the optimal reward rate in RVI Q-learning. These conditions substantially expand the previously considered algorithmic framework and are addressed through novel arguments in the stability and convergence analysis of RVI Q-learning.
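For concreteness, the Python sketch below shows the general shape of the tabular update analyzed here. The reward-rate estimator f is an assumption for illustration (a fixed reference state-action value); the paper's monotonicity conditions admit a much broader class of estimators, so treat this as a sketch rather than the algorithm as specified in the paper.

```python
import numpy as np

def rvi_q_smdp_update(Q, s, a, r, tau, s_next, alpha, f):
    """One RVI Q-learning update on an observed SMDP transition (a sketch).

    Q     : (num_states, num_actions) table of relative action values
    r     : reward accumulated while sojourning in state s under action a
    tau   : sojourn (holding) time of the transition
    alpha : stepsize
    f     : functional of Q estimating the optimal reward rate
    """
    target = r - f(Q) * tau + Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Example with a fixed reference state-action pair as the rate estimator:
Q = np.zeros((3, 2))
Q = rvi_q_smdp_update(Q, s=0, a=1, r=1.0, tau=0.5, s_next=2,
                      alpha=0.1, f=lambda Q: Q[0, 0])
```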
Related papers
- Achieving $\varepsilon^{-2}$ Dependence for Average-Reward Q-Learning with a New Contraction Principle [13.418969441591882]
We present the convergence rates of synchronous and asynchronous Q-learning for average-reward Markov decision processes. At the core of our analysis is the construction of an instance-dependent seminorm and showing that, after a lazy transformation of the Markov decision process, the Bellman operator becomes one-step contractive under this seminorm.
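As rough background on the lazy transformation mentioned above, one common form is sketched below in LaTeX; the specific construction and the instance-dependent seminorm in the paper differ, so this is an orienting assumption, not a quotation.

```latex
% A common "lazy" transformation of the transition kernel
% (illustrative assumption, not the paper's exact construction):
\[
  \tilde{P}(s' \mid s, a) \;=\; \tfrac{1}{2}\,\mathbb{1}\{s' = s\}
    \;+\; \tfrac{1}{2}\,P(s' \mid s, a),
\]
% which preserves stationary distributions (hence average rewards) while
% damping the kernel's oscillatory part, so that a Bellman-type operator
% can contract in a suitable seminorm, e.g. one of span type:
% $\operatorname{sp}(v) = \max_s v(s) - \min_s v(s)$.
```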
arXiv Detail & Related papers (2026-01-29T05:54:31Z)
- Stochastic Optimization with Optimal Importance Sampling [49.484190237840714]
We propose an iterative algorithm that jointly updates the decision variable and the IS distribution without requiring time-scale separation between the two. Our method achieves the lowest possible asymptotic variance and guarantees global convergence under convexity of the objective and mild assumptions on the IS distribution family.
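A toy sketch of jointly updating a decision variable and an importance-sampling proposal with stochastic gradient steps is given below. The problem, the Gaussian proposal family, the stepsizes, and especially the proposal-update rule are all assumptions chosen for illustration; the paper's joint update rule is different and comes with guarantees.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: minimize F(x) = E_{xi ~ N(0,1)}[(x - xi)^2] (minimizer x* = 0),
# while drawing xi from a shifted Gaussian proposal q_theta = N(theta, 1).

def log_pdf(z, mu):
    return -0.5 * (z - mu) ** 2  # Gaussian log-density up to a shared constant

x, theta = 5.0, 0.0
alpha, beta = 0.05, 0.001  # stepsizes for the decision and the proposal

for _ in range(5000):
    xi = rng.normal(theta, 1.0)                        # sample from proposal
    w = np.exp(log_pdf(xi, 0.0) - log_pdf(xi, theta))  # importance weight p/q
    grad_x = w * 2.0 * (x - xi)                        # unbiased IS gradient
    x -= alpha * grad_x
    # Heuristic proposal update: stochastic descent on the second moment of
    # the weighted gradient via the score of q_theta (illustration only).
    theta += beta * grad_x ** 2 * (xi - theta)

print(round(x, 3))  # x approaches 0
```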
arXiv Detail & Related papers (2025-04-04T16:10:18Z)
- Asynchronous Stochastic Approximation and Average-Reward Reinforcement Learning [11.868402302316131]
This paper studies asynchronous stochastic approximation (SA) algorithms and their theoretical application to reinforcement learning. We first extend Borkar and Meyn's stability proof method to accommodate more general noise conditions, yielding broader convergence guarantees for asynchronous SA. We establish the convergence of an asynchronous SA analogue of Schweitzer's classical relative value iteration algorithm, RVI Q-learning.
arXiv Detail & Related papers (2024-09-05T21:23:51Z)
- On Convergence of Average-Reward Q-Learning in Weakly Communicating Markov Decision Processes [11.868402302316131]
This paper analyzes reinforcement learning (RL) algorithms for Markov decision processes (MDPs) under the average-reward criterion.
We focus on Q-learning algorithms based on relative value iteration (RVI), which are model-free stochastic analogues of the classical RVI method, for weakly communicating MDPs.
arXiv Detail & Related papers (2024-08-29T04:57:44Z)
- Variance-Reduced Cascade Q-learning: Algorithms and Sample Complexity [3.4376560669160394]
We introduce and analyze a novel model-free algorithm called Variance-Reduced Cascade Q-learning (VRCQ). VRCQ provides superior guarantees in the $\ell_\infty$-norm compared with existing model-free stochastic-approximation-type algorithms.
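For orientation, a generic SVRG-style variance-reduced Q-learning step looks like the sketch below; VRCQ's cascade construction nests reference estimates differently, so this is an assumed, simplified stand-in rather than the paper's algorithm.

```python
import numpy as np

def vr_q_update(Q, Q_ref, T_ref, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Generic SVRG-style variance-reduced Q-learning step (a sketch).

    Q_ref : snapshot Q-table from an earlier epoch
    T_ref : batch estimate of the Bellman operator applied to Q_ref
    """
    t_cur = r + gamma * Q[s_next].max()      # noisy target under current Q
    t_ref = r + gamma * Q_ref[s_next].max()  # same sample, reference table
    # Control-variate target: the two sampled terms share their noise,
    # which largely cancels, leaving the accurate batch term T_ref.
    target = t_cur - t_ref + T_ref[s, a]
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```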
arXiv Detail & Related papers (2024-08-13T00:34:33Z)
- Conditional Mean and Variance Estimation via k-NN Algorithm with Automated Variance Selection [9.943131787772323]
We introduce a novel k-nearest neighbor (k-NN) regression method for joint estimation of the conditional mean and variance. The proposed algorithm preserves the computational efficiency and manifold-learning capabilities of classical non-parametric k-NN models.
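A plain fixed-k baseline for this task is easy to state: use the neighbors' sample mean and sample variance. The sketch below assumes Euclidean distances and a fixed k; the paper's contribution additionally automates the variance-side selection of k.

```python
import numpy as np

def knn_mean_variance(X_train, y_train, x_query, k=10):
    """Jointly estimate E[Y | X = x] and Var[Y | X = x] at x_query.

    Baseline sketch: neighbors' sample mean and unbiased sample variance.
    """
    dists = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distances
    idx = np.argsort(dists)[:k]                        # k nearest neighbors
    mean = y_train[idx].mean()
    var = y_train[idx].var(ddof=1)                     # unbiased variance
    return mean, var
```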
arXiv Detail & Related papers (2024-02-02T18:54:18Z)
- Federated Conditional Stochastic Optimization [110.513884892319]
Conditional stochastic optimization has found applications in a wide range of machine learning tasks, such as invariant learning, AUPRC maximization, and MAML.
This paper proposes conditional stochastic optimization algorithms for the distributed federated learning setting.
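For background, conditional stochastic optimization problems are typically written with a nested expectation, as in the standard form below (stated as general background, not quoted from the paper):

```latex
% Standard conditional stochastic optimization problem (background form):
\[
  \min_{x} \; F(x) \;=\;
    \mathbb{E}_{\xi}\Bigl[ f_{\xi}\bigl( \mathbb{E}_{\eta \mid \xi}
      \bigl[ g_{\eta}(x, \xi) \bigr] \bigr) \Bigr],
\]
% where the nested inner expectation makes unbiased gradient estimates
% unavailable in general; a federated variant distributes the outer
% sampling across clients.
```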
arXiv Detail & Related papers (2023-10-04T01:47:37Z)
- A Deep Unrolling Model with Hybrid Optimization Structure for Hyperspectral Image Deconvolution [50.13564338607482]
We propose a novel optimization framework for the hyperspectral deconvolution problem, called DeepMix. It consists of three distinct modules: a data consistency module, a module that enforces the effect of the handcrafted regularizers, and a denoising module. The denoising module is context-aware, designed to sustain the advancements achieved by the cooperative efforts of the other modules.
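A schematic unrolled loop with this three-module pattern might look as follows; the concrete operators below (a gradient step, soft-thresholding, and a mean-shrink stand-in for the learned denoiser) are illustrative assumptions, not the DeepMix modules.

```python
import numpy as np

def unrolled_deconv_sketch(y, H, iters=5, step=0.1, lam=0.01):
    """Schematic unrolled deconvolution loop (illustration only).

    y : observed blurred signal (vector), H : blur operator (matrix).
    Keep step below 1 / ||H||_2^2 so the gradient step is stable.
    """
    x = y.copy()
    for _ in range(iters):
        # data-consistency module: gradient step on ||Hx - y||^2 / 2
        x = x - step * H.T @ (H @ x - y)
        # handcrafted-regularizer module: soft-thresholding (sparsity prior)
        x = np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
        # denoising module: DeepMix uses a learned, context-aware denoiser;
        # a mild shrink toward the local mean stands in for it here
        x = 0.9 * x + 0.1 * x.mean()
    return x
```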
arXiv Detail & Related papers (2023-06-10T08:25:16Z)
- Stochastic Gradient Descent-Ascent and Consensus Optimization for Smooth Games: Convergence Analysis under Expected Co-coercivity [49.66890309455787]
We introduce the expected co-coercivity condition, explain its benefits, and provide the first last-iterate convergence guarantees of stochastic gradient descent-ascent (SGDA) and stochastic consensus optimization (SCO).
We prove linear convergence of both methods to a neighborhood of the solution when they use constant step-size.
Our convergence guarantees hold under the arbitrary sampling paradigm, and we give insights into the complexity of minibatching.
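To make the update concrete, here is a toy SGDA loop with constant step size on a small smooth game; the game, noise model, and step size are assumptions chosen so that a strong-monotonicity-style condition plausibly holds, and the iterates settle in a neighborhood of the solution as the summary describes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Smooth toy game: min_x max_y f(x, y) = x*y + 0.1*x**2 - 0.1*y**2,
# with small additive noise standing in for stochastic gradients.
def stoch_grads(x, y):
    gx = y + 0.2 * x + 0.01 * rng.normal()  # df/dx + noise
    gy = x - 0.2 * y + 0.01 * rng.normal()  # df/dy + noise
    return gx, gy

x, y, step = 1.0, 1.0, 0.05
for _ in range(5000):
    gx, gy = stoch_grads(x, y)
    x, y = x - step * gx, y + step * gy  # simultaneous descent-ascent

print(round(x, 3), round(y, 3))  # settles near the solution (0, 0)
```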
arXiv Detail & Related papers (2021-06-30T18:32:46Z)
- On the Convergence of Stochastic Extragradient for Bilinear Games with Restarted Iteration Averaging [96.13485146617322]
We analyze the stochastic ExtraGradient (SEG) method with constant step size and present variations of the method that yield favorable convergence. We prove that when augmented with iteration averaging, SEG provably converges to the Nash equilibrium, and that this convergence is provably accelerated by incorporating a scheduled restarting procedure.
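The following toy sketch runs SEG with constant step size on a bilinear game and restarts from the running iterate average on a schedule; the specific game, noise level, and restart period are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Bilinear game min_x max_y f(x, y) = x*y with noisy gradient oracles.
def stoch_grads(x, y):
    return y + 0.05 * rng.normal(), x + 0.05 * rng.normal()

x, y = 1.0, 1.0
gamma, horizon, restart = 0.1, 2000, 200
avg_x = avg_y = 0.0
count = 0

for t in range(horizon):
    gx, gy = stoch_grads(x, y)                     # extrapolation gradients
    x_mid, y_mid = x - gamma * gx, y + gamma * gy  # extrapolated point
    gx2, gy2 = stoch_grads(x_mid, y_mid)           # gradients at midpoint
    x, y = x - gamma * gx2, y + gamma * gy2        # extragradient update
    avg_x, avg_y, count = avg_x + x, avg_y + y, count + 1
    if (t + 1) % restart == 0:                     # scheduled restart from
        x, y = avg_x / count, avg_y / count        # the running average
        avg_x = avg_y = 0.0
        count = 0

print(round(x, 3), round(y, 3))  # approaches the Nash equilibrium (0, 0)
```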
arXiv Detail & Related papers (2021-06-30T17:51:36Z)
- Momentum Q-learning with Finite-Sample Convergence Guarantee [49.38471009162477]
This paper analyzes a class of momentum-based Q-learning algorithms with finite-sample guarantee.
We establish the convergence guarantee for MomentumQ with linear function approximation and Markovian sampling.
We demonstrate through various experiments that the proposed MomentumQ outperforms other momentum-based Q-learning algorithms.
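A generic momentum-based tabular Q-learning step has the shape below; MomentumQ itself blends two momentum terms and differs in detail, so this is an assumed simplification rather than the paper's update.

```python
import numpy as np

def momentum_q_update(Q, M, s, a, r, s_next, alpha=0.1, beta=0.9, gamma=0.99):
    """One generic momentum-smoothed tabular Q-learning step (a sketch).

    Q : (nS, nA) action-value table    M : (nS, nA) momentum table
    """
    td = r + gamma * Q[s_next].max() - Q[s, a]    # TD error
    M[s, a] = beta * M[s, a] + (1.0 - beta) * td  # smoothed update direction
    Q[s, a] += alpha * M[s, a]
    return Q, M
```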
arXiv Detail & Related papers (2020-07-30T12:27:03Z)
- A Dynamical Systems Approach for Convergence of the Bayesian EM Algorithm [59.99439951055238]
We show how (discrete-time) Lyapunov stability theory can serve as a powerful tool to aid, or even lead, in the analysis (and potential design) of optimization algorithms that are not necessarily gradient-based.
The particular ML problem that this paper focuses on is that of parameter estimation in an incomplete-data Bayesian framework via the popular optimization algorithm known as maximum a posteriori expectation-maximization (MAP-EM).
We show that fast convergence (linear or quadratic) is achieved, which could have been difficult to unveil without our adopted systems-and-control (S&C) approach.
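In outline, the dynamical-systems view treats the MAP-EM iteration as a fixed-point map and certifies convergence with a Lyapunov function; the generic form below is standard EM material, not quoted from the paper.

```latex
% MAP-EM as a discrete-time dynamical system $\theta_{k+1} = F(\theta_k)$:
\[
  F(\theta) \;=\; \arg\max_{\theta'} \;
    \mathbb{E}_{z \sim p(\cdot \mid x,\, \theta)}
      \bigl[ \log p(x, z \mid \theta') \bigr] \;+\; \log p(\theta').
\]
% A Lyapunov analysis exhibits a function V nonincreasing along iterates,
% e.g. the negative log-posterior $V(\theta) = -\log p(\theta \mid x)$,
% which MAP-EM decreases monotonically.
```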
arXiv Detail & Related papers (2020-06-23T01:34:18Z)