Asynchronous Stochastic Approximation and Average-Reward Reinforcement Learning
- URL: http://arxiv.org/abs/2409.03915v2
- Date: Fri, 02 May 2025 06:29:26 GMT
- Title: Asynchronous Stochastic Approximation and Average-Reward Reinforcement Learning
- Authors: Huizhen Yu, Yi Wan, Richard S. Sutton
- Abstract summary: This paper studies asynchronous stochastic approximation (SA) algorithms and their theoretical application to reinforcement learning. We first extend Borkar and Meyn's stability proof method to accommodate more general noise conditions, yielding broader convergence guarantees for asynchronous SA. We establish the convergence of an asynchronous SA analogue of Schweitzer's classical relative value iteration algorithm, RVI Q-learning.
- Score: 11.868402302316131
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies asynchronous stochastic approximation (SA) algorithms and their theoretical application to reinforcement learning in semi-Markov decision processes (SMDPs) with an average-reward criterion. We first extend Borkar and Meyn's stability proof method to accommodate more general noise conditions, yielding broader convergence guarantees for asynchronous SA. To sharpen the convergence analysis, we further examine shadowing properties in the asynchronous setting, building on a dynamical-systems approach of Hirsch and Benaïm. Leveraging these SA results, we establish the convergence of an asynchronous SA analogue of Schweitzer's classical relative value iteration algorithm, RVI Q-learning, for finite-space, weakly communicating SMDPs. Moreover, to make full use of these SA results in this application, we introduce new monotonicity conditions for estimating the optimal reward rate in RVI Q-learning. These conditions substantially expand the previously considered algorithmic framework, and we address them with novel arguments in the stability and convergence analysis of RVI Q-learning.
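For readers who want the core recursion: below is a minimal tabular sketch of the asynchronous RVI Q-learning update for average-reward SMDPs that the abstract describes. The reward-rate estimator f is left abstract, since the paper's new monotonicity conditions admit a broad family of such functions; the helper name rvi_q_update, the reference-pair estimator f_ref, and all constants are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def rvi_q_update(Q, s, a, r, tau, s_next, alpha, f):
    """One asynchronous RVI Q-learning update for an average-reward SMDP.

    Q      : (num_states, num_actions) array, updated in place at (s, a)
             only (asynchrony: a single component per step)
    r      : reward accumulated over the sojourn in state s under action a
    tau    : holding time of the transition; tau = 1 recovers the MDP case
    f      : callable Q -> scalar estimating the optimal reward rate
    """
    target = r - f(Q) * tau + Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# One classical (illustrative) choice of reward-rate estimator: the value of
# a fixed reference state-action pair, as in Abounadi-Bertsekas-Borkar.
f_ref = lambda Q: Q[0, 0]
```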
Related papers
- Achieving $\varepsilon^{-2}$ Dependence for Average-Reward Q-Learning with a New Contraction Principle [13.418969441591882]
We present the convergence rates of synchronous and asynchronous Q-learning for average-reward Markov decision processes. At the core of our analysis is the construction of an instance-dependent seminorm and showing that, after a lazy transformation of the Markov decision process, the Bellman operator becomes one-step contractive under this seminorm.
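As a quick illustration of the lazy transformation this summary refers to, under the standard construction (an assumption here, not a detail taken from the paper): mixing the kernel with the identity leaves stationary distributions, and hence average rewards, unchanged.

```python
import numpy as np

def lazy_mdp(P, lam):
    """Lazy transformation of a transition kernel: with probability 1 - lam
    the chain stays put, with probability lam it moves according to P.  A
    stationary distribution of P is also stationary for the lazy kernel, so
    the average reward is unchanged -- which is what makes the transformation
    admissible in average-reward analysis.
    """
    n = P.shape[0]
    return (1.0 - lam) * np.eye(n) + lam * P

# Quick check on a toy two-state chain (values are illustrative):
P = np.array([[0.9, 0.1], [0.5, 0.5]])
mu = np.array([5 / 6, 1 / 6])                  # satisfies mu @ P == mu
assert np.allclose(mu @ lazy_mdp(P, 0.3), mu)  # still stationary after laziness
```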
arXiv Detail & Related papers (2026-01-29T05:54:31Z) - Average-reward reinforcement learning in semi-Markov decision processes via relative value iteration [7.465862205471524]
We show that the RVI Q-learning algorithm converges almost surely to a compact, connected subset of solutions to the average-reward optimality equation.<n>To make full use of the SA framework, we introduce new monotonicity conditions for estimating the optimal reward rate.
arXiv Detail & Related papers (2025-12-05T23:49:07Z) - Asynchronous Policy Gradient Aggregation for Efficient Distributed Reinforcement Learning [55.50683337004406]
We introduce two new algorithms, Rennala NIGT and Malenia NIGT, which implement asynchronous policy gradient aggregation.<n>In the homogeneous setting, Rennala NIGT provably improves the total computational and communication complexity while supporting the AllReduce operation.<n>In the heterogeneous setting, Malenia NIGT simultaneously handles asynchronous computations and heterogeneous environments with strictly better theoretical guarantees.
arXiv Detail & Related papers (2025-09-29T05:38:42Z) - Taming Polysemanticity in LLMs: Provable Feature Recovery via Sparse Autoencoders [50.52694757593443]
Existing SAE training algorithms often lack rigorous mathematical guarantees and suffer from practical limitations. We first propose a novel statistical framework for the feature recovery problem, which includes a new notion of feature identifiability. We introduce a new SAE training algorithm based on "bias adaptation", a technique that adaptively adjusts neural network bias parameters to ensure appropriate activation sparsity.
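The summary's "bias adaptation" can be pictured with a rough sketch: nudge each hidden unit's bias so its empirical firing rate tracks a target sparsity level. Everything below (function name, update rule, hyperparameters) is a hypothetical illustration of the generic idea, not the paper's algorithm or its guarantees.

```python
import numpy as np

def bias_adaptation_step(b, activations, target_rate, eta):
    """One hypothetical bias-adaptation step for a ReLU sparse autoencoder:
    lower the bias of units that fire more often than the target rate and
    raise it for units that fire less, steering the layer toward the
    desired activation sparsity.

    b           : (num_units,) bias vector
    activations : (batch_size, num_units) pre-threshold ReLU outputs
    """
    firing_rate = (activations > 0).mean(axis=0)  # fraction of inputs activating each unit
    return b - eta * (firing_rate - target_rate)  # over-active -> lower bias, under-active -> raise it
```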
arXiv Detail & Related papers (2025-06-16T20:58:05Z) - Uncertainty quantification for Markov chains with application to temporal difference learning [63.49764856675643]
We develop novel high-dimensional concentration inequalities and Berry-Esseen bounds for vector- and matrix-valued functions of Markov chains.
We analyze the TD learning algorithm, a widely used method for policy evaluation in reinforcement learning.
arXiv Detail & Related papers (2025-02-19T15:33:55Z) - Statistical Inference for Temporal Difference Learning with Linear Function Approximation [62.69448336714418]
Temporal Difference (TD) learning, arguably the most widely used method for policy evaluation, serves as a natural framework for this purpose.
In this paper, we study the consistency properties of TD learning with Polyak-Ruppert averaging and linear function approximation, and obtain three significant improvements over existing results.
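For concreteness, here is a minimal sketch of the estimator such inference results are usually stated for: TD(0) with linear function approximation, returning the Polyak-Ruppert average of the iterates. The interface and the constant step size are illustrative assumptions.

```python
import numpy as np

def td0_polyak(phi, transitions, gamma, alpha):
    """TD(0) with linear function approximation plus Polyak-Ruppert averaging.

    phi         : (num_states, d) feature matrix
    transitions : iterable of (s, r, s_next) tuples from the evaluated policy
    Returns the running average theta_bar, the iterate for which averaged-TD
    guarantees are typically stated.
    """
    d = phi.shape[1]
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    for t, (s, r, s_next) in enumerate(transitions, start=1):
        td_error = r + gamma * phi[s_next] @ theta - phi[s] @ theta
        theta += alpha * td_error * phi[s]
        theta_bar += (theta - theta_bar) / t  # running Polyak-Ruppert average
    return theta_bar
```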
arXiv Detail & Related papers (2024-10-21T15:34:44Z) - Online Statistical Inference for Time-varying Sample-averaged Q-learning [2.2374171443798034]
This paper introduces a time-varying batch-averaged Q-learning algorithm, termed sample-averaged Q-learning.
We develop a novel framework that provides insights into the normality of the sample-averaged algorithm under mild conditions.
Numerical experiments conducted on classic OpenAI Gym environments show that the time-varying sample-averaged Q-learning method consistently outperforms both single-sample and constant-batch Q-learning.
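A minimal sketch of one sample-averaged update, assuming a batch of i.i.d. draws of (reward, next state) from the current state-action pair is available; the paper's time-varying batch-size schedule and its inference machinery are not reproduced here.

```python
import numpy as np

def sample_averaged_q_step(Q, s, a, sample_batch, gamma, alpha):
    """One sample-averaged Q-learning step (a sketch of the general idea):
    average the TD targets over a batch of draws (r, s') from (s, a) rather
    than using a single transition; the batch size may vary with time.
    """
    targets = [r + gamma * Q[s_next].max() for r, s_next in sample_batch]
    Q[s, a] += alpha * (np.mean(targets) - Q[s, a])
    return Q
```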
arXiv Detail & Related papers (2024-10-14T17:17:19Z) - On Convergence of Average-Reward Q-Learning in Weakly Communicating Markov Decision Processes [11.868402302316131]
This paper analyzes reinforcement learning (RL) algorithms for Markov decision processes (MDPs) under the average-reward criterion.
We focus on Q-learning algorithms based on relative value iteration (RVI), which are model-free stochastic analogues of the classical RVI method for weakly communicating MDPs.
arXiv Detail & Related papers (2024-08-29T04:57:44Z) - A Note on Stability in Asynchronous Stochastic Approximation without Communication Delays [11.868402302316131]
We study asynchronous stochastic approximation algorithms without communication delays.
Our main contribution is a stability proof for these algorithms.
We discuss their application to important average-reward reinforcement learning problems.
arXiv Detail & Related papers (2023-12-22T22:18:13Z) - Federated Conditional Stochastic Optimization [110.513884892319]
Conditional stochastic optimization has found applications in a wide range of machine learning tasks, such as invariant learning, AUPRC maximization, and MAML.
This paper proposes algorithms for federated conditional stochastic optimization.
arXiv Detail & Related papers (2023-10-04T01:47:37Z) - Stability and Generalization of Stochastic Compositional Gradient Descent Algorithms [61.59448949684493]
We provide a stability and generalization analysis of stochastic compositional gradient descent algorithms built from training examples.
We establish the uniform stability results for two popular compositional gradient descent algorithms, namely SCGD and SCSC.
We derive dimension-independent excess risk bounds for SCGD and SCSC by trading off their stability results and optimization errors.
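For reference, a minimal sketch of the SCGD recursion analyzed here: a fast-time-scale running estimate of the inner function value, followed by a gradient step through the composition. The oracle interfaces below are assumptions for illustration; SCSC differs mainly in how the inner estimate is corrected between iterates.

```python
import numpy as np

def scgd(x0, sample_g, sample_jac_g, sample_grad_f, alpha, beta, num_steps):
    """Stochastic Compositional Gradient Descent (SCGD) for min_x f(g(x)).

    sample_g      : stochastic oracle for the inner value g(x)
    sample_jac_g  : stochastic oracle for the Jacobian of g at x
    sample_grad_f : stochastic oracle for the gradient of f at y
    """
    x = np.asarray(x0, dtype=float)
    y = sample_g(x)                                 # initialize inner estimate
    for _ in range(num_steps):
        y = (1.0 - beta) * y + beta * sample_g(x)   # fast-time-scale tracking of g(x)
        x = x - alpha * sample_jac_g(x).T @ sample_grad_f(y)  # outer gradient step
    return x
```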
arXiv Detail & Related papers (2023-07-07T02:40:09Z) - The Efficacy of Pessimism in Asynchronous Q-Learning [17.193902915070506]
We develop an algorithmic framework that incorporates the principle of pessimism into asynchronous Q-learning.
This framework leads to, among other things, improved sample efficiency and enhanced adaptivity in the presence of near-expert data.
Our results deliver the first theoretical support for the use of the pessimism principle in the presence of Markovian non-i.i.d. data.
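The pessimism principle here can be sketched as a lower-confidence-bound penalty on the TD target that shrinks with the visit count, so rarely observed state-action pairs are valued conservatively. The update below is a generic illustration under that reading, not the paper's exact algorithm or bonus schedule.

```python
import numpy as np

def pessimistic_q_update(Q, counts, s, a, r, s_next, gamma, c=1.0):
    """Asynchronous Q-learning with an LCB-style pessimism penalty (sketch).

    counts : (num_states, num_actions) visit counts, updated in place
    c      : illustrative penalty scale
    """
    counts[s, a] += 1
    n = counts[s, a]
    alpha = 1.0 / n                        # illustrative step size
    penalty = c / np.sqrt(n)               # pessimism: shrinks with visits
    target = r + gamma * Q[s_next].max() - penalty
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```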
arXiv Detail & Related papers (2022-03-14T17:59:01Z) - On the Convergence of Stochastic Extragradient for Bilinear Games with Restarted Iteration Averaging [96.13485146617322]
We present an analysis of the stochastic ExtraGradient (SEG) method with constant step size, and propose variations of the method that yield favorable convergence.
We prove that when augmented with averaging, SEG provably converges to the Nash equilibrium, and such a rate is provably accelerated by incorporating a scheduled restarting procedure.
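A minimal sketch of SEG with iteration averaging and scheduled restarts on the bilinear game min_x max_y x^T A y, assuming additive Gaussian gradient noise; the step size, noise model, and restart schedule are all illustrative.

```python
import numpy as np

def seg_restarted_averaging(A, x0, y0, eta, num_epochs, steps_per_epoch, noise=0.01):
    """Stochastic ExtraGradient with averaging and scheduled restarts (sketch)."""
    rng = np.random.default_rng(0)
    x, y = np.array(x0, dtype=float), np.array(y0, dtype=float)
    for _ in range(num_epochs):
        x_avg, y_avg = np.zeros_like(x), np.zeros_like(y)
        for t in range(1, steps_per_epoch + 1):
            gx = A @ y + noise * rng.standard_normal(x.shape)
            gy = A.T @ x + noise * rng.standard_normal(y.shape)
            x_half, y_half = x - eta * gx, y + eta * gy   # extrapolation step
            gx2 = A @ y_half + noise * rng.standard_normal(x.shape)
            gy2 = A.T @ x_half + noise * rng.standard_normal(y.shape)
            x, y = x - eta * gx2, y + eta * gy2           # update step
            x_avg += (x - x_avg) / t                      # running averages
            y_avg += (y - y_avg) / t
        x, y = x_avg, y_avg   # scheduled restart from the averaged iterate
    return x, y
```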
arXiv Detail & Related papers (2021-06-30T17:51:36Z) - A Dynamical Systems Approach for Convergence of the Bayesian EM Algorithm [59.99439951055238]
We show how (discrete-time) Lyapunov stability theory can serve as a powerful tool to aid, or even lead, in the analysis (and potential design) of optimization algorithms that are not necessarily gradient-based.
The particular ML problem that this paper focuses on is that of parameter estimation in an incomplete-data Bayesian framework via the popular optimization algorithm known as maximum a posteriori expectation-maximization (MAP-EM).
We show that fast convergence (linear or quadratic) is achieved, which could have been difficult to unveil without our adopted S&C (systems and control) approach.
arXiv Detail & Related papers (2020-06-23T01:34:18Z) - Fast Objective & Duality Gap Convergence for Non-Convex Strongly-Concave Min-Max Problems with PL Condition [52.08417569774822]
This paper focuses on stochastic methods for solving smooth non-convex strongly-concave min-max problems, which have received increasing attention due to their applications in deep learning (e.g., deep AUC maximization).
arXiv Detail & Related papers (2020-06-12T00:32:21Z) - Joint Stochastic Approximation and Its Application to Learning Discrete Latent Variable Models [19.07718284287928]
We show that the difficulty of obtaining reliable gradients for the inference model and the drawback of indirectly optimizing the target log-likelihood can be gracefully addressed.
We propose to directly maximize the target log-likelihood and simultaneously minimize the inclusive divergence between the posterior and the inference model.
The resulting learning algorithm is called joint SA (JSA).
arXiv Detail & Related papers (2020-05-28T13:50:08Z)