A Lyapunov Theory for Finite-Sample Guarantees of Asynchronous
Q-Learning and TD-Learning Variants
- URL: http://arxiv.org/abs/2102.01567v4
- Date: Mon, 4 Sep 2023 20:15:15 GMT
- Title: A Lyapunov Theory for Finite-Sample Guarantees of Asynchronous
Q-Learning and TD-Learning Variants
- Authors: Zaiwei Chen, Siva Theja Maguluri, Sanjay Shakkottai, and Karthikeyan
Shanmugam
- Abstract summary: This paper develops a framework to study finite-sample convergence guarantees of a class of value-based asynchronous RL algorithms.
As a by-product, we provide theoretical insights into the bias-variance trade-off, i.e., efficiency of bootstrapping in RL.
- Score: 39.28675942566465
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper develops a unified framework to study finite-sample convergence
guarantees of a large class of value-based asynchronous reinforcement learning
(RL) algorithms. We do this by first reformulating the RL algorithms as
\textit{Markovian Stochastic Approximation} (SA) algorithms to solve
fixed-point equations. We then develop a Lyapunov analysis and derive
mean-square error bounds on the convergence of the Markovian SA. Based on this
result, we establish finite-sample mean-square convergence bounds for
asynchronous RL algorithms such as $Q$-learning, $n$-step TD, TD$(\lambda)$,
and off-policy TD algorithms including V-trace. As a by-product, by analyzing
the convergence bounds of $n$-step TD and TD$(\lambda)$, we provide theoretical
insights into the bias-variance trade-off, i.e., efficiency of bootstrapping in
RL. This was first posed as an open problem in (Sutton, 1999).
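To make the Markovian SA viewpoint concrete, the following is a minimal Python sketch (not the paper's code) of asynchronous $Q$-learning written as a stochastic approximation update toward the fixed point of the Bellman optimality equation $Q = \mathcal{T}(Q)$; the tabular environment interface (env.reset, env.step) and the $\epsilon$-greedy behavior policy are illustrative assumptions rather than part of the paper's framework.

```python
# Minimal sketch: asynchronous Q-learning as a Markovian stochastic
# approximation (SA) scheme for the fixed-point equation Q = T(Q), where T is
# the Bellman optimality operator. The environment API (env.reset, env.step)
# and the epsilon-greedy behavior policy are illustrative assumptions.
import numpy as np

def async_q_learning(env, num_states, num_actions, gamma=0.99,
                     step_size=0.1, num_iters=100_000, epsilon=0.1, seed=0):
    """Run asynchronous Q-learning along a single Markovian trajectory.

    Only the (s, a) entry visited at each iteration is updated; this partial,
    trajectory-driven update is the "asynchronous" feature the SA analysis handles.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))
    s = env.reset()
    for _ in range(num_iters):
        # Behavior policy: epsilon-greedy w.r.t. the current iterate
        # (one possible choice; the analysis mainly needs sufficient exploration).
        if rng.random() < epsilon:
            a = int(rng.integers(num_actions))
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done = env.step(a)  # assumed to return (next_state, reward, done)
        # SA update: Q(s,a) <- Q(s,a) + alpha * (F(Q; s,a,r,s') - Q(s,a)),
        # where F is the sampled Bellman optimality operator at this transition.
        target = r + (0.0 if done else gamma * np.max(Q[s_next]))
        Q[s, a] += step_size * (target - Q[s, a])
        s = env.reset() if done else s_next
    return Q
```

Replacing the one-step target with an $n$-step or $\lambda$-weighted return (and evaluating a fixed policy rather than maximizing) yields the $n$-step TD and TD$(\lambda)$ variants whose bias-variance trade-off the abstract refers to.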
Related papers
- Uncertainty quantification for Markov chains with application to temporal difference learning [63.49764856675643]
We develop novel high-dimensional concentration inequalities and Berry-Esseen bounds for vector- and matrix-valued functions of Markov chains.
We analyze the TD learning algorithm, a widely used method for policy evaluation in reinforcement learning.
arXiv Detail & Related papers (2025-02-19T15:33:55Z)
- Analysis of Off-Policy $n$-Step TD-Learning with Linear Function Approximation [6.663174194579773]
This paper analyzes multi-step temporal difference (TD)-learning algorithms within the "deadly triad" scenario.
In particular, we prove that $n$-step TD-learning algorithms converge to a solution as the sampling horizon $n$ increases sufficiently.
Two $n$-step TD-learning algorithms are proposed and analyzed, which can be seen as the model-free reinforcement learning counterparts of the model-based deterministic algorithms.
arXiv Detail & Related papers (2025-02-13T03:43:13Z)
- Finite Time Analysis of Temporal Difference Learning for Mean-Variance in a Discounted MDP [1.0923877073891446]
We consider the problem of policy evaluation for variance in a discounted reward Markov decision process.
For this problem, a temporal difference (TD) type learning algorithm with linear function approximation (LFA) exists in the literature.
We derive finite sample bounds that hold (i) in the mean-squared sense; and (ii) with high probability, when tail iterate averaging is employed.
arXiv Detail & Related papers (2024-06-12T05:49:53Z)
- Improving Sample Efficiency of Model-Free Algorithms for Zero-Sum Markov Games [66.2085181793014]
We show that a model-free stage-based Q-learning algorithm achieves the same optimal dependence on $H$ as model-based algorithms.
Our algorithm features a key novel design of updating the reference value functions as the pair of optimistic and pessimistic value functions.
arXiv Detail & Related papers (2023-08-17T08:34:58Z)
- Online Learning with Adversaries: A Differential-Inclusion Analysis [52.43460995467893]
We introduce an observation-matrix-based framework for fully asynchronous online Federated Learning with adversaries.
Our main result is that the proposed algorithm almost surely converges to the desired mean $\mu$.
We derive this convergence using a novel differential-inclusion-based two-timescale analysis.
arXiv Detail & Related papers (2023-04-04T04:32:29Z)
- Finite-Time Analysis for Double Q-learning [50.50058000948908]
We provide the first non-asymptotic, finite-time analysis for double Q-learning.
We show that both synchronous and asynchronous double Q-learning are guaranteed to converge to an $\epsilon$-accurate neighborhood of the global optimum.
arXiv Detail & Related papers (2020-09-29T18:48:21Z)
- Non-asymptotic Convergence of Adam-type Reinforcement Learning Algorithms under Markovian Sampling [56.394284787780364]
This paper provides the first theoretical convergence analysis for two fundamental RL algorithms: policy gradient (PG) and temporal difference (TD) learning.
Under general nonlinear function approximation, PG-AMSGrad with a constant stepsize converges to a neighborhood of a stationary point at the rate of $\mathcal{O}(\log T/\sqrt{T})$.
Under linear function approximation, TD-AMSGrad with a constant stepsize converges to a neighborhood of the global optimum at the rate of $\mathcal{O}(\log T/\sqrt{T})$.
arXiv Detail & Related papers (2020-02-15T00:26:49Z)