A Lyapunov Theory for Finite-Sample Guarantees of Asynchronous
Q-Learning and TD-Learning Variants
- URL: http://arxiv.org/abs/2102.01567v4
- Date: Mon, 4 Sep 2023 20:15:15 GMT
- Title: A Lyapunov Theory for Finite-Sample Guarantees of Asynchronous
Q-Learning and TD-Learning Variants
- Authors: Zaiwei Chen, Siva Theja Maguluri, Sanjay Shakkottai, and Karthikeyan
Shanmugam
- Abstract summary: This paper develops a framework to study finite-sample convergence guarantees of a class of value-based asynchronous RL algorithms.
As a by-product, we provide theoretical insights into the bias-variance trade-off, i.e., efficiency of bootstrapping in RL.
- Score: 39.28675942566465
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper develops a unified framework to study finite-sample convergence
guarantees of a large class of value-based asynchronous reinforcement learning
(RL) algorithms. We do this by first reformulating the RL algorithms as
\textit{Markovian Stochastic Approximation} (SA) algorithms to solve
fixed-point equations. We then develop a Lyapunov analysis and derive
mean-square error bounds on the convergence of the Markovian SA. Based on this
result, we establish finite-sample mean-square convergence bounds for
asynchronous RL algorithms such as $Q$-learning, $n$-step TD, TD$(\lambda)$,
and off-policy TD algorithms including V-trace. As a by-product, by analyzing
the convergence bounds of $n$-step TD and TD$(\lambda)$, we provide theoretical
insights into the bias-variance trade-off, i.e., efficiency of bootstrapping in
RL. This was first posed as an open problem in (Sutton, 1999).
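To make the Markovian SA viewpoint concrete, the following is a minimal Python sketch (not the paper's code) of asynchronous $Q$-learning written as a stochastic approximation update toward the fixed point of the Bellman optimality equation $Q = \mathcal{T}(Q)$; the tabular environment interface (env.reset, env.step) and the $\epsilon$-greedy behavior policy are illustrative assumptions rather than part of the paper's framework.

```python
# Minimal sketch: asynchronous Q-learning as a Markovian stochastic
# approximation (SA) scheme for the fixed-point equation Q = T(Q), where T is
# the Bellman optimality operator. The environment API (env.reset, env.step)
# and the epsilon-greedy behavior policy are illustrative assumptions.
import numpy as np

def async_q_learning(env, num_states, num_actions, gamma=0.99,
                     step_size=0.1, num_iters=100_000, epsilon=0.1, seed=0):
    """Run asynchronous Q-learning along a single Markovian trajectory.

    Only the (s, a) entry visited at each iteration is updated; this partial,
    trajectory-driven update is the "asynchronous" feature the SA analysis handles.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))
    s = env.reset()
    for _ in range(num_iters):
        # Behavior policy: epsilon-greedy w.r.t. the current iterate
        # (one possible choice; the analysis mainly needs sufficient exploration).
        if rng.random() < epsilon:
            a = int(rng.integers(num_actions))
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done = env.step(a)  # assumed to return (next_state, reward, done)
        # SA update: Q(s,a) <- Q(s,a) + alpha * (F(Q; s,a,r,s') - Q(s,a)),
        # where F is the sampled Bellman optimality operator at this transition.
        target = r + (0.0 if done else gamma * np.max(Q[s_next]))
        Q[s, a] += step_size * (target - Q[s, a])
        s = env.reset() if done else s_next
    return Q
```

Replacing the one-step target with an $n$-step or $\lambda$-weighted return (and evaluating a fixed policy rather than maximizing) yields the $n$-step TD and TD$(\lambda)$ variants whose bias-variance trade-off the abstract refers to.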
Related papers
- Uncertainty quantification for Markov chains with application to temporal difference learning [63.49764856675643]
We develop novel high-dimensional concentration inequalities and Berry-Esseen bounds for vector- and matrix-valued functions of Markov chains.
We analyze the TD learning algorithm, a widely used method for policy evaluation in reinforcement learning.
arXiv Detail & Related papers (2025-02-19T15:33:55Z)
- Analysis of Off-Policy $n$-Step TD-Learning with Linear Function Approximation [6.663174194579773]
This paper analyzes multi-step temporal difference (TD)-learning algorithms within the "deadly triad" scenario.
In particular, we prove that $n$-step TD-learning algorithms converge to a solution as the sampling horizon $n$ increases sufficiently.
Two $n$-step TD-learning algorithms are proposed and analyzed, which can be seen as the model-free reinforcement learning counterparts of the model-based deterministic algorithms.
arXiv Detail & Related papers (2025-02-13T03:43:13Z)
- Finite Time Analysis of Temporal Difference Learning for Mean-Variance in a Discounted MDP [1.0923877073891446]
We consider the problem of policy evaluation for variance in a discounted reward Markov decision process.
For this problem, a temporal difference (TD) type learning algorithm with linear function approximation (LFA) exists in the literature.
We derive finite sample bounds that hold (i) in the mean-squared sense; and (ii) with high probability, when tail iterate averaging is employed.
arXiv Detail & Related papers (2024-06-12T05:49:53Z)
- Improving Sample Efficiency of Model-Free Algorithms for Zero-Sum Markov Games [66.2085181793014]
We show that a model-free stage-based Q-learning algorithm achieves the same optimal dependence on $H$ as model-based algorithms.
Our algorithm features a key novel design of updating the reference value functions as the pair of optimistic and pessimistic value functions.
arXiv Detail & Related papers (2023-08-17T08:34:58Z)
- Online Learning with Adversaries: A Differential-Inclusion Analysis [52.43460995467893]
We introduce an observation-matrix-based framework for fully asynchronous online Federated Learning with adversaries.
Our main result is that the proposed algorithm almost surely converges to the desired mean $\mu$.
We derive this convergence using a novel differential-inclusion-based two-timescale analysis.
arXiv Detail & Related papers (2023-04-04T04:32:29Z)
- Finite-Time Analysis for Double Q-learning [50.50058000948908]
We provide the first non-asymptotic, finite-time analysis for double Q-learning.
We show that both synchronous and asynchronous double Q-learning are guaranteed to converge to an $\epsilon$-accurate neighborhood of the global optimum.
arXiv Detail & Related papers (2020-09-29T18:48:21Z)
- Non-asymptotic Convergence of Adam-type Reinforcement Learning Algorithms under Markovian Sampling [56.394284787780364]
This paper provides the first theoretical convergence analysis for two fundamental RL algorithms: policy gradient (PG) and temporal difference (TD) learning.
Under general nonlinear function approximation, PG-AMSGrad with a constant stepsize converges to a neighborhood of a stationary point at the rate of $\mathcal{O}(\log T/\sqrt{T})$.
Under linear function approximation, TD-AMSGrad with a constant stepsize converges to a neighborhood of the global optimum at the rate of $\mathcal{O}(\log T/\sqrt{T})$.
arXiv Detail & Related papers (2020-02-15T00:26:49Z)