A Lyapunov Theory for Finite-Sample Guarantees of Asynchronous
Q-Learning and TD-Learning Variants
- URL: http://arxiv.org/abs/2102.01567v4
- Date: Mon, 4 Sep 2023 20:15:15 GMT
- Title: A Lyapunov Theory for Finite-Sample Guarantees of Asynchronous
Q-Learning and TD-Learning Variants
- Authors: Zaiwei Chen, Siva Theja Maguluri, Sanjay Shakkottai, and Karthikeyan
Shanmugam
- Abstract summary: This paper develops a framework to study finite-sample convergence guarantees of a class of value-based asynchronous RL algorithms.
As a by-product, we provide theoretical insights into the bias-variance trade-off, i.e., efficiency of bootstrapping in RL.
- Score: 39.28675942566465
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper develops a unified framework to study finite-sample convergence
guarantees of a large class of value-based asynchronous reinforcement learning
(RL) algorithms. We do this by first reformulating the RL algorithms as
\textit{Markovian Stochastic Approximation} (SA) algorithms to solve
fixed-point equations. We then develop a Lyapunov analysis and derive
mean-square error bounds on the convergence of the Markovian SA. Based on this
result, we establish finite-sample mean-square convergence bounds for
asynchronous RL algorithms such as $Q$-learning, $n$-step TD, TD$(\lambda)$,
and off-policy TD algorithms including V-trace. As a by-product, by analyzing
the convergence bounds of $n$-step TD and TD$(\lambda)$, we provide theoretical
insights into the bias-variance trade-off, i.e., efficiency of bootstrapping in
RL. This was first posed as an open problem in (Sutton, 1999).
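As a concrete illustration of the stochastic-approximation viewpoint described in the abstract, below is a minimal sketch of tabular asynchronous Q-learning driven by a single Markovian trajectory, where only the visited state-action entry is updated at each step. The `env` object with its `reset()`/`step(s, a)` interface, the epsilon-greedy behavior policy, and the stepsize schedule are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def async_q_learning(env, num_states, num_actions, gamma=0.99,
                     num_steps=100_000, eps=0.1, seed=0):
    """Tabular asynchronous Q-learning along a single Markovian trajectory.

    Only the (s, a) entry visited at step t is updated, which is the
    Markovian stochastic-approximation view of Q-learning: the iterates
    are driven toward the fixed point of the Bellman optimality operator.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))
    s = env.reset()  # hypothetical environment interface
    for t in range(1, num_steps + 1):
        # epsilon-greedy behavior policy keeps the trajectory exploring
        if rng.random() < eps:
            a = int(rng.integers(num_actions))
        else:
            a = int(np.argmax(Q[s]))
        s_next, r = env.step(s, a)        # hypothetical: returns (next state, reward)
        alpha = 1.0 / (1.0 + t / 1000.0)  # one admissible diminishing stepsize
        # asynchronous update: only the visited (s, a) coordinate moves
        td_target = r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (td_target - Q[s, a])
        s = s_next
    return Q
```

The same template covers the other algorithms treated in the paper (n-step TD, TD$(\lambda)$, V-trace) by swapping the update target while keeping the asynchronous, single-trajectory structure.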
Related papers
- Finite Time Analysis of Temporal Difference Learning for Mean-Variance in a Discounted MDP [1.0923877073891446]
We consider the problem of policy evaluation for variance in a discounted reward Markov decision process.
For this problem, a temporal difference (TD) type learning algorithm with linear function approximation (LFA) exists in the literature.
We derive finite sample bounds that hold (i) in the mean-squared sense; and (ii) with high probability, when tail iterate averaging is employed.
arXiv Detail & Related papers (2024-06-12T05:49:53Z) - Improving Sample Efficiency of Model-Free Algorithms for Zero-Sum Markov Games [66.2085181793014]
We show that a model-free stage-based Q-learning algorithm can enjoy the same optimality in the $H$ dependence as model-based algorithms.
Our algorithm features a key novel design of updating the reference value functions as the pair of optimistic and pessimistic value functions.
arXiv Detail & Related papers (2023-08-17T08:34:58Z) - Online Learning with Adversaries: A Differential-Inclusion Analysis [52.43460995467893]
We introduce an observation-matrix-based framework for fully asynchronous online Federated Learning with adversaries.
Our main result is that the proposed algorithm almost surely converges to the desired mean $\mu$.
We derive this convergence using a novel differential-inclusion-based two-timescale analysis.
arXiv Detail & Related papers (2023-04-04T04:32:29Z) - Model-Based Reinforcement Learning with Multinomial Logistic Function
Approximation [12.36108042107798]
We study model-based reinforcement learning for episodic Markov decision processes.
We establish a provably efficient RL algorithm for the MDP whose state transition is given by a multinomial logistic model.
We show that our proposed algorithm consistently outperforms the existing methods.
arXiv Detail & Related papers (2022-12-27T16:25:09Z) - Theoretical Guarantees of Fictitious Discount Algorithms for Episodic
Reinforcement Learning and Global Convergence of Policy Gradient Methods [6.7546872379126155]
A common approach is to introduce a fictitious discount factor and use stationary policies for approximations.
This paper takes the first step towards analyzing these algorithms.
Non-asymptotic convergence guarantees are established for both algorithms.
arXiv Detail & Related papers (2021-09-13T23:36:38Z) - Stochastic Hard Thresholding Algorithms for AUC Maximization [49.00683387735522]
We develop a hard thresholding algorithm for AUC maximization in distributed classification.
We conduct experiments to show the efficiency and effectiveness of the proposed algorithms.
arXiv Detail & Related papers (2020-11-04T16:49:29Z) - Finite-Time Analysis for Double Q-learning [50.50058000948908]
We provide the first non-asymptotic, finite-time analysis for double Q-learning.
We show that both synchronous and asynchronous double Q-learning are guaranteed to converge to an $\epsilon$-accurate neighborhood of the global optimum.
arXiv Detail & Related papers (2020-09-29T18:48:21Z) - Non-asymptotic Convergence of Adam-type Reinforcement Learning
Algorithms under Markovian Sampling [56.394284787780364]
This paper provides the first theoretical convergence analysis for two fundamental RL algorithms of policy gradient (PG) and temporal difference (TD) learning.
Under general nonlinear function approximation, PG-AMSGrad with a constant stepsize converges to a neighborhood of a stationary point at the rate of $\mathcal{O}(\log T/\sqrt{T})$.
Under linear function approximation, TD-AMSGrad with a constant stepsize converges to a neighborhood of the global optimum at the rate of $\mathcal{O}(\log T/\sqrt{T})$.
arXiv Detail & Related papers (2020-02-15T00:26:49Z)