Sample Complexity Bounds for Two Timescale Value-based Reinforcement
Learning Algorithms
- URL: http://arxiv.org/abs/2011.05053v1
- Date: Tue, 10 Nov 2020 11:36:30 GMT
- Title: Sample Complexity Bounds for Two Timescale Value-based Reinforcement
Learning Algorithms
- Authors: Tengyu Xu, Yingbin Liang
- Abstract summary: Two timescale stochastic approximation (SA) has been widely used in value-based reinforcement learning algorithms.
We study the non-asymptotic convergence rate of two timescale linear and nonlinear TDC and Greedy-GQ algorithms.
- Score: 65.09383385484007
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Two timescale stochastic approximation (SA) has been widely used in
value-based reinforcement learning algorithms. In the policy evaluation
setting, it can model the linear and nonlinear temporal difference learning
with gradient correction (TDC) algorithms as linear SA and nonlinear SA,
respectively. In the policy optimization setting, two timescale nonlinear SA
can also model the greedy gradient-Q (Greedy-GQ) algorithm. In previous
studies, the non-asymptotic convergence of linear TDC and Greedy-GQ has been
analyzed in the Markovian setting, but only with diminishing or
accuracy-dependent stepsizes. For the nonlinear TDC algorithm, only asymptotic convergence has
been established. In this paper, we study the non-asymptotic convergence rate
of two timescale linear and nonlinear TDC and Greedy-GQ under Markovian
sampling and with accuracy-independent constant stepsize. For linear TDC, we
provide a novel non-asymptotic analysis and show that it attains an
$\epsilon$-accurate solution with the optimal sample complexity of
$\mathcal{O}(\epsilon^{-1}\log(1/\epsilon))$ under a constant stepsize. For
nonlinear TDC and Greedy-GQ, we show that both algorithms attain an
$\epsilon$-accurate stationary solution with sample complexity
$\mathcal{O}(\epsilon^{-2})$. This is the first non-asymptotic convergence result
established for nonlinear TDC under Markovian sampling, and our result for
Greedy-GQ improves on the previous result order-wise by a factor of
$\mathcal{O}(\epsilon^{-1}\log(1/\epsilon))$.
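To make the two timescale structure concrete, below is a minimal Python sketch of the linear TDC update with accuracy-independent constant stepsizes, the policy-evaluation setting analyzed in the abstract above. The transition format, feature dimension, and stepsize values are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def linear_tdc(transitions, dim, alpha=0.01, beta=0.05, gamma=0.99):
    """Two timescale linear TDC for policy evaluation with constant stepsizes.

    Illustrative sketch only (not the paper's code). `transitions` is any
    iterable of Markovian samples (phi_s, reward, phi_s_next, done), where
    phi_s and phi_s_next are feature vectors of length `dim`. theta is the
    slow iterate (value weights, stepsize alpha); w is the fast iterate
    (gradient-correction weights, stepsize beta > alpha).
    """
    theta = np.zeros(dim)
    w = np.zeros(dim)
    for x, r, x_next, done in transitions:
        # TD error under the current value estimate theta.
        delta = r + (0.0 if done else gamma * theta @ x_next) - theta @ x
        # Slow timescale: TD step plus the gradient-correction term.
        theta = theta + alpha * (delta * x - gamma * (x @ w) * x_next)
        # Fast timescale: track the projection of the TD error onto features.
        w = w + beta * (delta - x @ w) * x
    return theta, w
```

The constant stepsizes alpha and beta here do not depend on the target accuracy $\epsilon$, which is the regime analyzed in the paper; the specific values above are placeholders.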
Related papers
- Gauss-Newton Temporal Difference Learning with Nonlinear Function Approximation [11.925232472331494]
A Gauss-Newton Temporal Difference (GNTD) learning method is proposed to solve the Q-learning problem with nonlinear function approximation.
In each iteration, the method takes one Gauss-Newton (GN) step to optimize a variant of the Mean-Squared Bellman Error (MSBE).
We validate our method via extensive experiments in several RL benchmarks, where GNTD exhibits both higher rewards and faster convergence than TD-type methods.
arXiv Detail & Related papers (2023-02-25T14:14:01Z)
- Finite-Time Error Bounds for Greedy-GQ [20.51105692499517]
We establish finite-time error bounds showing that the Greedy-GQ algorithm converges quickly.
Our analysis also indicates how step-sizes can be chosen for faster convergence; an illustrative Greedy-GQ update sketch is given after this list.
arXiv Detail & Related papers (2022-09-06T15:04:57Z)
- Stationary Behavior of Constant Stepsize SGD Type Algorithms: An Asymptotic Characterization [4.932130498861987]
We study the behavior of the appropriately scaled stationary distribution in the limit as the constant stepsize goes to zero.
We show that the limiting scaled stationary distribution is a solution of an integral equation.
arXiv Detail & Related papers (2021-11-11T17:39:50Z)
- Finite-Sample Analysis for Two Time-scale Non-linear TDC with General Smooth Function Approximation [27.149240954363496]
We develop novel techniques to explicitly characterize the finite-sample error bound for the general off-policy setting.
Our approach can be applied to a wide range of TD-based learning algorithms with general smooth function approximation.
arXiv Detail & Related papers (2021-04-07T00:34:11Z)
- Variance-Reduced Off-Policy TDC Learning: Non-Asymptotic Convergence Analysis [27.679514676804057]
We develop a variance reduction scheme for the two time-scale TDC algorithm in the off-policy setting.
Experiments demonstrate that the proposed variance-reduced TDC achieves a smaller convergence error than both the conventional TDC and the variance-reduced TD.
arXiv Detail & Related papers (2020-10-26T01:33:05Z)
- Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only one data point each step.
Our results are expressed in the form of simultaneous primal- and dual-side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z)
- Non-asymptotic Convergence Analysis of Two Time-scale (Natural) Actor-Critic Algorithms [58.57004511121862]
Actor-critic (AC) and natural actor-critic (NAC) algorithms are often executed in two ways for finding optimal policies.
We show that two time-scale AC requires the overall sample complexity at the order of $\mathcal{O}(\epsilon^{-2.5}\log^3(\epsilon^{-1}))$ to attain an $\epsilon$-accurate stationary point.
We develop novel techniques for bounding the bias error of the actor due to dynamically changing Markovian sampling.
arXiv Detail & Related papers (2020-05-07T15:42:31Z)
- Second-order Conditional Gradient Sliding [79.66739383117232]
We present the Second-Order Conditional Gradient Sliding (SOCGS) algorithm.
The SOCGS algorithm converges quadratically in primal gap after a finite number of linearly convergent iterations.
It is useful when the feasible region can only be accessed efficiently through a linear optimization oracle.
arXiv Detail & Related papers (2020-02-20T17:52:18Z)
- Non-asymptotic Convergence of Adam-type Reinforcement Learning Algorithms under Markovian Sampling [56.394284787780364]
This paper provides the first theoretical convergence analysis for two fundamental RL algorithms of policy gradient (PG) and temporal difference (TD) learning.
Under general nonlinear function approximation, PG-AMSGrad with a constant stepsize converges to a neighborhood of a stationary point at the rate of $\mathcal{O}(\log T/\sqrt{T})$.
Under linear function approximation, TD-AMSGrad with a constant stepsize converges to a neighborhood of the global optimum at the rate of $\mathcal{O}(\log T/\sqrt{T})$.
arXiv Detail & Related papers (2020-02-15T00:26:49Z)
- Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper, we analyze a variant of an optimistic adaptive gradient algorithm for nonconcave min-max problems.
Our experiments show that the advantage of adaptive over non-adaptive gradient algorithms in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
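As referenced in the Greedy-GQ entry above, the following is a similarly hedged Python sketch of the two timescale Greedy-GQ update with linear state-action features and a greedy target policy, the control-setting counterpart of the TDC sketch given after the main abstract. The feature map `phi_sa`, the transition format, and the stepsize values are illustrative assumptions, not taken from any of the listed papers.

```python
import numpy as np

def greedy_gq(transitions, phi_sa, num_actions, dim,
              alpha=0.01, beta=0.05, gamma=0.99):
    """Two timescale Greedy-GQ with linear features (illustrative sketch).

    `transitions` yields Markovian samples (s, a, r, s_next, done) collected
    by a behavior policy; `phi_sa(s, a)` returns a state-action feature vector
    of length `dim`. theta (stepsize alpha) is the slow iterate and w
    (stepsize beta > alpha) the fast gradient-correction iterate.
    """
    theta = np.zeros(dim)
    w = np.zeros(dim)
    for s, a, r, s_next, done in transitions:
        x = phi_sa(s, a)
        if done:
            x_next = np.zeros(dim)
            delta = r - theta @ x
        else:
            # Greedy target policy: bootstrap from the maximizing next action.
            q_next = [theta @ phi_sa(s_next, b) for b in range(num_actions)]
            a_star = int(np.argmax(q_next))
            x_next = phi_sa(s_next, a_star)
            delta = r + gamma * q_next[a_star] - theta @ x
        # Slow timescale: Q-learning step with gradient correction.
        theta = theta + alpha * (delta * x - gamma * (x @ w) * x_next)
        # Fast timescale: track the projected TD error.
        w = w + beta * (delta - x @ w) * x
    return theta, w
```

As in the TDC sketch, the fast iterate w uses the larger constant stepsize, reflecting the two timescale separation assumed in the analyses above.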