Non-asymptotic Convergence Analysis of Two Time-scale (Natural)
Actor-Critic Algorithms
- URL: http://arxiv.org/abs/2005.03557v2
- Date: Fri, 8 May 2020 02:06:12 GMT
- Title: Non-asymptotic Convergence Analysis of Two Time-scale (Natural)
Actor-Critic Algorithms
- Authors: Tengyu Xu, Zhe Wang, Yingbin Liang
- Abstract summary: Actor-critic (AC) and natural actor-critic (NAC) algorithms are often executed in two ways for finding optimal policies.
We show that two time-scale AC requires an overall sample complexity of order $\mathcal{O}(\epsilon^{-2.5}\log^3(\epsilon^{-1}))$ to attain an $\epsilon$-accurate stationary point.
We develop novel techniques for bounding the bias error of the actor due to dynamically changing Markovian sampling.
- Score: 58.57004511121862
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As an important class of reinforcement learning algorithms, actor-critic (AC)
and natural actor-critic (NAC) algorithms are often executed in two ways to
find optimal policies. In the first, nested-loop design, one update of the
actor's policy is followed by an entire loop of the critic's updates of the
value function, and the finite-sample analysis of such AC and NAC algorithms
has recently been well established. The second, two time-scale design, in which
the actor and critic update simultaneously but with different learning rates,
has far fewer tuning parameters than the nested-loop design and is hence
substantially easier to implement. Although two time-scale AC and NAC have been
shown to converge in the literature, the finite-sample convergence rate has not
been established. In this paper, we provide the first such non-asymptotic
convergence rate for two time-scale AC and NAC under Markovian sampling and
with the actor using a general policy class approximation. We show that two
time-scale AC requires an overall sample complexity of order
$\mathcal{O}(\epsilon^{-2.5}\log^3(\epsilon^{-1}))$ to attain an
$\epsilon$-accurate stationary point, and two time-scale NAC requires an
overall sample complexity of order
$\mathcal{O}(\epsilon^{-4}\log^2(\epsilon^{-1}))$ to attain an
$\epsilon$-accurate globally optimal point. We develop novel techniques for
bounding the bias error of the actor due to dynamically changing Markovian
sampling and for analyzing the convergence rate of the linear critic with
dynamically changing base functions and transition kernel.
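To make the two time-scale design concrete, below is a minimal illustrative sketch, not the authors' exact algorithm: a single loop in which a linear critic (TD(0) with one-hot features) and a softmax actor are updated simultaneously on every sample from one Markovian trajectory, with the critic on a faster (larger) step size than the actor. The synthetic MDP, feature choice, and step-size exponents are hypothetical placeholders.

```python
import numpy as np

# Sketch of a two time-scale actor-critic loop (illustrative assumptions:
# random synthetic MDP, one-hot state features, step sizes 1/t^0.6 and 1/t^0.9).
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.95

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # transition kernel
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # reward table

theta = np.zeros((n_states, n_actions))  # softmax policy parameters (actor)
w = np.zeros(n_states)                   # linear critic weights (one-hot features)

def policy(s):
    logits = theta[s] - theta[s].max()
    probs = np.exp(logits)
    return probs / probs.sum()

s = rng.integers(n_states)
for t in range(1, 20001):
    beta = 1.0 / t ** 0.6   # critic step size: faster time scale (decays slower)
    alpha = 1.0 / t ** 0.9  # actor step size: slower time scale (decays faster)

    pi = policy(s)
    a = rng.choice(n_actions, p=pi)
    r, s_next = R[s, a], rng.choice(n_states, p=P[s, a])

    # Critic: TD(0) update of the linear value estimate V(s) = w[s].
    td_error = r + gamma * w[s_next] - w[s]
    w[s] += beta * td_error

    # Actor: policy-gradient step, using the TD error as an advantage estimate.
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta[s] += alpha * td_error * grad_log_pi

    s = s_next

print("critic value estimates:", np.round(w, 3))
```

The separation of time scales is the point of the sketch: since alpha/beta goes to zero, the critic effectively tracks the value function of the slowly changing policy, which is the structure whose finite-sample behavior the paper analyzes; everything else (the MDP, features, and constants) is arbitrary.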
Related papers
- Improved Sample Complexity for Global Convergence of Actor-Critic Algorithms [49.19842488693726]
We establish the global convergence of the actor-critic algorithm with a significantly improved sample complexity of $O(\epsilon^{-3})$.
Our findings provide theoretical support for many algorithms that rely on constant step sizes.
arXiv Detail & Related papers (2024-10-11T14:46:29Z)
- Non-Asymptotic Analysis for Single-Loop (Natural) Actor-Critic with Compatible Function Approximation [18.77565744533582]
Actor-critic (AC) is a powerful method for learning an optimal policy in reinforcement learning.
AC converges to an $\epsilon + \varepsilon_{\text{critic}}$ neighborhood of stationary points with the best known sample complexity.
This paper analyzes the convergence of both AC and NAC algorithms with compatible function approximation.
arXiv Detail & Related papers (2024-06-03T20:05:04Z)
- Two-Timescale Critic-Actor for Average Reward MDPs with Function Approximation [5.945710235932345]
We present the first two-timescale critic-actor algorithm with function approximation in the long-run average reward setting.
A notable feature of our analysis is that unlike recent single-timescale actor-critic algorithms, we present a complete convergence analysis of our scheme.
arXiv Detail & Related papers (2024-02-02T12:48:49Z) - Finite-Time Analysis of Fully Decentralized Single-Timescale
Actor-Critic [4.94128206910124]
We introduce a fully decentralized Actor-Critic (AC) algorithm, where actor, critic, and global reward estimator are updated in an alternating manner.
We show that our algorithm has sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2})$ under Markovian sampling.
We also provide a local action privacy-preserving version of our algorithm and its analysis.
arXiv Detail & Related papers (2022-06-12T13:14:14Z)
- Doubly Robust Off-Policy Actor-Critic: Convergence and Optimality [131.45028999325797]
We develop a doubly robust off-policy AC (DR-Off-PAC) for discounted MDP.
DR-Off-PAC adopts a single timescale structure, in which both actor and critics are updated simultaneously with constant stepsize.
We study the finite-time convergence rate and characterize the sample complexity for DR-Off-PAC to attain an $\epsilon$-accurate optimal policy.
arXiv Detail & Related papers (2021-02-23T18:56:13Z)
- Variance-Reduced Off-Policy TDC Learning: Non-Asymptotic Convergence Analysis [27.679514676804057]
We develop a variance reduction scheme for the two time-scale TDC algorithm in the off-policy setting.
Experiments demonstrate that the proposed variance-reduced TDC achieves a smaller convergence error than both the conventional TDC and the variance-reduced TD.
arXiv Detail & Related papers (2020-10-26T01:33:05Z)
- Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only one data point each step.
Our results are expressed in a form of simultaneous primal and dual side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z)
- A Finite Time Analysis of Two Time-Scale Actor Critic Methods [87.69128666220016]
We provide a non-asymptotic analysis for two time-scale actor-critic methods under non-i.i.d. setting.
We prove that the actor-critic method is guaranteed to find a first-order stationary point.
This is the first work providing finite-time analysis and sample complexity bound for two time-scale actor-critic methods.
arXiv Detail & Related papers (2020-05-04T09:45:18Z)
- Improving Sample Complexity Bounds for (Natural) Actor-Critic Algorithms [58.57004511121862]
This paper characterizes the convergence rate and sample complexity of AC and NAC under Markovian sampling.
We show that AC and NAC attain orderwise performance improvement over PG and NPG under infinite horizon due to the incorporation of critic.
arXiv Detail & Related papers (2020-04-27T17:11:06Z)