Improving Sample Complexity Bounds for (Natural) Actor-Critic Algorithms
- URL: http://arxiv.org/abs/2004.12956v4
- Date: Fri, 12 Feb 2021 01:00:43 GMT
- Title: Improving Sample Complexity Bounds for (Natural) Actor-Critic Algorithms
- Authors: Tengyu Xu, Zhe Wang, Yingbin Liang
- Abstract summary: This paper characterizes the convergence rate and sample complexity of AC and NAC under Markovian sampling.
We show that AC and NAC attain an orderwise performance improvement over PG and NPG in the infinite-horizon setting due to the incorporation of the critic.
- Score: 58.57004511121862
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The actor-critic (AC) algorithm is a popular method to find an optimal policy
in reinforcement learning. In the infinite horizon scenario, the finite-sample
convergence rate for the AC and natural actor-critic (NAC) algorithms has been
established recently, but under independent and identically distributed
(i.i.d.) sampling and single-sample update at each iteration. In contrast, this
paper characterizes the convergence rate and sample complexity of AC and NAC
under Markovian sampling, with mini-batch data for each iteration, and with the
actor using a general policy class approximation. We show that the overall
sample complexity for a mini-batch AC to attain an $\epsilon$-accurate
stationary point improves the best known sample complexity of AC by an order of
$\mathcal{O}(\epsilon^{-1}\log(1/\epsilon))$, and the overall sample complexity
for a mini-batch NAC to attain an $\epsilon$-accurate globally optimal point
improves the existing sample complexity of NAC by an order of
$\mathcal{O}(\epsilon^{-1}/\log(1/\epsilon))$. Moreover, the sample complexity
of AC and NAC characterized in this work outperforms that of policy gradient
(PG) and natural policy gradient (NPG) by a factor of
$\mathcal{O}((1-\gamma)^{-3})$ and
$\mathcal{O}((1-\gamma)^{-4}\epsilon^{-1}/\log(1/\epsilon))$, respectively.
This is the first theoretical study establishing that AC and NAC attain an
orderwise performance improvement over PG and NPG in the infinite-horizon
setting due to the incorporation of the critic.
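The abstract above describes the mini-batch AC loop analyzed in the paper: at each iteration, a mini-batch of consecutive transitions is collected from the ongoing (Markovian) trajectory, the critic is updated by TD learning on that batch, and the actor takes a policy-gradient step using the critic's value estimates. The following is a minimal tabular sketch of such a loop; the toy MDP, step sizes, batch size, and update rules are illustrative assumptions, not the paper's exact algorithm or constants.

```python
import numpy as np

# Minimal sketch of mini-batch actor-critic (AC) under Markovian sampling.
# The 3-state / 2-action MDP, step sizes, and batch size are illustrative
# assumptions, not the paper's construction or constants.

rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = distribution over next states
R = rng.uniform(size=(S, A))                 # reward table

theta = np.zeros((S, A))                     # actor: softmax policy parameters
V = np.zeros(S)                              # critic: tabular value estimates
alpha, beta, batch = 0.01, 0.05, 32          # actor step size, critic step size, mini-batch size

def policy(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

s = 0                                        # one continuing trajectory: Markovian, no i.i.d. restarts
for it in range(500):
    # collect a mini-batch of consecutive transitions from the current trajectory
    traj = []
    for _ in range(batch):
        a = rng.choice(A, p=policy(s))
        s_next = rng.choice(S, p=P[s, a])
        traj.append((s, a, R[s, a], s_next))
        s = s_next

    # critic update: averaged TD(0) errors over the mini-batch
    td_sum, counts = np.zeros(S), np.zeros(S)
    for (si, ai, r, sn) in traj:
        td_sum[si] += r + gamma * V[sn] - V[si]
        counts[si] += 1
    V += beta * td_sum / np.maximum(counts, 1)

    # actor update: mini-batch policy-gradient step with the TD error as advantage estimate
    grad = np.zeros_like(theta)
    for (si, ai, r, sn) in traj:
        adv = r + gamma * V[sn] - V[si]
        g = -policy(si)                      # gradient of log-softmax w.r.t. theta[si]
        g[ai] += 1.0
        grad[si] += adv * g
    theta += alpha * grad / batch
```

For the NAC variant discussed in the abstract, the actor step would additionally precondition the policy gradient with an estimate of the Fisher information matrix (the natural gradient), which is the ingredient behind the global-optimality guarantee, as opposed to the stationary-point guarantee for AC.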
Related papers
- Improved Sample Complexity for Global Convergence of Actor-Critic Algorithms [49.19842488693726]
We establish the global convergence of the actor-critic algorithm with a significantly improved sample complexity of $O(\epsilon^{-3})$.
Our findings provide theoretical support for many algorithms that rely on constant step sizes.
arXiv Detail & Related papers (2024-10-11T14:46:29Z)
- Non-Asymptotic Analysis for Single-Loop (Natural) Actor-Critic with Compatible Function Approximation [18.77565744533582]
Actor-critic (AC) is a powerful method for learning an optimal policy in reinforcement learning.
AC converges to an $\epsilon + \varepsilon_{\text{critic}}$ neighborhood of stationary points with the best known sample complexity.
This paper analyzes the convergence of both AC and NAC algorithms with compatible function approximation.
arXiv Detail & Related papers (2024-06-03T20:05:04Z)
- Improved Sample Complexity Analysis of Natural Policy Gradient Algorithm with General Parameterization for Infinite Horizon Discounted Reward Markov Decision Processes [41.61653528766776]
We propose the Accelerated Natural Policy Gradient (ANPG) algorithm, which utilizes an accelerated gradient descent procedure to obtain the natural policy gradient.
ANPG achieves $\mathcal{O}(\epsilon^{-2})$ sample complexity and $\mathcal{O}(\epsilon^{-1})$ iteration complexity.
In the class of Hessian-free and IS-free algorithms, ANPG beats the best-known sample complexity by a factor of $\mathcal{O}(\epsilon^{-\frac{1}{2}})$ and simultaneously matches their state-of-the-art iteration complexity.
arXiv Detail & Related papers (2023-10-18T03:00:15Z)
- Finite-Time Analysis of Fully Decentralized Single-Timescale Actor-Critic [4.94128206910124]
We introduce a fully decentralized Actor-Critic (AC) algorithm, where actor, critic, and global reward estimator are updated in an alternating manner.
We show that our algorithm has sample complexity of $\tilde{\mathcal{O}}(\epsilon^{-2})$ under Markovian sampling.
We also provide a local action privacy-preserving version of our algorithm and its analysis.
arXiv Detail & Related papers (2022-06-12T13:14:14Z)
- Sample Complexity Bounds for Two Timescale Value-based Reinforcement Learning Algorithms [65.09383385484007]
Two timescale stochastic approximation (SA) has been widely used in value-based reinforcement learning algorithms.
We study the non-asymptotic convergence rate of two timescale linear and nonlinear TDC and Greedy-GQ algorithms.
arXiv Detail & Related papers (2020-11-10T11:36:30Z)
- Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction [63.41789556777387]
Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP).
We show that the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function is at most on the order of $\frac{1}{\mu_{\min}(1-\gamma)^{5}\varepsilon^{2}} + \frac{t_{\mathrm{mix}}}{\mu_{\min}(1-\gamma)}$ up to some logarithmic factor; a minimal sketch of the asynchronous Q-learning update appears after this list.
arXiv Detail & Related papers (2020-06-04T17:51:00Z)
- Non-asymptotic Convergence Analysis of Two Time-scale (Natural) Actor-Critic Algorithms [58.57004511121862]
Actor-critic (AC) and natural actor-critic (NAC) algorithms are often executed in two ways for finding optimal policies.
We show that two time-scale AC requires the overall sample complexity at the order of $\mathcal{O}(\epsilon^{-2.5}\log^{3}(\epsilon^{-1}))$ to attain an $\epsilon$-accurate stationary point.
We develop novel techniques for bounding the bias error of the actor due to dynamically changing Markovian sampling.
arXiv Detail & Related papers (2020-05-07T15:42:31Z)
- Non-asymptotic Convergence of Adam-type Reinforcement Learning Algorithms under Markovian Sampling [56.394284787780364]
This paper provides the first theoretical convergence analysis for Adam-type variants of two fundamental RL algorithms: policy gradient (PG) and temporal difference (TD) learning.
Under general nonlinear function approximation, PG-AMSGrad with a constant stepsize converges to a neighborhood of a stationary point at the rate of $\mathcal{O}(\log T/\sqrt{T})$.
Under linear function approximation, TD-AMSGrad with a constant stepsize converges to a neighborhood of the global optimum at the rate of $\mathcal{O}(\log T/\sqrt{T})$.
arXiv Detail & Related papers (2020-02-15T00:26:49Z)
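As referenced in the asynchronous Q-learning entry above, that algorithm learns the optimal Q-function from a single Markovian trajectory, updating only the state-action pair actually visited at each step. The sketch below is a minimal tabular illustration of that update rule; the toy MDP, uniform behavior policy, and constant step size are illustrative assumptions, and it omits the variance-reduction component analyzed in the paper.

```python
import numpy as np

# Minimal sketch of asynchronous Q-learning along a single Markovian trajectory.
# The toy MDP, behavior policy, and step size are illustrative assumptions.

rng = np.random.default_rng(1)
S, A, gamma, eta = 4, 2, 0.9, 0.1
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = distribution over next states
R = rng.uniform(size=(S, A))                 # reward table

Q = np.zeros((S, A))
s = 0
for t in range(20000):
    a = rng.integers(A)                      # uniform behavior policy (assumed for illustration)
    s_next = rng.choice(S, p=P[s, a])
    # asynchronous update: only the visited (s, a) entry changes at this step
    Q[s, a] += eta * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next                               # continue the same trajectory (Markovian sampling)

greedy_policy = Q.argmax(axis=1)             # greedy policy derived from the learned Q-function
```

The $\mu_{\min}$ and $t_{\mathrm{mix}}$ factors in the quoted bound reflect, respectively, how rarely the least-visited state-action pair is hit by the behavior policy and how long the trajectory takes to mix.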