Finite-Sample Analysis of Off-Policy Natural Actor-Critic with Linear
Function Approximation
- URL: http://arxiv.org/abs/2105.12540v1
- Date: Wed, 26 May 2021 13:35:42 GMT
- Title: Finite-Sample Analysis of Off-Policy Natural Actor-Critic with Linear
Function Approximation
- Authors: Zaiwei Chen, Sajad Khodadadian, Siva Theja Maguluri
- Abstract summary: We develop a novel variant of off-policy natural actor-critic algorithm with linear function approximation.
We establish a sample complexity of $\mathcal{O}(\epsilon^{-3})$, outperforming all the previously known convergence bounds of such algorithms.
- Score: 5.543220407902113
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we develop a novel variant of off-policy natural actor-critic
algorithm with linear function approximation and we establish a sample
complexity of $\mathcal{O}(\epsilon^{-3})$, outperforming all the previously
known convergence bounds of such algorithms. In order to overcome the divergence due to the deadly triad in off-policy policy evaluation under function approximation, we develop a critic that employs an $n$-step TD-learning algorithm with a properly chosen $n$. We present finite-sample convergence bounds on this
critic under both constant and diminishing step sizes, which are of independent
interest. Furthermore, we develop a variant of natural policy gradient under
function approximation, with an improved convergence rate of $\mathcal{O}(1/T)$
after $T$ iterations. Combining the finite-sample error bounds of the actor and the critic, we obtain the $\mathcal{O}(\epsilon^{-3})$ sample complexity. We derive
our sample complexity bounds solely based on the assumption that the behavior
policy sufficiently explores all states and actions, which is a much milder assumption than those made in the related literature.
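The abstract describes two algorithmic components: an $n$-step off-policy TD critic with linear function approximation and a natural policy gradient (NPG) actor. The Python sketch below is a rough illustration under stated assumptions, not the paper's algorithm: it pairs a generic $n$-step TD critic with per-decision importance-sampling corrections and linear features with an NPG-style update for a linear softmax policy. The function names, the expected-value bootstrap, and all hyperparameters are illustrative; the paper's specific choice of $n$, correction scheme, and step-size schedules are not reproduced here.

```python
import numpy as np

def softmax_policy(theta, phi, state, actions):
    """Linear softmax policy: pi_theta(a|s) proportional to exp(theta . phi(s, a))."""
    prefs = np.array([theta @ phi(state, a) for a in actions])
    prefs -= prefs.max()                      # numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def n_step_td_critic(traj, phi, pi, mu, actions, gamma, n, alpha, w):
    """One pass of a generic n-step off-policy TD critic with linear features
    and per-decision importance-sampling corrections (illustrative only).

    traj: list of (state, action, reward) tuples collected under the behavior
    policy; pi(a, s) and mu(a, s) return target / behavior action probabilities.
    """
    for t in range(len(traj) - n):
        s_t, a_t, _ = traj[t]
        G, rho, disc = 0.0, 1.0, 1.0
        for k in range(n):
            s_k, a_k, r_k = traj[t + k]
            if k > 0:                         # the first action is given, so no ratio
                rho *= pi(a_k, s_k) / mu(a_k, s_k)
            G += disc * rho * r_k
            disc *= gamma
        # bootstrap with the expected value of Q_w under the target policy
        s_end = traj[t + n][0]
        v_bar = sum(pi(a, s_end) * float(w @ phi(s_end, a)) for a in actions)
        G += disc * rho * v_bar
        # semi-gradient TD update on the linear critic weights
        w = w + alpha * (G - w @ phi(s_t, a_t)) * phi(s_t, a_t)
    return w

def natural_actor_step(theta, w, beta):
    """NPG-style actor step for a linear softmax policy that shares features
    with the critic: adding a multiple of the critic weights reweights each
    action by exp(beta * Q_w(s, a)); action-independent shifts cancel in the
    softmax."""
    return theta + beta * w
```

A full run would alternate the two pieces: collect behavior-policy data, run one or more critic passes, then take an actor step, repeating for $T$ iterations.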
Related papers
- Improved Sample Complexity for Global Convergence of Actor-Critic Algorithms [49.19842488693726]
We establish the global convergence of the actor-critic algorithm with a significantly improved sample complexity of $O(\epsilon^{-3})$.
Our findings provide theoretical support for many algorithms that rely on constant step sizes.
arXiv Detail & Related papers (2024-10-11T14:46:29Z)
- Non-Asymptotic Analysis for Single-Loop (Natural) Actor-Critic with Compatible Function Approximation [18.77565744533582]
Actor-critic (AC) is a powerful method for learning an optimal policy in reinforcement learning.
AC converges to an $\epsilon + \varepsilon_{\text{critic}}$ neighborhood of stationary points with the best known sample complexity.
This paper analyzes the convergence of both AC and NAC algorithms with compatible function approximation.
arXiv Detail & Related papers (2024-06-03T20:05:04Z)
- Low-Switching Policy Gradient with Exploration via Online Sensitivity Sampling [23.989009116398208]
We design a low-switching sample-efficient policy optimization algorithm, LPO, with general non-linear function approximation.
We show that our algorithm obtains an $\varepsilon$-optimal policy with only $\widetilde{O}(\frac{\text{poly}(d)}{\varepsilon^{3}})$ samples.
arXiv Detail & Related papers (2023-06-15T23:51:46Z)
- Differentially-Private Hierarchical Clustering with Provable Approximation Guarantees [79.59010418610625]
We study differentially private approximation algorithms for hierarchical clustering.
We show strong lower bounds for the problem: any $\epsilon$-DP algorithm must exhibit $O(|V|^2/\epsilon)$ additive error for an input dataset $V$.
We propose a private $1+o(1)$ approximation algorithm which also recovers the blocks exactly.
arXiv Detail & Related papers (2023-01-31T19:14:30Z)
- A gradient estimator via L1-randomization for online zero-order optimization with two point feedback [93.57603470949266]
We present a novel gradient estimator based on two function evaluations and randomization (an illustrative sketch follows after this list).
We consider two types of assumptions on the noise of the zero-order oracle: canceling noise and adversarial noise.
We provide an anytime and completely data-driven algorithm, which is adaptive to all parameters of the problem.
arXiv Detail & Related papers (2022-05-27T11:23:57Z)
- On the Convergence Rate of Off-Policy Policy Optimization Methods with Density-Ratio Correction [28.548040329949387]
We study the convergence properties of off-policy policy improvement algorithms with state-action density ratio correction.
We present two strategies with finite-time convergence guarantees.
We prove that O-SPIM converges to a stationary point with total complexity $O(\epsilon^{-4})$, which matches the convergence rate of some recent actor-critic algorithms in the on-policy setting.
arXiv Detail & Related papers (2021-06-02T07:26:29Z)
- Finite-Sample Analysis of Off-Policy Natural Actor-Critic Algorithm [4.932130498861987]
We provide finite-sample convergence guarantees for an off-policy variant of the natural actor-critic (NAC) algorithm based on Importance Sampling.
We show that the algorithm converges to a global optimal policy with a sample complexity of $\mathcal{O}(\epsilon^{-3}\log^2(1/\epsilon))$ under an appropriate choice of step sizes.
arXiv Detail & Related papers (2021-02-18T13:22:59Z)
- Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
Bootstrapping is necessary and, along with off-policy learning and function approximation, results in the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2021-01-08T00:43:04Z)
- Improving Sample Complexity Bounds for (Natural) Actor-Critic Algorithms [58.57004511121862]
This paper characterizes the convergence rate and sample complexity of AC and NAC under Markovian sampling.
We show that AC and NAC attain an orderwise performance improvement over PG and NPG in the infinite-horizon setting due to the incorporation of a critic.
arXiv Detail & Related papers (2020-04-27T17:11:06Z)
- Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis [102.29671176698373]
We address the problem of policy evaluation in discounted Markov decision processes, and provide Markov-dependent guarantees on the $\ell_\infty$ error under a generative model.
We establish both asymptotic and non-asymptotic versions of local minimax lower bounds for policy evaluation, thereby providing an instance-dependent baseline by which to compare algorithms.
arXiv Detail & Related papers (2020-03-16T17:15:28Z)
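The "gradient estimator via L1-randomization" entry above centers on a two-point zero-order construction: perturb the query point along a random direction, evaluate the objective on both sides, and scale the difference into a gradient surrogate. The Python sketch below is an illustrative version with directions drawn uniformly from the $\ell_1$ sphere; the $d/(2h)$ scaling and the sign-based direction follow the generic construction rather than the paper's exact estimator, so treat constants and distributional details as assumptions.

```python
import numpy as np

def two_point_l1_gradient(f, x, h, rng):
    """Two-point (symmetric-difference) zero-order gradient surrogate with an
    l1-sphere direction (illustrative; constants are assumptions).

    f: objective returning a float; x: current point; h: perturbation radius.
    """
    d = x.shape[0]
    # uniform point on the simplex via normalized exponentials, with
    # independent random signs -> direction uniform on the l1 sphere
    e = rng.exponential(size=d)
    zeta = rng.choice([-1.0, 1.0], size=d) * e / e.sum()
    diff = f(x + h * zeta) - f(x - h * zeta)  # two function evaluations
    return (d / (2.0 * h)) * diff * np.sign(zeta)

def zero_order_descent(f, x0, h, lr, steps, seed=0):
    """Plain gradient descent driven by the two-point surrogate."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x -= lr * two_point_l1_gradient(f, x, h, rng)
    return x

# usage: minimize a smooth quadratic without gradient access
x_hat = zero_order_descent(lambda z: float(np.sum((z - 1.0) ** 2)),
                           np.zeros(5), h=1e-3, lr=0.05, steps=2000)
```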