Finite-Sample Analysis of Off-Policy Natural Actor-Critic with Linear
Function Approximation
- URL: http://arxiv.org/abs/2105.12540v1
- Date: Wed, 26 May 2021 13:35:42 GMT
- Title: Finite-Sample Analysis of Off-Policy Natural Actor-Critic with Linear
Function Approximation
- Authors: Zaiwei Chen, Sajad Khodadadian, Siva Theja Maguluri
- Abstract summary: We develop a novel variant of off-policy natural actor-critic algorithm with linear function approximation.
We establish a sample complexity of $\mathcal{O}(\epsilon^{-3})$, outperforming all the previously known convergence bounds of such algorithms.
- Score: 5.543220407902113
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we develop a novel variant of off-policy natural actor-critic
algorithm with linear function approximation and we establish a sample
complexity of $\mathcal{O}(\epsilon^{-3})$, outperforming all the previously
known convergence bounds of such algorithms. In order to overcome the divergence due to the deadly triad in off-policy policy evaluation under function approximation, we develop a critic that employs an $n$-step TD-learning algorithm with a properly chosen $n$. We present finite-sample convergence bounds on this
critic under both constant and diminishing step sizes, which are of independent
interest. Furthermore, we develop a variant of natural policy gradient under
function approximation, with an improved convergence rate of $\mathcal{O}(1/T)$
after $T$ iterations. Combining the finite-sample error bounds of the actor and the critic, we obtain the $\mathcal{O}(\epsilon^{-3})$ sample complexity. We derive
our sample complexity bounds solely based on the assumption that the behavior
policy sufficiently explores all states and actions, which is a much milder assumption than those made in the related literature.
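The abstract describes two algorithmic components: an $n$-step off-policy TD critic with linear function approximation and a natural policy gradient (NPG) actor. The Python sketch below is a rough illustration under stated assumptions, not the paper's algorithm: it pairs a generic $n$-step TD critic with per-decision importance-sampling corrections and linear features with an NPG-style update for a linear softmax policy. The function names, the expected-value bootstrap, and all hyperparameters are illustrative; the paper's specific choice of $n$, correction scheme, and step-size schedules are not reproduced here.

```python
import numpy as np

def softmax_policy(theta, phi, state, actions):
    """Linear softmax policy: pi_theta(a|s) proportional to exp(theta . phi(s, a))."""
    prefs = np.array([theta @ phi(state, a) for a in actions])
    prefs -= prefs.max()                      # numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def n_step_td_critic(traj, phi, pi, mu, actions, gamma, n, alpha, w):
    """One pass of a generic n-step off-policy TD critic with linear features
    and per-decision importance-sampling corrections (illustrative only).

    traj: list of (state, action, reward) tuples collected under the behavior
    policy; pi(a, s) and mu(a, s) return target / behavior action probabilities.
    """
    for t in range(len(traj) - n):
        s_t, a_t, _ = traj[t]
        G, rho, disc = 0.0, 1.0, 1.0
        for k in range(n):
            s_k, a_k, r_k = traj[t + k]
            if k > 0:                         # the first action is given, so no ratio
                rho *= pi(a_k, s_k) / mu(a_k, s_k)
            G += disc * rho * r_k
            disc *= gamma
        # bootstrap with the expected value of Q_w under the target policy
        s_end = traj[t + n][0]
        v_bar = sum(pi(a, s_end) * float(w @ phi(s_end, a)) for a in actions)
        G += disc * rho * v_bar
        # semi-gradient TD update on the linear critic weights
        w = w + alpha * (G - w @ phi(s_t, a_t)) * phi(s_t, a_t)
    return w

def natural_actor_step(theta, w, beta):
    """NPG-style actor step for a linear softmax policy that shares features
    with the critic: adding a multiple of the critic weights reweights each
    action by exp(beta * Q_w(s, a)); action-independent shifts cancel in the
    softmax."""
    return theta + beta * w
```

A full run would alternate the two pieces: collect behavior-policy data, run one or more critic passes, then take an actor step, repeating for $T$ iterations.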
Related papers
- Improved Sample Complexity for Global Convergence of Actor-Critic Algorithms [49.19842488693726]
We establish the global convergence of the actor-critic algorithm with a significantly improved sample complexity of $O(\epsilon^{-3})$.
Our findings provide theoretical support for many algorithms that rely on constant step sizes.
arXiv Detail & Related papers (2024-10-11T14:46:29Z)
- Non-Asymptotic Analysis for Single-Loop (Natural) Actor-Critic with Compatible Function Approximation [18.77565744533582]
Actor-critic (AC) is a powerful method for learning an optimal policy in reinforcement learning.
AC converges to an $\epsilon + \varepsilon_{\text{critic}}$ neighborhood of stationary points with the best known sample complexity.
This paper analyzes the convergence of both AC and NAC algorithms with compatible function approximation.
arXiv Detail & Related papers (2024-06-03T20:05:04Z)
- Low-Switching Policy Gradient with Exploration via Online Sensitivity Sampling [23.989009116398208]
We design a low-switching sample-efficient policy optimization algorithm, LPO, with general non-linear function approximation.
We show that our algorithm obtains an $\varepsilon$-optimal policy with only $\widetilde{O}(\frac{\text{poly}(d)}{\varepsilon^{3}})$ samples.
arXiv Detail & Related papers (2023-06-15T23:51:46Z)
- Differentially-Private Hierarchical Clustering with Provable Approximation Guarantees [79.59010418610625]
We study differentially private approximation algorithms for hierarchical clustering.
We show strong lower bounds for the problem: any $\epsilon$-DP algorithm must exhibit $O(|V|^2/\epsilon)$ additive error for an input dataset $V$.
We propose a private $1+o(1)$ approximation algorithm which also recovers the blocks exactly.
arXiv Detail & Related papers (2023-01-31T19:14:30Z)
- A gradient estimator via L1-randomization for online zero-order optimization with two point feedback [93.57603470949266]
We present a novel gradient estimator based on two function evaluations and randomization (an illustrative sketch follows after this list).
We consider two types of assumptions on the noise of the zero-order oracle: canceling noise and adversarial noise.
We provide an anytime and completely data-driven algorithm, which is adaptive to all parameters of the problem.
arXiv Detail & Related papers (2022-05-27T11:23:57Z)
- On the Convergence Rate of Off-Policy Policy Optimization Methods with Density-Ratio Correction [28.548040329949387]
We study the convergence properties of off-policy policy improvement algorithms with state-action density ratio correction.
We present two strategies with finite-time convergence guarantees.
We prove that O-SPIM converges to a stationary point with total complexity $O(\epsilon^{-4})$, which matches the convergence rate of some recent actor-critic algorithms in the on-policy setting.
arXiv Detail & Related papers (2021-06-02T07:26:29Z)
- Finite-Sample Analysis of Off-Policy Natural Actor-Critic Algorithm [4.932130498861987]
We provide finite-sample convergence guarantees for an off-policy variant of the natural actor-critic (NAC) algorithm based on Importance Sampling.
We show that the algorithm converges to a global optimal policy with a sample complexity of $\mathcal{O}(\epsilon^{-3}\log^2(1/\epsilon))$ under an appropriate choice of step sizes.
arXiv Detail & Related papers (2021-02-18T13:22:59Z)
- Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
Bootstrapping is necessary and, along with off-policy learning and function approximation, results in the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2021-01-08T00:43:04Z)
- Improving Sample Complexity Bounds for (Natural) Actor-Critic Algorithms [58.57004511121862]
This paper characterizes the convergence rate and sample complexity of AC and NAC under Markovian sampling.
We show that AC and NAC attain an orderwise performance improvement over PG and NPG in the infinite-horizon setting due to the incorporation of a critic.
arXiv Detail & Related papers (2020-04-27T17:11:06Z)
- Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis [102.29671176698373]
We address the problem of policy evaluation in discounted Markov decision processes, and provide Markov-dependent guarantees on the $\ell_\infty$ error under a generative model.
We establish both asymptotic and non-asymptotic versions of local minimax lower bounds for policy evaluation, thereby providing an instance-dependent baseline by which to compare algorithms.
arXiv Detail & Related papers (2020-03-16T17:15:28Z)
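The "gradient estimator via L1-randomization" entry above centers on a two-point zero-order construction: perturb the query point along a random direction, evaluate the objective on both sides, and scale the difference into a gradient surrogate. The Python sketch below is an illustrative version with directions drawn uniformly from the $\ell_1$ sphere; the $d/(2h)$ scaling and the sign-based direction follow the generic construction rather than the paper's exact estimator, so treat constants and distributional details as assumptions.

```python
import numpy as np

def two_point_l1_gradient(f, x, h, rng):
    """Two-point (symmetric-difference) zero-order gradient surrogate with an
    l1-sphere direction (illustrative; constants are assumptions).

    f: objective returning a float; x: current point; h: perturbation radius.
    """
    d = x.shape[0]
    # uniform point on the simplex via normalized exponentials, with
    # independent random signs -> direction uniform on the l1 sphere
    e = rng.exponential(size=d)
    zeta = rng.choice([-1.0, 1.0], size=d) * e / e.sum()
    diff = f(x + h * zeta) - f(x - h * zeta)  # two function evaluations
    return (d / (2.0 * h)) * diff * np.sign(zeta)

def zero_order_descent(f, x0, h, lr, steps, seed=0):
    """Plain gradient descent driven by the two-point surrogate."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x -= lr * two_point_l1_gradient(f, x, h, rng)
    return x

# usage: minimize a smooth quadratic without gradient access
x_hat = zero_order_descent(lambda z: float(np.sum((z - 1.0) ** 2)),
                           np.zeros(5), h=1e-3, lr=0.05, steps=2000)
```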