Finite-Sample Analysis of Off-Policy Natural Actor-Critic Algorithm
- URL: http://arxiv.org/abs/2102.09318v1
- Date: Thu, 18 Feb 2021 13:22:59 GMT
- Title: Finite-Sample Analysis of Off-Policy Natural Actor-Critic Algorithm
- Authors: Sajad Khodadadian, Zaiwei Chen, and Siva Theja Maguluri
- Abstract summary: We provide finite-sample convergence guarantees for an off-policy variant of the natural actor-critic (NAC) algorithm based on Importance Sampling.
We show that the algorithm converges to a globally optimal policy with a sample complexity of $\mathcal{O}(\epsilon^{-3}\log^2(1/\epsilon))$ under an appropriate choice of stepsizes.
- Score: 4.932130498861987
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we provide finite-sample convergence guarantees for an
off-policy variant of the natural actor-critic (NAC) algorithm based on
Importance Sampling. In particular, we show that the algorithm converges to a
globally optimal policy with a sample complexity of
$\mathcal{O}(\epsilon^{-3}\log^2(1/\epsilon))$ under an appropriate choice of
stepsizes. In order to overcome the issue of large variance due to Importance
Sampling, we propose the $Q$-trace algorithm for the critic, which is inspired
by the V-trace algorithm (Espeholt et al., 2018). This enables us to explicitly
control the bias and variance, and characterize the trade-off between them. As
an advantage of off-policy sampling, a major feature of our result is that we
do not need any additional assumptions, beyond the ergodicity of the Markov
chain induced by the behavior policy.
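To make the bias-variance control concrete, below is a minimal tabular sketch, in Python, of a Q-trace-style critic update built on clipped importance-sampling ratios, following the V-trace template of Espeholt et al. (2018). The function names, the n-step form, and all parameter values are illustrative assumptions rather than the paper's exact Q-trace operator.

```python
import numpy as np

def q_trace_update(Q, traj, pi, mu, gamma=0.99, alpha=0.1,
                   rho_bar=1.0, c_bar=1.0):
    """One n-step Q-trace-style critic update on a tabular Q estimate.

    Illustrative sketch: the clipped importance ratios follow the V-trace
    template (Espeholt et al., 2018); see the paper for the exact Q-trace
    operator and stepsize choices.

    Q    -- (num_states, num_actions) array, current critic estimate
    traj -- list of (s, a, r, s_next) transitions sampled under mu
    pi   -- (num_states, num_actions) target-policy probabilities
    mu   -- (num_states, num_actions) behavior-policy probabilities
    rho_bar, c_bar -- clipping levels: smaller values cut the variance
                      of the importance-sampling correction but add bias
    """
    s0, a0 = traj[0][0], traj[0][1]
    correction, trace = 0.0, 1.0   # trace = running product of clipped c_i
    for t, (s, a, r, s_next) in enumerate(traj):
        ratio = pi[s, a] / mu[s, a]                # importance-sampling ratio
        rho = min(rho_bar, ratio)                  # clipped weight on the TD error
        v_next = np.dot(pi[s_next], Q[s_next])     # E_{a'~pi} Q(s', a')
        td = rho * (r + gamma * v_next - Q[s, a])  # clipped off-policy TD error
        correction += (gamma ** t) * trace * td
        trace *= min(c_bar, ratio)                 # clipped trace coefficient
    Q = Q.copy()
    Q[s0, a0] += alpha * correction
    return Q

def nac_actor_step(logits, Q, beta=0.1):
    """Tabular softmax natural-policy-gradient step. With a softmax
    parameterization, the natural gradient update reduces to adding the
    critic's Q estimate to the logits (a standard fact about NPG, used
    here only to sketch the actor side of the actor-critic loop)."""
    return logits + beta * Q
```

Setting rho_bar = c_bar = 1 with pi = mu recovers an ordinary on-policy n-step TD update; lowering the clipping levels reduces the variance introduced by importance sampling at the cost of bias toward the behavior policy, which is the trade-off the abstract refers to.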
Related papers
- Nearly Optimal Sample Complexity of Offline KL-Regularized Contextual Bandits under Single-Policy Concentrability [49.96531901205305]
We propose the first algorithm with $\tilde{O}(\epsilon^{-1})$ sample complexity under single-policy concentrability for offline contextual bandits.
Our proof leverages the strong convexity of the KL regularization, and the conditional non-negativity of the gap between the true reward and its pessimistic estimator.
We extend our algorithm to contextual dueling bandits and achieve a similar nearly optimal sample complexity.
arXiv Detail & Related papers (2025-02-09T22:14:45Z) - Finite-Time Analysis of Three-Timescale Constrained Actor-Critic and Constrained Natural Actor-Critic Algorithms [5.945710235932345]
We consider actor-critic and natural actor-critic algorithms with function approximation for constrained Markov decision processes.
We carry out a non-asymptotic analysis of both algorithms in a non-i.i.d. (Markovian) setting.
We also show the results of experiments on three different Safety-Gym environments.
arXiv Detail & Related papers (2023-10-25T05:04:00Z) - Improved Sample Complexity Analysis of Natural Policy Gradient Algorithm
with General Parameterization for Infinite Horizon Discounted Reward Markov
Decision Processes [41.61653528766776]
We propose the Accelerated Natural Policy Gradient (ANPG) algorithm, which utilizes an accelerated gradient descent process to obtain the natural policy gradient.
ANPG achieves $\mathcal{O}(\epsilon^{-2})$ sample complexity and $\mathcal{O}(\epsilon^{-1})$ iteration complexity.
In the class of Hessian-free and IS-free algorithms, ANPG beats the best-known sample complexity by a factor of $\mathcal{O}(\epsilon^{-\frac{1}{2}})$ and simultaneously matches their state-of-the-art iteration complexity.
arXiv Detail & Related papers (2023-10-18T03:00:15Z) - On the Global Convergence of Natural Actor-Critic with Two-layer Neural
Network Parametrization [38.32265770020665]
We study a natural actor-critic algorithm that utilizes neural networks to represent the critic.
Our aim is to establish sample complexity guarantees for this algorithm, achieving a deeper understanding of its performance characteristics.
arXiv Detail & Related papers (2023-06-18T06:22:04Z) - Exploring the Algorithm-Dependent Generalization of AUPRC Optimization
with List Stability [107.65337427333064]
Optimization of the Area Under the Precision-Recall Curve (AUPRC) is a crucial problem for machine learning.
In this work, we present the first study of the algorithm-dependent generalization of AUPRC optimization.
Experiments on three image retrieval datasets speak to the effectiveness and soundness of our framework.
arXiv Detail & Related papers (2022-09-27T09:06:37Z) - Navigating to the Best Policy in Markov Decision Processes [68.8204255655161]
We investigate the active pure exploration problem in Markov Decision Processes.
The agent sequentially selects actions and, from the resulting system trajectory, aims at identifying the best policy as fast as possible.
arXiv Detail & Related papers (2021-06-05T09:16:28Z) - Finite-Sample Analysis of Off-Policy Natural Actor-Critic with Linear
Function Approximation [5.543220407902113]
We develop a novel variant of off-policy natural actor-critic algorithm with linear function approximation.
We establish a sample complexity of $\mathcal{O}(\epsilon^{-3})$, outperforming all the previously known convergence bounds of such algorithms.
arXiv Detail & Related papers (2021-05-26T13:35:42Z) - Adaptive Sampling for Best Policy Identification in Markov Decision
Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z) - Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis [102.29671176698373]
We address the problem of policy evaluation in discounted Markov decision processes, and provide instance-dependent guarantees on the $\ell_\infty$-error under a generative model.
We establish both asymptotic and non-asymptotic versions of local minimax lower bounds for policy evaluation, thereby providing an instance-dependent baseline by which to compare algorithms.
arXiv Detail & Related papers (2020-03-16T17:15:28Z) - The Simulator: Understanding Adaptive Sampling in the
Moderate-Confidence Regime [52.38455827779212]
We propose a novel technique for analyzing adaptive sampling called the Simulator.
We prove the first instance-based lower bounds for the top-k problem which incorporate the appropriate log-factors.
Our new analysis inspires a simple and near-optimal algorithm for best-arm and top-k identification, the first practical algorithm of its kind for the latter problem.
arXiv Detail & Related papers (2017-02-16T23:42:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.