Finite-Sample Analysis of Off-Policy Natural Actor-Critic Algorithm
- URL: http://arxiv.org/abs/2102.09318v1
- Date: Thu, 18 Feb 2021 13:22:59 GMT
- Title: Finite-Sample Analysis of Off-Policy Natural Actor-Critic Algorithm
- Authors: Sajad Khodadadian, Zaiwei Chen, and Siva Theja Maguluri
- Abstract summary: We provide finite-sample convergence guarantees for an off-policy variant of the natural actor-critic (NAC) algorithm based on Importance Sampling.
We show that the algorithm converges to a globally optimal policy with a sample complexity of $\mathcal{O}(\epsilon^{-3}\log^2(1/\epsilon))$ under an appropriate choice of stepsizes.
- Score: 4.932130498861987
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we provide finite-sample convergence guarantees for an
off-policy variant of the natural actor-critic (NAC) algorithm based on
Importance Sampling. In particular, we show that the algorithm converges to a
globally optimal policy with a sample complexity of
$\mathcal{O}(\epsilon^{-3}\log^2(1/\epsilon))$ under an appropriate choice of
stepsizes. In order to overcome the issue of large variance due to Importance
Sampling, we propose the $Q$-trace algorithm for the critic, which is inspired
by the V-trace algorithm (Espeholt et al., 2018). This enables us to explicitly
control the bias and variance, and characterize the trade-off between them. As
an advantage of off-policy sampling, a major feature of our result is that we
do not need any additional assumptions, beyond the ergodicity of the Markov
chain induced by the behavior policy.
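To make the bias-variance control concrete, below is a minimal tabular sketch, in Python, of a Q-trace-style critic update built on clipped importance-sampling ratios, following the V-trace template of Espeholt et al. (2018). The function names, the n-step form, and all parameter values are illustrative assumptions rather than the paper's exact Q-trace operator.

```python
import numpy as np

def q_trace_update(Q, traj, pi, mu, gamma=0.99, alpha=0.1,
                   rho_bar=1.0, c_bar=1.0):
    """One n-step Q-trace-style critic update on a tabular Q estimate.

    Illustrative sketch: the clipped importance ratios follow the V-trace
    template (Espeholt et al., 2018); see the paper for the exact Q-trace
    operator and stepsize choices.

    Q    -- (num_states, num_actions) array, current critic estimate
    traj -- list of (s, a, r, s_next) transitions sampled under mu
    pi   -- (num_states, num_actions) target-policy probabilities
    mu   -- (num_states, num_actions) behavior-policy probabilities
    rho_bar, c_bar -- clipping levels: smaller values cut the variance
                      of the importance-sampling correction but add bias
    """
    s0, a0 = traj[0][0], traj[0][1]
    correction, trace = 0.0, 1.0   # trace = running product of clipped c_i
    for t, (s, a, r, s_next) in enumerate(traj):
        ratio = pi[s, a] / mu[s, a]                # importance-sampling ratio
        rho = min(rho_bar, ratio)                  # clipped weight on the TD error
        v_next = np.dot(pi[s_next], Q[s_next])     # E_{a'~pi} Q(s', a')
        td = rho * (r + gamma * v_next - Q[s, a])  # clipped off-policy TD error
        correction += (gamma ** t) * trace * td
        trace *= min(c_bar, ratio)                 # clipped trace coefficient
    Q = Q.copy()
    Q[s0, a0] += alpha * correction
    return Q

def nac_actor_step(logits, Q, beta=0.1):
    """Tabular softmax natural-policy-gradient step. With a softmax
    parameterization, the natural gradient update reduces to adding the
    critic's Q estimate to the logits (a standard fact about NPG, used
    here only to sketch the actor side of the actor-critic loop)."""
    return logits + beta * Q
```

Setting rho_bar = c_bar = 1 with pi = mu recovers an ordinary on-policy n-step TD update; lowering the clipping levels reduces the variance introduced by importance sampling at the cost of bias toward the behavior policy, which is the trade-off the abstract refers to.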
Related papers
- Nearly Optimal Sample Complexity of Offline KL-Regularized Contextual Bandits under Single-Policy Concentrability [49.96531901205305]
We propose the first algorithm with $\tilde{O}(\epsilon^{-1})$ sample complexity under single-policy concentrability for offline contextual bandits.
Our proof leverages the strong convexity of the KL regularization, and the conditional non-negativity of the gap between the true reward and its pessimistic estimator.
We extend our algorithm to contextual dueling bandits and achieve a similar nearly optimal sample complexity.
arXiv Detail & Related papers (2025-02-09T22:14:45Z) - Finite-Time Analysis of Three-Timescale Constrained Actor-Critic and Constrained Natural Actor-Critic Algorithms [5.945710235932345]
We consider actor-critic and natural actor-critic algorithms with function approximation for constrained Markov decision processes.
We carry out a non-asymptotic analysis of both algorithms in a non-i.i.d. (Markovian) setting.
We also show the results of experiments on three different Safety-Gym environments.
arXiv Detail & Related papers (2023-10-25T05:04:00Z) - Improved Sample Complexity Analysis of Natural Policy Gradient Algorithm
with General Parameterization for Infinite Horizon Discounted Reward Markov
Decision Processes [41.61653528766776]
We propose the Accelerated Natural Policy Gradient (ANPG) algorithm, which utilizes an accelerated gradient descent process to obtain the natural policy gradient.
ANPG achieves $\mathcal{O}(\epsilon^{-2})$ sample complexity and $\mathcal{O}(\epsilon^{-1})$ iteration complexity.
In the class of Hessian-free and IS-free algorithms, ANPG beats the best-known sample complexity by a factor of $\mathcal{O}(\epsilon^{-\frac{1}{2}})$ and simultaneously matches their state-of-the-art iteration complexity.
arXiv Detail & Related papers (2023-10-18T03:00:15Z) - On the Global Convergence of Natural Actor-Critic with Two-layer Neural
Network Parametrization [38.32265770020665]
We study a natural actor-critic algorithm that utilizes neural networks to represent the critic.
Our aim is to establish sample complexity guarantees for this algorithm, achieving a deeper understanding of its performance characteristics.
arXiv Detail & Related papers (2023-06-18T06:22:04Z) - Exploring the Algorithm-Dependent Generalization of AUPRC Optimization
with List Stability [107.65337427333064]
Optimization of the Area Under the Precision-Recall Curve (AUPRC) is a crucial problem for machine learning.
In this work, we present the first study of the algorithm-dependent generalization of AUPRC optimization.
Experiments on three image retrieval datasets speak to the effectiveness and soundness of our framework.
arXiv Detail & Related papers (2022-09-27T09:06:37Z) - Navigating to the Best Policy in Markov Decision Processes [68.8204255655161]
We investigate the active pure exploration problem in Markov Decision Processes.
The agent sequentially selects actions and, from the resulting system trajectory, aims at identifying the best policy as fast as possible.
arXiv Detail & Related papers (2021-06-05T09:16:28Z) - Finite-Sample Analysis of Off-Policy Natural Actor-Critic with Linear
Function Approximation [5.543220407902113]
We develop a novel variant of off-policy natural actor-critic algorithm with linear function approximation.
We establish a sample complexity of $\mathcal{O}(\epsilon^{-3})$, outperforming all the previously known convergence bounds of such algorithms.
arXiv Detail & Related papers (2021-05-26T13:35:42Z) - Adaptive Sampling for Best Policy Identification in Markov Decision
Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z) - Is Temporal Difference Learning Optimal? An Instance-Dependent Analysis [102.29671176698373]
We address the problem of policy evaluation in discounted Markov decision processes, and provide instance-dependent guarantees on the $\ell_\infty$-error under a generative model.
We establish both asymptotic and non-asymptotic versions of local minimax lower bounds for policy evaluation, thereby providing an instance-dependent baseline by which to compare algorithms.
arXiv Detail & Related papers (2020-03-16T17:15:28Z) - The Simulator: Understanding Adaptive Sampling in the
Moderate-Confidence Regime [52.38455827779212]
We propose a novel technique for analyzing adaptive sampling called the Simulator.
We prove the first instance-based lower bounds for the top-k problem which incorporate the appropriate log-factors.
Our new analysis inspires a simple and near-optimal algorithm for best-arm and top-k identification, the first practical algorithm of its kind for the latter problem.
arXiv Detail & Related papers (2017-02-16T23:42:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.