An Improved Analysis of (Variance-Reduced) Policy Gradient and Natural
Policy Gradient Methods
- URL: http://arxiv.org/abs/2211.07937v2
- Date: Wed, 16 Nov 2022 05:55:48 GMT
- Title: An Improved Analysis of (Variance-Reduced) Policy Gradient and Natural
Policy Gradient Methods
- Authors: Yanli Liu, Kaiqing Zhang, Tamer Başar and Wotao Yin
- Abstract summary: We revisit and improve the convergence of policy gradient (PG), natural PG (NPG) methods, and their variance-reduced variants.
We show that a state-of-the-art variance-reduced PG method, which has only been shown to converge to stationary points, converges to the globally optimal value up to some inherent function approximation error.
We have also made variance-reduction for NPG possible, with both global convergence and an efficient finite-sample complexity.
- Score: 40.657905797628786
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we revisit and improve the convergence of policy gradient
(PG), natural PG (NPG) methods, and their variance-reduced variants, under
general smooth policy parametrizations. More specifically, with the Fisher
information matrix of the policy being positive definite: i) we show that a
state-of-the-art variance-reduced PG method, which has only been shown to
converge to stationary points, converges to the globally optimal value up to
some inherent function approximation error due to policy parametrization; ii)
we show that NPG enjoys a lower sample complexity; iii) we propose SRVR-NPG,
which incorporates variance-reduction into the NPG update. Our improvements
follow from an observation that the convergence of (variance-reduced) PG and
NPG methods can improve each other: the stationary convergence analysis of PG
can be applied to NPG as well, and the global convergence analysis of NPG can
help to establish the global convergence of (variance-reduced) PG methods. Our
analysis carefully integrates the advantages of these two lines of works.
Thanks to this improvement, we have also made variance-reduction for NPG
possible, with both global convergence and an efficient finite-sample
complexity.
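To make the updates discussed above concrete, the following is a minimal Python sketch, not the authors' implementation: the function names pg_step, npg_step, and srvr_style_grad are hypothetical, and importance weighting between successive policies is omitted. Vanilla PG follows a stochastic estimate of $\nabla J(\theta)$, NPG preconditions that estimate by the Fisher information matrix $F(\theta)$ (assumed positive definite, as in the paper), and the recursive estimator illustrates the SRVR-style variance-reduction device behind SRVR-PG and the proposed SRVR-NPG.

    import numpy as np

    def pg_step(theta, grad_estimate, step_size=1e-2):
        # Vanilla PG: move along a stochastic estimate of grad J(theta).
        return theta + step_size * grad_estimate

    def npg_step(theta, grad_estimate, fisher_estimate, step_size=1e-2, damping=1e-3):
        # NPG: precondition the gradient estimate by the Fisher information
        # matrix F(theta). The analysis assumes F(theta) is positive definite;
        # the damping term is only a practical safeguard against a
        # near-singular sample estimate.
        dim = theta.size
        natural_grad = np.linalg.solve(fisher_estimate + damping * np.eye(dim), grad_estimate)
        return theta + step_size * natural_grad

    def srvr_style_grad(prev_estimate, grad_at_new_theta, grad_at_old_theta):
        # Schematic recursive variance-reduced estimator: reuse the previous
        # estimate and correct it with a gradient difference measured on a
        # small fresh batch (importance weights omitted for brevity).
        return prev_estimate + grad_at_new_theta - grad_at_old_theta

Because the same stochastic gradient estimate can feed either update, the stationary-point analysis of PG transfers to NPG and the global-convergence analysis of NPG transfers back to (variance-reduced) PG, which is the observation the paper builds on.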
Related papers
- Global Convergence of Natural Policy Gradient with Hessian-aided
Momentum Variance Reduction [6.320200835271402]
Natural policy gradient (NPG) and its variants are widely-used policy search methods in reinforcement learning.
A new NPG variant, coined NPG-HM, is developed in this paper; it uses the Hessian-aided momentum technique for variance reduction.
Experiments on Mujoco-based environments demonstrate the superior performance of NPG-HM over other state-of-the-art policy gradient methods.
arXiv Detail & Related papers (2024-01-02T07:56:17Z) - Accelerated Policy Gradient: On the Convergence Rates of the Nesterov Momentum for Reinforcement Learning [12.987019067098412]
We adapt the celebrated Nesterov accelerated gradient (NAG) method to policy optimization in reinforcement learning (RL), yielding Accelerated Policy Gradient (APG).
We prove that APG converges to an optimal policy at rates: (i) $\tilde{O}(1/t^2)$ with constant step sizes; (ii) $O(e^{-ct})$ with exponentially growing step sizes.
arXiv Detail & Related papers (2023-10-18T11:33:22Z) - Last-Iterate Convergent Policy Gradient Primal-Dual Methods for
Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this is the first non-asymptotic last-iterate convergence result for the policy iterates of single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z) - The Role of Baselines in Policy Gradient Optimization [83.42050606055822]
We show that the state value baseline allows on-policy natural policy gradient (NPG) to converge to a globally optimal policy at an $O(1/t)$ rate.
We find that the primary effect of the value baseline is to reduce the aggressiveness of the updates rather than their variance.
arXiv Detail & Related papers (2023-01-16T06:28:00Z) - Linear Convergence of Natural Policy Gradient Methods with Log-Linear
Policies [115.86431674214282]
We consider infinite-horizon discounted Markov decision processes and study the convergence rates of the natural policy gradient (NPG) and the Q-NPG methods with the log-linear policy class.
We show that both methods attain linear convergence rates and $\mathcal{O}(1/\epsilon^2)$ sample complexities using a simple, non-adaptive geometrically increasing step size.
arXiv Detail & Related papers (2022-10-04T06:17:52Z) - Optimal Estimation of Off-Policy Policy Gradient via Double Fitted
Iteration [39.250754806600135]
Policy gradient (PG) estimation becomes a challenge when we are not allowed to sample with the target policy.
Conventional methods for off-policy PG estimation often suffer from significant bias or exponentially large variance.
In this paper, we propose the double Fitted PG estimation (FPG) algorithm.
arXiv Detail & Related papers (2022-01-31T20:23:52Z) - On the Global Convergence of Momentum-based Policy Gradient [9.622367651590878]
Policy gradient (PG) methods are popular and efficient for large-scale reinforcement learning.
We study the global convergence of PG methods with momentum terms, which have been demonstrated to be efficient recipes for improving PG methods.
Our work is the first one that obtains global convergence results for the momentum-based PG methods.
arXiv Detail & Related papers (2021-10-19T17:16:29Z) - A general sample complexity analysis of vanilla policy gradient [101.16957584135767]
Policy gradient (PG) is one of the most popular methods for solving reinforcement learning (RL) problems, yet the theoretical understanding of even the "vanilla" PG method has remained limited.
This paper provides a general sample complexity analysis of vanilla PG.
arXiv Detail & Related papers (2021-07-23T19:38:17Z) - Fast Global Convergence of Natural Policy Gradient Methods with Entropy
Regularization [44.24881971917951]
Natural policy gradient (NPG) methods are among the most widely used policy optimization algorithms.
We develop convergence guarantees for entropy-regularized NPG methods under softmax parameterization.
Our results accommodate a wide range of learning rates, and shed light upon the role of entropy regularization in enabling fast convergence.
arXiv Detail & Related papers (2020-07-13T17:58:41Z) - Zeroth-order Deterministic Policy Gradient [116.87117204825105]
We introduce Zeroth-order Deterministic Policy Gradient (ZDPG).
ZDPG approximates policy-reward gradients via two-point evaluations of the $Q$-function; a generic two-point estimator of this kind is sketched after this list.
New finite sample complexity bounds for ZDPG improve upon existing results by up to two orders of magnitude.
arXiv Detail & Related papers (2020-06-12T16:52:29Z)
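As a companion to the ZDPG entry above, here is a minimal Python sketch of a generic two-point (zeroth-order) gradient estimator; it is an illustration under simplifying assumptions, not the ZDPG authors' code, and the names two_point_grad and J are hypothetical. Two evaluations of a scalar objective along a random unit direction give a finite-difference estimate of its gradient, which is the kind of estimator ZDPG applies to $Q$-function evaluations.

    import numpy as np

    def two_point_grad(J, theta, smoothing=1e-2, rng=None):
        # Two-point zeroth-order estimate of grad J(theta): query J at two
        # symmetric perturbations along a random unit direction u and scale
        # the finite difference by the dimension.
        rng = np.random.default_rng() if rng is None else rng
        u = rng.standard_normal(theta.shape)
        u /= np.linalg.norm(u)
        dim = theta.size
        return dim * (J(theta + smoothing * u) - J(theta - smoothing * u)) / (2.0 * smoothing) * u

    # Usage example with a simple quadratic objective.
    theta = np.array([1.0, 2.0])
    grad_estimate = two_point_grad(lambda x: -float(np.sum(x ** 2)), theta)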