On the Global Convergence of Momentum-based Policy Gradient
- URL: http://arxiv.org/abs/2110.10116v1
- Date: Tue, 19 Oct 2021 17:16:29 GMT
- Title: On the Global Convergence of Momentum-based Policy Gradient
- Authors: Yuhao Ding, Junzi Zhang, Javad Lavaei
- Abstract summary: Policy gradient (PG) methods are popular and efficient for large-scale reinforcement learning.
We study the global convergence of PG methods with momentum terms, which have been demonstrated to be efficient recipes for improving PG methods.
Our work is the first to obtain global convergence results for momentum-based PG methods.
- Score: 9.622367651590878
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Policy gradient (PG) methods are popular and efficient for large-scale
reinforcement learning due to their relative stability and incremental nature.
In recent years, the empirical success of PG methods has led to the development
of a theoretical foundation for these methods. In this work, we generalize this
line of research by studying the global convergence of stochastic PG methods
with momentum terms, which have been demonstrated to be efficient recipes for
improving PG methods. We study both the soft-max and the Fisher-non-degenerate
policy parametrizations, and show that adding a momentum term improves the global
optimality sample complexity of vanilla PG methods by
$\tilde{\mathcal{O}}(\epsilon^{-1.5})$ and
$\tilde{\mathcal{O}}(\epsilon^{-1})$, respectively, where $\epsilon>0$ is the
target tolerance. Our work is the first to obtain global convergence results
for momentum-based PG methods. For the generic Fisher-non-degenerate policy
parametrizations, ours is the first single-loop and finite-batch PG algorithm
achieving $\tilde{\mathcal{O}}(\epsilon^{-3})$ global optimality sample
complexity. Finally, as a by-product, our analysis also provides a general
framework for studying the global convergence rates of stochastic PG methods,
which can be easily applied and extended to different PG estimators.
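To make the momentum recipe concrete, here is a minimal sketch of a STORM-style momentum update on top of a REINFORCE gradient estimator, using a soft-max policy on a toy multi-armed bandit. This is an illustration in the spirit of the momentum-based PG methods studied above, not the paper's exact algorithm; the bandit environment, step size, and momentum weight are assumptions made for the example.

```python
import numpy as np

# Toy 3-armed bandit; the "policy" is a soft-max distribution over arms.
rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.9])   # expected reward per arm (assumed)
n_arms = len(true_means)

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def grad_log_pi(theta, a):
    """Gradient of the log soft-max policy w.r.t. theta, evaluated at action a."""
    g = -softmax(theta)
    g[a] += 1.0
    return g

theta = np.zeros(n_arms)
theta_prev = theta.copy()
d = np.zeros(n_arms)             # momentum / variance-reduced search direction
eta, beta = 0.1, 0.2             # step size and momentum weight (illustrative)

for t in range(2000):
    pi = softmax(theta)
    a = rng.choice(n_arms, p=pi)                  # sample an action
    r = rng.normal(true_means[a], 0.1)            # sample its reward

    g_curr = r * grad_log_pi(theta, a)            # REINFORCE estimate at theta_t
    # Importance weight so the same sample also estimates the gradient at theta_{t-1}.
    w = softmax(theta_prev)[a] / pi[a]
    g_prev = w * r * grad_log_pi(theta_prev, a)

    # STORM-style momentum: blend a plain estimate with a recursive correction term.
    d = beta * g_curr + (1.0 - beta) * (d + g_curr - g_prev)

    theta_prev = theta.copy()
    theta = theta + eta * d                       # gradient-ascent step

print("learned policy:", np.round(softmax(theta), 3))  # should concentrate on the 0.9 arm
```

The importance weight lets a single sampled action provide gradient estimates at both the current and previous parameters, which is what gives the momentum direction its variance-reduction effect while keeping the method single-loop and small-batch.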
Related papers
- Accelerated Policy Gradient: On the Convergence Rates of the Nesterov Momentum for Reinforcement Learning [12.987019067098412]
We adapt Nesterov's celebrated accelerated gradient (NAG) method to policy optimization in Reinforcement Learning (RL), yielding Accelerated Policy Gradient (APG).
We prove that APG converges to an optimal policy at rates: (i) $\tilde{O}(1/t^2)$ with constant step sizes; (ii) $O(e^{-ct})$ with exponentially growing step sizes.
arXiv Detail & Related papers (2023-10-18T11:33:22Z) - Stochastic Policy Gradient Methods: Improved Sample Complexity for
Fisher-non-degenerate Policies [19.779044926914704]
We develop improved global convergence guarantees for a general class of Fisher-non-degenerate parameterized policies.
In this work, we propose a Normalized Policy Gradient method with Implicit Gradient Transport (N-PG-IGT) and derive a $\tilde{\mathcal{O}}(\varepsilon^{-2.5})$ sample complexity of this method.
We further improve this complexity to $\tilde{\mathcal{O}}(\varepsilon^{-2})$ by considering a Hessian-Aided Recursive Policy Gradient.
arXiv Detail & Related papers (2023-02-03T13:50:23Z) - The Role of Baselines in Policy Gradient Optimization [83.42050606055822]
We show that the state value baseline allows on-policy natural policy gradient (NPG) to converge to a globally optimal policy at an $O(1/t)$ rate.
We find that the primary effect of the value baseline is to reduce the aggressiveness of the updates rather than their variance.
arXiv Detail & Related papers (2023-01-16T06:28:00Z) - An Improved Analysis of (Variance-Reduced) Policy Gradient and Natural
Policy Gradient Methods [40.657905797628786]
We revisit and improve the convergence of policy gradient (PG), natural PG (NPG) methods, and their variance-reduced variants.
We show that a state-of-the-art variance-reduced PG method, which has only been shown to converge to stationary points, converges to the globally optimal value up to some inherent function approximation error.
We have also made variance-reduction for NPG possible, with both global convergence and an efficient finite-sample complexity.
arXiv Detail & Related papers (2022-11-15T06:47:06Z) - Beyond Exact Gradients: Convergence of Stochastic Soft-Max Policy Gradient Methods with Entropy Regularization [20.651913793555163]
We revisit the classical entropy regularized policy gradient methods with the soft-max policy parametrization.
We establish a global optimality convergence result and a sample complexity of $\widetilde{\mathcal{O}}(\frac{1}{\epsilon^2})$ for the proposed algorithm.
arXiv Detail & Related papers (2021-10-19T17:21:09Z) - A general sample complexity analysis of vanilla policy gradient [101.16957584135767]
Policy gradient (PG) is one of the most popular methods for solving reinforcement learning (RL) problems.
Yet a solid theoretical understanding of even the "vanilla" PG method has remained limited.
arXiv Detail & Related papers (2021-07-23T19:38:17Z) - Cautiously Optimistic Policy Optimization and Exploration with Linear
Function Approximation [48.744735294559824]
Policy optimization methods are popular reinforcement learning algorithms because their incremental and on-policy nature makes them more stable than their value-based counterparts.
In this paper, we propose a new algorithm, COPOE, that overcomes the sample complexity issue of PCPG while retaining its robustness to model misspecification.
The result is an improvement in sample complexity from $\widetilde{O}(1/\epsilon^{11})$ for PCPG to $\widetilde{O}(1/\epsilon^3)$ for COPOE, nearly bridging the gap with value-based techniques.
arXiv Detail & Related papers (2021-03-24T01:42:59Z) - Softmax Policy Gradient Methods Can Take Exponential Time to Converge [60.98700344526674]
The softmax policy gradient (PG) method is arguably one of the de facto implementations of policy optimization in modern reinforcement learning.
We demonstrate that softmax PG methods can take time exponential in $|\mathcal{S}|$ and $\frac{1}{1-\gamma}$ to converge.
arXiv Detail & Related papers (2021-02-22T18:56:26Z) - On the Convergence and Sample Efficiency of Variance-Reduced Policy
Gradient Method [38.34416337932712]
Policy gradient gives rise to a rich class of reinforcement learning (RL) methods, for example REINFORCE.
Yet the best known sample complexity result for such methods to find an $\epsilon$-optimal policy is $\mathcal{O}(\epsilon^{-3})$, which is suboptimal.
We study the fundamental convergence properties and sample efficiency of first-order policy optimization methods.
arXiv Detail & Related papers (2021-02-17T07:06:19Z) - Momentum-Based Policy Gradient Methods [133.53164856723782]
We propose a class of efficient momentum-based policy gradient methods for model-free reinforcement learning.
In particular, we present a non-adaptive version of the IS-MBPG method, which also reaches the best known sample complexity of $O(\epsilon^{-3})$ without any large batches.
arXiv Detail & Related papers (2020-07-13T20:44:15Z) - Zeroth-order Deterministic Policy Gradient [116.87117204825105]
We introduce Zeroth-order Deterministic Policy Gradient (ZDPG).
ZDPG approximates policy-reward gradients via two-point evaluations of the $Q$-function (a toy two-point estimator is sketched after this list).
New finite sample complexity bounds for ZDPG improve upon existing results by up to two orders of magnitude.
arXiv Detail & Related papers (2020-06-12T16:52:29Z)
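As a companion to the ZDPG entry above, the sketch below shows a generic two-point zeroth-order gradient estimator applied to a stand-in smooth objective. The quadratic "return", observation noise, smoothing radius, and step size are assumptions made for illustration; this is not the paper's algorithm or its hyperparameters.

```python
import numpy as np

# Two-point zeroth-order gradient ascent on a stand-in "return" function.
rng = np.random.default_rng(1)
dim = 5
theta_star = rng.normal(size=dim)         # unknown optimal parameters (assumed)

def ret(theta):
    """Stand-in for a noisy evaluation of the expected return at theta."""
    return -np.sum((theta - theta_star) ** 2) + 0.001 * rng.normal()

theta = np.zeros(dim)
mu, step = 1e-2, 0.05                     # smoothing radius and step size (illustrative)

for t in range(3000):
    u = rng.normal(size=dim)
    u /= np.linalg.norm(u)                # random direction on the unit sphere
    # Two paired evaluations give a directional-derivative estimate, rescaled to a full gradient.
    g = dim * (ret(theta + mu * u) - ret(theta - mu * u)) / (2.0 * mu) * u
    theta = theta + step * g              # ascent step on the estimated gradient

print("distance to optimum:", round(float(np.linalg.norm(theta - theta_star)), 4))
```

Each iteration perturbs the parameters along a random unit direction and differences the two evaluations, which recovers the gradient in expectation (up to a smoothing bias); this paired-evaluation idea is the core mechanism behind two-point zeroth-order policy search.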