Stochastic Recursive Momentum for Policy Gradient Methods
- URL: http://arxiv.org/abs/2003.04302v1
- Date: Mon, 9 Mar 2020 17:59:03 GMT
- Title: Stochastic Recursive Momentum for Policy Gradient Methods
- Authors: Huizhuo Yuan, Xiangru Lian, Ji Liu, Yuren Zhou
- Abstract summary: We propose a novel algorithm named STOchastic Recursive Momentum for Policy Gradient (STORM-PG).
STORM-PG enjoys a provably sharp $O(1/\epsilon^3)$ sample complexity bound, matching the best-known convergence rate for policy gradient algorithms.
Numerical experiments demonstrate the superiority of our algorithm over competing policy gradient algorithms.
- Score: 28.277961340108313
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a novel algorithm named STOchastic Recursive
Momentum for Policy Gradient (STORM-PG), which operates a SARAH-type stochastic
recursive variance-reduced policy gradient in an exponential moving average
fashion. STORM-PG enjoys a provably sharp $O(1/\epsilon^3)$ sample complexity
bound, matching the best-known convergence rate for policy gradient
algorithms. Meanwhile, STORM-PG avoids the alternation between large and small
batches that persists in comparable variance-reduced policy gradient methods,
allowing considerably simpler parameter tuning. Numerical experiments
demonstrate the superiority of our algorithm over competing policy gradient
algorithms.
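As a rough illustration of the recursive-momentum estimator described above, the sketch below maintains a single gradient estimate and refreshes it with a SARAH-type correction blended by an exponential moving average. The helper names `sample_trajectory` and `grad_estimator` are illustrative placeholders rather than the paper's interface, and the importance weighting used in the actual STORM-PG estimator is omitted for brevity.

```python
import numpy as np

def storm_pg(sample_trajectory, grad_estimator, theta0,
             step_size=1e-2, momentum=0.1, iterations=1000):
    """STORM-style exponential-moving-average variance-reduced policy gradient loop."""
    theta = np.asarray(theta0, dtype=float).copy()
    traj = sample_trajectory(theta)
    d = grad_estimator(theta, traj)               # plain stochastic gradient at t = 0
    for _ in range(iterations):
        theta_prev = theta
        theta = theta + step_size * d             # ascent step on the expected return
        traj = sample_trajectory(theta)           # one fresh trajectory per iteration
        g_new = grad_estimator(theta, traj)
        g_old = grad_estimator(theta_prev, traj)  # old parameters, same trajectory
        # SARAH-type correction blended by an exponential moving average:
        # d_t = g_t + (1 - a) * (d_{t-1} - g_{t-1}), all on the new sample.
        d = g_new + (1.0 - momentum) * (d - g_old)
    return theta
```

Setting `momentum = 1` recovers a plain stochastic policy gradient step, while smaller values retain more of the recursive correction; the specific schedule that yields the $O(1/\epsilon^3)$ bound is given in the paper.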
Related papers
- Global Convergence of Natural Policy Gradient with Hessian-aided
Momentum Variance Reduction [6.320200835271402]
Natural policy gradient (NPG) and its variants are widely used policy search methods in reinforcement learning.
A new NPG variant, coined NPG-HM, is developed in this paper, which utilizes the Hessian-aided momentum technique for variance reduction.
Experiments on Mujoco-based environments demonstrate the superior performance of NPG-HM over other state-of-the-art policy gradient methods.
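The summary above only names the mechanism, so here is a generic, hedged sketch of what a Hessian-aided momentum estimate followed by a natural gradient step can look like; `sample_grad`, `hessian_vector_product`, and `fisher_matrix` are hypothetical placeholders, and the precise NPG-HM recursion and step sizes are those of the cited paper, not this sketch.

```python
import numpy as np

def npg_hm_step(theta, theta_prev, v_prev, sample_grad, hessian_vector_product,
                fisher_matrix, step_size=1e-2, beta=0.2):
    """One momentum-corrected gradient estimate plus a natural policy gradient step."""
    g = sample_grad(theta)                                    # stochastic policy gradient
    # A Hessian-vector product along the displacement corrects the stale estimate.
    correction = v_prev + hessian_vector_product(theta, theta - theta_prev)
    v = beta * g + (1.0 - beta) * correction                  # momentum blend
    # Precondition by an estimated Fisher information matrix: the "natural" direction.
    direction = np.linalg.solve(fisher_matrix(theta) + 1e-6 * np.eye(theta.size), v)
    return theta + step_size * direction, v
```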
arXiv Detail & Related papers (2024-01-02T07:56:17Z) - Adaptive Policy Learning to Additional Tasks [3.43814540650436]
This paper develops a policy learning method for tuning a pre-trained policy to adapt to additional tasks without altering the original task.
A method named Adaptive Policy Gradient (APG) is proposed in this paper, which combines Bellman's principle of optimality with the policy gradient approach to improve the convergence rate.
arXiv Detail & Related papers (2023-05-24T14:31:11Z) - Anchor-Changing Regularized Natural Policy Gradient for Multi-Objective
Reinforcement Learning [17.916366827429034]
We study policy optimization for Markov decision processes (MDPs) with multiple reward value functions.
We propose an Anchor-changing Regularized Natural Policy Gradient framework, which can incorporate ideas from well-performing first-order methods.
arXiv Detail & Related papers (2022-06-10T21:09:44Z) - PAGE-PG: A Simple and Loopless Variance-Reduced Policy Gradient Method
with Probabilistic Gradient Estimation [6.063525456640462]
We propose a novel loopless variance-reduced policy gradient method based on a probabilistic switch between two types of updates.
We show that our method enjoys an $\mathcal{O}\left(\epsilon^{-3}\right)$ average sample complexity to reach an $\epsilon$-stationary solution.
A numerical evaluation confirms the competitive performance of our method on classical control tasks.
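To make the switch concrete, the sketch below shows the kind of probabilistic update a PAGE-style estimator uses: with small probability it restarts from a large-batch gradient, and otherwise it reuses the previous estimate with a cheap single-sample correction. `batch_grad` and `grad_at` are illustrative placeholders; the paper's policy-gradient estimator (including any importance weighting) differs in its details.

```python
import numpy as np

def page_style_estimate(d_prev, theta, theta_prev, sample, batch_grad, grad_at,
                        switch_prob=0.1, rng=None):
    """One PAGE-style gradient estimate: rare restarts, cheap recursive corrections."""
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() < switch_prob:
        # With small probability, restart from a large-batch (or full) gradient.
        return batch_grad(theta)
    # Otherwise reuse the previous estimate with a SARAH-type correction
    # evaluated at the new and old parameters on the SAME fresh sample.
    return d_prev + grad_at(theta, sample) - grad_at(theta_prev, sample)
```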
arXiv Detail & Related papers (2022-02-01T10:10:49Z) - Bregman Gradient Policy Optimization [97.73041344738117]
We design a Bregman gradient policy optimization for reinforcement learning based on Bregman divergences and momentum techniques.
VR-BGPO reaches the best complexity $\tilde{O}(\epsilon^{-3})$ for finding an $\epsilon$-stationary point, requiring only one trajectory at each iteration.
arXiv Detail & Related papers (2021-06-23T01:08:54Z) - Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement-learning-based ZO algorithm (ZO-RL) that learns the sampling policy for generating perturbations in ZO optimization, instead of using random sampling.
Our results show that ZO-RL can effectively reduce the variance of the ZO gradient by learning a sampling policy, and converges faster than existing ZO algorithms in different scenarios.
arXiv Detail & Related papers (2021-04-09T14:50:59Z) - Softmax Policy Gradient Methods Can Take Exponential Time to Converge [60.98700344526674]
The softmax policy gradient (PG) method is arguably one of the de facto implementations of policy optimization in modern reinforcement learning.
We demonstrate that softmax PG methods can take exponential time -- in terms of $|\mathcal{S}|$ and $\frac{1}{1-\gamma}$ -- to converge.
arXiv Detail & Related papers (2021-02-22T18:56:26Z) - Variance-Reduced Off-Policy Memory-Efficient Policy Search [61.23789485979057]
Off-policy policy optimization is a challenging problem in reinforcement learning.
Off-policy algorithms are memory-efficient and capable of learning from off-policy samples.
arXiv Detail & Related papers (2020-09-14T16:22:46Z) - Fast Global Convergence of Natural Policy Gradient Methods with Entropy
Regularization [44.24881971917951]
Natural policy gradient (NPG) methods are among the most widely used policy optimization algorithms.
We develop convergence guarantees for entropy-regularized NPG methods under softmax parameterization.
Our results accommodate a wide range of learning rates, and shed light upon the role of entropy regularization in enabling fast convergence.
arXiv Detail & Related papers (2020-07-13T17:58:41Z) - Deep Bayesian Quadrature Policy Optimization [100.81242753620597]
Deep Bayesian quadrature policy gradient (DBQPG) is a high-dimensional generalization of Bayesian quadrature for policy gradient estimation.
We show that DBQPG can substitute Monte-Carlo estimation in policy gradient methods, and demonstrate its effectiveness on a set of continuous control benchmarks.
arXiv Detail & Related papers (2020-06-28T15:44:47Z) - Zeroth-order Deterministic Policy Gradient [116.87117204825105]
We introduce Zeroth-order Deterministic Policy Gradient (ZDPG).
ZDPG approximates policy-reward gradients via two-point evaluations of the $Q$-function.
New finite sample complexity bounds for ZDPG improve upon existing results by up to two orders of magnitude.
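As a generic illustration of the two-point idea, not the paper's exact construction, the sketch below forms a finite-difference estimate of the action gradient of a $Q$-function along a random direction; `q_value` is a hypothetical callable and the action is assumed to be a NumPy array.

```python
import numpy as np

def two_point_action_gradient(q_value, state, action, smoothing=1e-2, rng=None):
    """Zeroth-order estimate of d Q(state, a) / d a from two Q evaluations."""
    rng = np.random.default_rng() if rng is None else rng
    action = np.asarray(action, dtype=float)
    u = rng.standard_normal(action.shape)
    u /= np.linalg.norm(u)                        # random unit direction
    q_plus = q_value(state, action + smoothing * u)
    q_minus = q_value(state, action - smoothing * u)
    # Directional finite difference, rescaled to an estimate in the full action space.
    return action.size * (q_plus - q_minus) / (2.0 * smoothing) * u
```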
arXiv Detail & Related papers (2020-06-12T16:52:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.