Bregman Gradient Policy Optimization
- URL: http://arxiv.org/abs/2106.12112v1
- Date: Wed, 23 Jun 2021 01:08:54 GMT
- Title: Bregman Gradient Policy Optimization
- Authors: Feihu Huang, Shangqian Gao, Heng Huang
- Abstract summary: We design a Bregman gradient policy optimization framework for reinforcement learning based on Bregman divergences and momentum techniques.
VR-BGPO reaches the best known complexity $\tilde{O}(\epsilon^{-3})$ for finding an $\epsilon$-stationary point, requiring only one trajectory at each iteration.
- Score: 97.73041344738117
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this paper, we design a novel Bregman gradient policy optimization
framework for reinforcement learning based on Bregman divergences and momentum
techniques. Specifically, we propose a Bregman gradient policy optimization
(BGPO) algorithm based on the basic momentum technique and mirror descent
iteration. At the same time, we present an accelerated Bregman gradient policy
optimization (VR-BGPO) algorithm based on a momentum variance-reduced
technique. Moreover, we introduce a convergence analysis framework for our
Bregman gradient policy optimization under the nonconvex setting. Specifically,
we prove that BGPO achieves the sample complexity of $\tilde{O}(\epsilon^{-4})$
for finding an $\epsilon$-stationary point, requiring only one trajectory at each
iteration, and VR-BGPO reaches the best known sample complexity of
$\tilde{O}(\epsilon^{-3})$ for finding an $\epsilon$-stationary point, which
also only requires one trajectory at each iteration. In particular, by using
different Bregman divergences, our methods unify many existing policy
optimization algorithms and their new variants such as the existing
(variance-reduced) policy gradient algorithms and (variance-reduced) natural
policy gradient algorithms. Extensive experimental results on multiple
reinforcement learning tasks demonstrate the efficiency of our new algorithms.
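A minimal sketch of the update structure described above, under assumptions not taken from the paper: a single-trajectory REINFORCE/GPOMDP gradient estimate, a basic momentum average, and a mirror-ascent step with a squared-Euclidean mirror map, so the Bregman step reduces to momentum gradient ascent. The callables sample_trajectory and grad_log_pi are hypothetical placeholders for an environment rollout and the policy's score function; this illustrates the idea, not the authors' exact pseudocode.

    import numpy as np

    def reinforce_grad(theta, sample_trajectory, grad_log_pi, gamma=0.99):
        # Single-trajectory policy gradient estimate (REINFORCE/GPOMDP form).
        states, actions, rewards = sample_trajectory(theta)
        grad = np.zeros_like(theta)
        score_sum = np.zeros_like(theta)
        for t, (s, a, r) in enumerate(zip(states, actions, rewards)):
            score_sum += grad_log_pi(theta, s, a)   # running sum of score functions
            grad += (gamma ** t) * r * score_sum    # discounted-reward-weighted score
        return grad

    def bgpo_step(theta, momentum, sample_trajectory, grad_log_pi, alpha=0.1, eta=0.01):
        # One BGPO-style iteration: momentum-averaged gradient, then a mirror-ascent step
        #   theta_{t+1} = argmax_theta <momentum, theta> - (1/eta) * D_psi(theta, theta_t).
        # With psi(x) = 0.5 * ||x||^2 the Bregman divergence is 0.5 * ||theta - theta_t||^2
        # and the argmax has the closed form used below (momentum gradient ascent).
        g = reinforce_grad(theta, sample_trajectory, grad_log_pi)
        momentum = (1.0 - alpha) * momentum + alpha * g   # basic momentum average
        theta = theta + eta * momentum                    # Euclidean mirror-ascent step
        return theta, momentum

Choosing a different mirror map, for example one induced by the Fisher information matrix, would turn the same skeleton into a natural-policy-gradient-style update, which is how the abstract's unification claim would appear in a sketch like this.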
Related papers
- Low-Switching Policy Gradient with Exploration via Online Sensitivity
Sampling [23.989009116398208]
We design a low-switching sample-efficient policy optimization algorithm, LPO, with general non-linear function approximation.
We show that our algorithm obtains an $\varepsilon$-optimal policy with only $\widetilde{O}(\frac{\text{poly}(d)}{\varepsilon^3})$ samples.
arXiv Detail & Related papers (2023-06-15T23:51:46Z) - Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time
Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z) - Policy Optimization for Stochastic Shortest Path [43.2288319750466]
We study policy optimization for the stochastic shortest path (SSP) problem.
We propose a goal-oriented reinforcement learning model that strictly generalizes the finite-horizon model.
For most settings, our algorithm is shown to achieve a near-optimal regret bound.
arXiv Detail & Related papers (2022-02-07T16:25:14Z) - Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement learning based zeroth-order (ZO) algorithm, ZO-RL, which learns the sampling policy for generating the perturbations in ZO optimization instead of using random sampling.
Our results show that ZO-RL can effectively reduce the variance of the ZO gradient estimate by learning a sampling policy, and converges faster than existing ZO algorithms in different scenarios.
arXiv Detail & Related papers (2021-04-09T14:50:59Z) - On the Convergence and Sample Efficiency of Variance-Reduced Policy
Gradient Method [38.34416337932712]
Policy gradient gives rise to a rich class of reinforcement learning (RL) methods, for example REINFORCE.
Yet the best known sample complexity result for such methods to find an $\epsilon$-optimal policy is $\mathcal{O}(\epsilon^{-3})$, which is suboptimal.
We study the fundamental convergence properties and sample efficiency of first-order policy optimization methods.
arXiv Detail & Related papers (2021-02-17T07:06:19Z) - Momentum-Based Policy Gradient Methods [133.53164856723782]
We propose a class of efficient momentum-based policy gradient methods for model-free reinforcement learning.
In particular, we present a non-adaptive version of the IS-MBPG method, which also reaches the best known sample complexity of $O(\epsilon^{-3})$ without any large batches.
arXiv Detail & Related papers (2020-07-13T20:44:15Z) - Stochastic Recursive Momentum for Policy Gradient Methods [28.277961340108313]
We propose a novel algorithm named STOchastic Recursive Momentum for Policy Gradient (Storm-PG); a generic sketch of this style of recursive-momentum estimator is given after this list.
Storm-PG enjoys a provably sharp $O(1/\epsilon^3)$ sample complexity bound, matching the best-known convergence rate for policy gradient algorithms.
Numerical experiments depict the superiority of our algorithm over competing policy gradient algorithms.
arXiv Detail & Related papers (2020-03-09T17:59:03Z) - A Hybrid Stochastic Policy Gradient Algorithm for Reinforcement Learning [32.91450388566405]
We develop a new Proximal Hybrid Policy Gradient Algorithm (ProxHSPGA)
We prove that both algorithms can achieve the best-known trajectory complexity $\mathcal{O}(\varepsilon^{-4})$
We evaluate the performance of our algorithm on several well-known examples in reinforcement learning.
arXiv Detail & Related papers (2020-03-01T07:45:51Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of the Optimistic Adagrad algorithm for nonconvex-nonconcave min-max problems.
Our experiments show that the advantage of adaptive gradient algorithms over non-adaptive ones in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z) - Provably Efficient Exploration in Policy Optimization [117.09887790160406]
This paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO)
OPPO achieves $\tilde{O}(\sqrt{d^2 H^3 T})$ regret.
To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.
arXiv Detail & Related papers (2019-12-12T08:40:02Z)
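Several of the entries above (Momentum-Based Policy Gradient Methods, Stochastic Recursive Momentum for Policy Gradient) build on a STORM-style recursive-momentum, variance-reduced gradient estimator. The sketch below shows that estimator only in its generic stochastic-optimization form; the policy-gradient versions in those papers additionally apply importance-sampling corrections between consecutive policies, which are omitted here, and the oracle stochastic_grad and the sampler sample are hypothetical placeholders.

    def storm_step(x, x_prev, d_prev, stochastic_grad, sample, a=0.1, lr=0.01):
        # One recursive-momentum (STORM-style) update:
        #   d_t = g(x_t; xi_t) + (1 - a) * (d_{t-1} - g(x_{t-1}; xi_t)),
        # evaluating the same fresh sample xi_t at the current and previous iterates.
        xi = sample()                          # draw one fresh sample / trajectory
        g_new = stochastic_grad(x, xi)         # stochastic gradient at the current point
        g_old = stochastic_grad(x_prev, xi)    # stochastic gradient at the previous point
        d = g_new + (1.0 - a) * (d_prev - g_old)
        x_next = x - lr * d                    # plain (Euclidean) descent step
        return x_next, x, d

With a = 1 this reduces to a vanilla stochastic gradient step; smaller a reuses past gradient information, which is the variance-reduction mechanism behind the $\tilde{O}(\epsilon^{-3})$-type rates cited above.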