A Hybrid Stochastic Policy Gradient Algorithm for Reinforcement Learning
- URL: http://arxiv.org/abs/2003.00430v2
- Date: Mon, 21 Sep 2020 21:23:14 GMT
- Title: A Hybrid Stochastic Policy Gradient Algorithm for Reinforcement Learning
- Authors: Nhan H. Pham, Lam M. Nguyen, Dzung T. Phan, Phuong Ha Nguyen, Marten
van Dijk and Quoc Tran-Dinh
- Abstract summary: We develop a new Proximal Hybrid Stochastic Policy Gradient Algorithm (ProxHSPGA).
We prove that both algorithms can achieve the best-known trajectory complexity $\mathcal{O}\left(\varepsilon^{-3}\right)$.
We evaluate the performance of our algorithm on several well-known examples in reinforcement learning.
- Score: 32.91450388566405
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel hybrid stochastic policy gradient estimator by combining
an unbiased policy gradient estimator, the REINFORCE estimator, with another
biased one, an adapted SARAH estimator for policy optimization. The hybrid
policy gradient estimator is shown to be biased, but it has a variance-reduction
property. Using this estimator, we develop a new Proximal Hybrid Stochastic
Policy Gradient Algorithm (ProxHSPGA) to solve a composite policy optimization
problem that allows us to handle constraints or regularizers on the policy
parameters. We first propose a single-loop algorithm and then introduce a more
practical restarting variant. We prove that both algorithms can achieve the
best-known trajectory complexity $\mathcal{O}\left(\varepsilon^{-3}\right)$ to
attain a first-order stationary point for the composite problem, which is better
than the existing REINFORCE/GPOMDP $\mathcal{O}\left(\varepsilon^{-4}\right)$ and
SVRPG $\mathcal{O}\left(\varepsilon^{-10/3}\right)$ in the non-composite
setting. We evaluate the performance of our algorithm on several well-known
examples in reinforcement learning. Numerical results show that our algorithm
outperforms two existing methods on these examples. Moreover, the composite
setting indeed has some advantages over the non-composite one on certain
problems.
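The hybrid estimator and the proximal step described above admit a compact update rule. The following is a minimal sketch of that idea in Python: an estimator that mixes the unbiased REINFORCE gradient with a SARAH-style recursive correction (re-weighted by importance weights), followed by a proximal gradient-ascent step for a composite objective. The toy one-step Gaussian policy, the $\ell_1$ regularizer, the constants, and the helper names (`sample_trajectories`, `hybrid_estimator`, `prox_l1`) are illustrative assumptions, not the paper's implementation or experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 5  # parameter dimension of a toy Gaussian policy (assumption)

def sample_trajectories(theta, n):
    """Toy one-step 'trajectories': contexts x, actions a ~ N(x @ theta, 1), rewards r."""
    x = rng.normal(size=(n, DIM))
    x[:, 0] = 1.0                          # constant feature so the optimum is non-trivial
    a = x @ theta + rng.normal(size=n)
    r = -(a - 1.0) ** 2                    # reward peaks when the action equals 1
    return x, a, r

def reinforce_grad(theta, x, a, r):
    """Unbiased REINFORCE estimator: mean of grad log pi(a|x; theta) * r."""
    score = (a - x @ theta)[:, None] * x   # score function of the unit-variance Gaussian policy
    return (score * r[:, None]).mean(axis=0)

def importance_weight(theta_prev, theta, x, a):
    """Per-trajectory weight pi(a|x; theta_prev) / pi(a|x; theta)."""
    return np.exp(-0.5 * ((a - x @ theta_prev) ** 2 - (a - x @ theta) ** 2))

def hybrid_estimator(theta, theta_prev, v_prev, beta, batch):
    """v_t = beta * REINFORCE(theta_t) + (1 - beta) * [v_{t-1} + SARAH-style difference]."""
    x, a, r = batch
    g_now = reinforce_grad(theta, x, a, r)
    w = importance_weight(theta_prev, theta, x, a)
    score_prev = (a - x @ theta_prev)[:, None] * x
    g_prev = (w[:, None] * score_prev * r[:, None]).mean(axis=0)   # re-weighted gradient at theta_{t-1}
    return beta * g_now + (1.0 - beta) * (v_prev + g_now - g_prev)

def prox_l1(z, lam):
    """Proximal operator of lam * ||.||_1 (an example regularizer for the composite setting)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Single-loop iteration in the spirit of ProxHSPGA: ascend on expected reward, then apply the prox.
theta = np.zeros(DIM)
theta_prev = theta.copy()
v = reinforce_grad(theta, *sample_trajectories(theta, 64))   # initial estimate from a larger batch
eta, beta, lam = 0.05, 0.5, 1e-3                             # illustrative constants
for t in range(200):
    batch = sample_trajectories(theta, 16)
    v = hybrid_estimator(theta, theta_prev, v, beta, batch)
    theta_prev, theta = theta, prox_l1(theta + eta * v, eta * lam)
print("learned parameters:", np.round(theta, 3))
```

The mixing weight `beta` trades the unbiased REINFORCE term against the recursive variance-reduced term; setting `beta = 1` recovers a plain proximal REINFORCE update, while smaller values lean on the SARAH-style correction.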
Related papers
- Low-Switching Policy Gradient with Exploration via Online Sensitivity
Sampling [23.989009116398208]
We design a low-switching sample-efficient policy optimization algorithm, LPO, with general non-linear function approximation.
We show that our algorithm obtains an $\varepsilon$-optimal policy with only $\widetilde{O}\left(\frac{\text{poly}(d)}{\varepsilon^3}\right)$ samples.
arXiv Detail & Related papers (2023-06-15T23:51:46Z) - Stochastic Policy Gradient Methods: Improved Sample Complexity for
Fisher-non-degenerate Policies [19.779044926914704]
We develop improved global convergence guarantees for a general class of Fisher-non-degenerate parameterized policies.
In this work, we propose a Normalized Policy Gradient method with Implicit Gradient Transport (N-PG-IGT) and derive a $\tilde{\mathcal{O}}(\varepsilon^{-2.5})$ sample complexity for this method.
We further improve this complexity to $\tilde{\mathcal{O}}(\varepsilon^{-2})$ by considering a Hessian-Aided Recursive Policy Gradient.
arXiv Detail & Related papers (2023-02-03T13:50:23Z) - Stochastic Dimension-reduced Second-order Methods for Policy
Optimization [11.19708535159457]
We propose several new second-order algorithms for policy optimization that only require gradient and Hessian-vector product in each iteration.
Specifically, we propose a dimension-reduced second-order method (DR-SOPO) which repeatedly solves a projected two-dimensional trust-region subproblem (a toy sketch of such a step is given after this list).
We show that DR-SOPO obtains an $\mathcal{O}(\epsilon^{-3.5})$ complexity for reaching an approximate first-order stationary condition.
In addition, we present an enhanced algorithm (DVR-SOPO) which further improves the complexity to $\mathcal{O}$
arXiv Detail & Related papers (2023-01-28T12:09:58Z) - Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time
Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z) - Momentum Accelerates the Convergence of Stochastic AUPRC Maximization [80.8226518642952]
We study optimization of areas under precision-recall curves (AUPRC), which are widely used for imbalanced tasks.
We develop novel momentum methods with a better iteration complexity of $O(1/\epsilon^4)$ for finding an $\epsilon$-stationary solution.
We also design a novel family of adaptive methods with the same complexity of $O(1/\epsilon^4)$, which enjoy faster convergence in practice.
arXiv Detail & Related papers (2021-07-02T16:21:52Z) - Bregman Gradient Policy Optimization [97.73041344738117]
We design a Bregman gradient policy optimization for reinforcement learning based on Bregman divergences and momentum techniques.
VR-BGPO reaches the best complexity $\tilde{O}(\epsilon^{-3})$ for finding an $\epsilon$-stationary point, requiring only one trajectory at each iteration.
arXiv Detail & Related papers (2021-06-23T01:08:54Z) - Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement learning based ZO algorithm (ZO-RL) that learns the sampling policy for generating the perturbations in ZO optimization instead of using random sampling.
Our results show that ZO-RL can effectively reduce the variance of the ZO gradient by learning a sampling policy, and converges faster than existing ZO algorithms in different scenarios.
arXiv Detail & Related papers (2021-04-09T14:50:59Z) - On the Convergence and Sample Efficiency of Variance-Reduced Policy
Gradient Method [38.34416337932712]
Policy gradient gives rise to a rich class of reinforcement learning (RL) methods, for example REINFORCE.
Yet the best known sample complexity result for such methods to find an $\epsilon$-optimal policy is $\mathcal{O}(\epsilon^{-3})$, which is suboptimal.
We study the fundamental convergence properties and sample efficiency of first-order policy optimization methods.
arXiv Detail & Related papers (2021-02-17T07:06:19Z) - Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
Bootstrapping is necessary and, along with off-policy learning and function approximation, results in the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2021-01-08T00:43:04Z) - Private Stochastic Non-Convex Optimization: Adaptive Algorithms and
Tighter Generalization Bounds [72.63031036770425]
We propose differentially private (DP) algorithms for stochastic non-convex optimization.
We demonstrate, on two popular deep learning tasks, the empirical advantages over standard gradient methods.
arXiv Detail & Related papers (2020-06-24T06:01:24Z)
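As referenced in the dimension-reduced second-order entry above, the sketch below illustrates, under stated assumptions, what one "projected two-dimensional trust-region" step can look like: the quadratic model is restricted to a two-dimensional subspace (here spanned by the gradient and one Hessian-vector product direction), and the resulting tiny subproblem is solved by brute force. The subspace choice, the boundary grid search, the quadratic test objective, and the names (`two_dim_tr_step`, `hess_vec`) are illustrative assumptions, not the DR-SOPO authors' implementation.

```python
import numpy as np

def two_dim_tr_step(grad_f, hess_vec, x, radius=0.5, n_angles=720):
    """One projected 2-D trust-region step: restrict the model to span{g, Hg} and minimize it on a ball."""
    g = grad_f(x)
    v1 = g / (np.linalg.norm(g) + 1e-12)              # first basis vector: gradient direction
    hv = hess_vec(x, v1)
    v2 = hv - (hv @ v1) * v1                          # orthogonalize the Hessian-vector direction
    v2 = v2 / (np.linalg.norm(v2) + 1e-12)
    V = np.stack([v1, v2], axis=1)                    # n x 2 (approximately) orthonormal basis
    gr = V.T @ g                                      # reduced gradient, shape (2,)
    Br = V.T @ np.column_stack([hess_vec(x, V[:, 0]), hess_vec(x, V[:, 1])])
    Br = 0.5 * (Br + Br.T)                            # reduced, symmetrized Hessian, shape (2, 2)

    model = lambda y: gr @ y + 0.5 * y @ Br @ y       # quadratic model in reduced coordinates
    candidates = []
    try:                                              # interior candidate: unconstrained minimizer
        y_newton = np.linalg.solve(Br, -gr)
        if np.linalg.norm(y_newton) <= radius:
            candidates.append(y_newton)
    except np.linalg.LinAlgError:
        pass
    angles = np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False)
    boundary = radius * np.stack([np.cos(angles), np.sin(angles)], axis=1)
    candidates.extend(list(boundary))                 # boundary candidates: dense angular grid (2-D only)
    y_best = min(candidates, key=model)
    return x + V @ y_best

# Toy usage on a strongly convex quadratic f(x) = 0.5 * x' A x - b' x (an assumption for illustration).
rng = np.random.default_rng(1)
A = rng.normal(size=(20, 20))
A = A @ A.T + np.eye(20)
b = rng.normal(size=20)
grad_f = lambda x: A @ x - b
hess_vec = lambda x, v: A @ v
x = np.zeros(20)
for _ in range(50):
    x = two_dim_tr_step(grad_f, hess_vec, x)
print("gradient norm after 50 steps:", np.linalg.norm(grad_f(x)))
```

The appeal of this construction is that each iteration needs only a gradient and a few Hessian-vector products, while the subproblem itself stays two-dimensional and cheap to solve accurately.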