On the Global Convergence Rates of Softmax Policy Gradient Methods
- URL: http://arxiv.org/abs/2005.06392v3
- Date: Thu, 2 Jun 2022 06:16:17 GMT
- Title: On the Global Convergence Rates of Softmax Policy Gradient Methods
- Authors: Jincheng Mei, Chenjun Xiao, Csaba Szepesvari, Dale Schuurmans
- Abstract summary: We show that with the true gradient, policy gradient with a softmax parametrization converges at a $O(e-c cdot t)$ rate.
Second, we analyze entropy regularized policy gradient and show that it enjoys a significantly faster linear convergence rate.
Third, we explain how entropy regularization improves policy optimization, even with the true gradient.
- Score: 45.1868906130788
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We make three contributions toward better understanding policy gradient
methods in the tabular setting. First, we show that with the true gradient,
policy gradient with a softmax parametrization converges at a $O(1/t)$ rate,
with constants depending on the problem and initialization. This result
significantly expands the recent asymptotic convergence results. The analysis
relies on two findings: that the softmax policy gradient satisfies a
\L{}ojasiewicz inequality, and the minimum probability of an optimal action
during optimization can be bounded in terms of its initial value. Second, we
analyze entropy regularized policy gradient and show that it enjoys a
significantly faster linear convergence rate $O(e^{-c \cdot t})$ toward softmax
optimal policy $(c > 0)$. This result resolves an open question in the recent
literature. Finally, combining the above two results and additional new
$\Omega(1/t)$ lower bound results, we explain how entropy regularization
improves policy optimization, even with the true gradient, from the perspective
of convergence rate. The separation of rates is further explained using the
notion of non-uniform \L{}ojasiewicz degree. These results provide a
theoretical understanding of the impact of entropy and corroborate existing
empirical studies.
Related papers
- Provable Policy Gradient Methods for Average-Reward Markov Potential
Games [15.136494705127564]
We study Markov potential games under the infinite horizon average reward criterion.
We prove that both algorithms based on independent policy gradient and independent natural policy gradient converge globally to a Nash equilibrium for the average reward criterion.
arXiv Detail & Related papers (2024-03-09T00:20:33Z) - The Role of Baselines in Policy Gradient Optimization [83.42050606055822]
We show that the emphstate value baseline allows on-policy.
emphnatural policy gradient (NPG) to converge to a globally optimal.
policy at an $O (1/t) rate gradient.
We find that the primary effect of the value baseline is to textbfreduce the aggressiveness of the updates rather than their variance.
arXiv Detail & Related papers (2023-01-16T06:28:00Z) - Min-Max Optimization Made Simple: Approximating the Proximal Point
Method via Contraction Maps [77.8999425439444]
We present a first-order method that admits near-optimal convergence rates for convex/concave min-max problems.
Our work is based on the fact that the update rule of the Proximal Point method can be approximated up to accuracy.
arXiv Detail & Related papers (2023-01-10T12:18:47Z) - Convergence of First-Order Methods for Constrained Nonconvex
Optimization with Dependent Data [7.513100214864646]
We show the worst-case complexity of convergence $tildeO(t-1/4)$ and MoreautildeO(vareps-4)$ for smooth non- optimization.
We obtain first online nonnegative matrix factorization algorithms for dependent data based on projected gradient methods with adaptive step sizes and optimal convergence.
arXiv Detail & Related papers (2022-03-29T17:59:10Z) - Homotopic Policy Mirror Descent: Policy Convergence, Implicit
Regularization, and Improved Sample Complexity [40.2022466644885]
homotopic policy mirror descent (HPMD) method for solving discounted, infinite horizon MDPs with finite state and action space.
We report three properties that seem to be new in the literature of policy gradient methods.
arXiv Detail & Related papers (2022-01-24T04:54:58Z) - Understanding the Effect of Stochasticity in Policy Optimization [86.7574122154668]
We show that the preferability of optimization methods depends critically on whether exact gradients are used.
Second, to explain these findings we introduce the concept of committal rate for policy optimization.
Third, we show that in the absence of external oracle information, there is an inherent trade-off between exploiting geometry to accelerate convergence versus achieving optimality almost surely.
arXiv Detail & Related papers (2021-10-29T06:35:44Z) - Softmax Policy Gradient Methods Can Take Exponential Time to Converge [60.98700344526674]
The softmax policy gradient (PG) method is arguably one of the de facto implementations of policy optimization in modern reinforcement learning.
We demonstrate that softmax PG methods can take exponential time -- in terms of $mathcalS|$ and $frac11-gamma$ -- to converge.
arXiv Detail & Related papers (2021-02-22T18:56:26Z) - On the Convergence and Sample Efficiency of Variance-Reduced Policy
Gradient Method [38.34416337932712]
Policy gives rise to a rich class of reinforcement learning (RL) methods, for example the REINFORCE.
Yet the best known sample complexity result for such methods to find an $epsilon$-optimal policy is $mathcalO(epsilon-3)$, which is suboptimal.
We study the fundamental convergence properties and sample efficiency of first-order policy optimization method.
arXiv Detail & Related papers (2021-02-17T07:06:19Z) - Towards Better Understanding of Adaptive Gradient Algorithms in
Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper we analyze a variant of OptimisticOA algorithm for nonconcave minmax problems.
Our experiments show that adaptive GAN non-adaptive gradient algorithms can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.