Homotopic Policy Mirror Descent: Policy Convergence, Implicit
Regularization, and Improved Sample Complexity
- URL: http://arxiv.org/abs/2201.09457v3
- Date: Thu, 27 Jan 2022 17:51:12 GMT
- Title: Homotopic Policy Mirror Descent: Policy Convergence, Implicit
Regularization, and Improved Sample Complexity
- Authors: Yan Li, Tuo Zhao, Guanghui Lan
- Abstract summary: We propose the homotopic policy mirror descent (HPMD) method for solving discounted, infinite horizon MDPs with finite state and action space, and study its policy convergence.
We report three properties that seem to be new in the literature of policy gradient methods.
- Score: 40.2022466644885
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We propose the homotopic policy mirror descent (HPMD) method for solving
discounted, infinite horizon MDPs with finite state and action space, and study
its policy convergence. We report three properties that seem to be new in the
literature of policy gradient methods: (1) The policy first converges linearly,
then superlinearly with order $\gamma^{-2}$ to the set of optimal policies,
after $\mathcal{O}(\log(1/\Delta^*))$ number of iterations, where $\Delta^*$ is
defined via a gap quantity associated with the optimal state-action value
function; (2) HPMD also exhibits last-iterate convergence, with the limiting
policy corresponding exactly to the optimal policy with the maximal entropy for
every state. No regularization is added to the optimization objective and hence
the second observation arises solely as an algorithmic property of the
homotopic policy gradient method. (3) For the stochastic HPMD method, we
further demonstrate a better than $\mathcal{O}(|\mathcal{S}| |\mathcal{A}| /
\epsilon^2)$ sample complexity for small optimality gap $\epsilon$, when
assuming a generative model for policy evaluation.
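One way to read the HPMD update is as a KL-proximal (mirror-descent) step whose entropy-style perturbation is driven to zero over the iterations. The sketch below implements that reading on a tabular MDP with exact policy evaluation; the closed-form update, the geometric schedules for the stepsize and the perturbation, and the random MDP used in the usage comment are illustrative assumptions, not the exact recursion analyzed in the paper.

```python
import numpy as np

def _softmax_rows(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def policy_eval(P, r, pi, gamma):
    """Exact policy evaluation on a tabular MDP.
    P: (S, A, S) transition tensor, r: (S, A) rewards, pi: (S, A) policy."""
    S, A = r.shape
    P_pi = np.einsum('sap,sa->sp', P, pi)             # state transitions under pi
    r_pi = np.einsum('sa,sa->s', r, pi)               # expected one-step reward under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return r + gamma * np.einsum('sap,p->sa', P, V)   # Q^pi

def hpmd_sketch(P, r, gamma=0.9, iters=50, eta0=1.0, tau0=1.0):
    """KL-proximal policy updates with a geometrically increasing stepsize
    eta_k and a perturbation tau_k that vanishes even faster, so the
    shrinkage toward the uniform policy fades out. This is an assumed
    stand-in for HPMD, not the paper's exact recursion."""
    S, A = r.shape
    log_pi = np.full((S, A), -np.log(A))              # start from the uniform policy
    for k in range(iters):
        eta = eta0 * 2.0 ** k                         # increasing stepsize
        tau = tau0 * 4.0 ** (-k)                      # vanishing perturbation, eta * tau -> 0
        pi = _softmax_rows(log_pi)
        Q = policy_eval(P, r, pi, gamma)
        # Closed-form KL-prox step; dividing by (1 + eta*tau) pulls the new
        # logits toward uniform, which plays the role of the vanishing homotopy.
        log_pi = (log_pi + eta * Q) / (1.0 + eta * tau)
    return _softmax_rows(log_pi)

# Illustrative usage on a small random MDP:
# rng = np.random.default_rng(0)
# P = rng.dirichlet(np.ones(4), size=(4, 3))          # shape (S=4, A=3, S=4)
# r = rng.random((4, 3))                              # shape (S, A)
# print(hpmd_sketch(P, r))
```

Dividing the logits by $1+\eta_k\tau_k$ is what pulls each iterate toward the uniform policy; because $\eta_k\tau_k \to 0$ here, that pull disappears in the limit, loosely mirroring how the vanishing homotopy can break ties in favor of a high-entropy optimal policy without any regularization in the objective.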
Related papers
- Provable Policy Gradient Methods for Average-Reward Markov Potential
Games [15.136494705127564]
We study Markov potential games under the infinite horizon average reward criterion.
We prove that algorithms based on both independent policy gradient and independent natural policy gradient converge globally to a Nash equilibrium under the average reward criterion.
arXiv Detail & Related papers (2024-03-09T00:20:33Z) - Last-Iterate Convergent Policy Gradient Primal-Dual Methods for
Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z) - Linear Convergence of Natural Policy Gradient Methods with Log-Linear
Policies [115.86431674214282]
We consider infinite-horizon discounted Markov decision processes and study the convergence rates of the natural policy gradient (NPG) and the Q-NPG methods with the log-linear policy class.
We show that both methods attain linear convergence rates and $\mathcal{O}(1/\epsilon^2)$ sample complexities using a simple, non-adaptive, geometrically increasing step size.
arXiv Detail & Related papers (2022-10-04T06:17:52Z) - Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes in the fixed confidence setting under a generative model.
The lower bound, obtained as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z) - Understanding the Effect of Stochasticity in Policy Optimization [86.7574122154668]
We show that the preferability of optimization methods depends critically on whether exact gradients are used.
Second, to explain these findings we introduce the concept of committal rate for policy optimization.
Third, we show that in the absence of external oracle information, there is an inherent trade-off between exploiting geometry to accelerate convergence versus achieving optimality almost surely.
arXiv Detail & Related papers (2021-10-29T06:35:44Z) - Softmax Policy Gradient Methods Can Take Exponential Time to Converge [60.98700344526674]
The softmax policy gradient (PG) method is arguably one of the de facto implementations of policy optimization in modern reinforcement learning.
We demonstrate that softmax PG methods can take exponential time -- in terms of $|\mathcal{S}|$ and $\frac{1}{1-\gamma}$ -- to converge.
arXiv Detail & Related papers (2021-02-22T18:56:26Z) - On the Convergence and Sample Efficiency of Variance-Reduced Policy
Gradient Method [38.34416337932712]
Policy gradient gives rise to a rich class of reinforcement learning (RL) methods, for example REINFORCE.
Yet the best known sample complexity result for such methods to find an $\epsilon$-optimal policy is $\mathcal{O}(\epsilon^{-3})$, which is suboptimal.
We study the fundamental convergence properties and sample efficiency of first-order policy optimization methods.
arXiv Detail & Related papers (2021-02-17T07:06:19Z) - On the Global Convergence Rates of Softmax Policy Gradient Methods [45.1868906130788]
We show that with the true gradient, policy gradient with a softmax parametrization converges at a $O(1/t)$ rate.
Second, we analyze entropy regularized policy gradient and show that it enjoys a significantly faster linear $O(e^{-c \cdot t})$ convergence rate.
Third, we explain how entropy regularization improves policy optimization, even with the true gradient (a toy bandit sketch follows below).
arXiv Detail & Related papers (2020-05-13T16:01:39Z)
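As a quick illustration of the effect described in the last entry above, the snippet below runs true-gradient softmax policy gradient on a three-armed bandit, with and without an entropy bonus. The bandit rewards, stepsize, and regularization weight are arbitrary illustrative choices, not taken from that paper.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def softmax_pg(r, lam=0.0, eta=0.4, steps=2000):
    """True-gradient softmax policy gradient on a one-state (bandit) problem.
    lam > 0 adds an entropy bonus lam * H(pi); lam = 0 is plain softmax PG."""
    theta = np.zeros_like(r, dtype=float)
    for _ in range(steps):
        pi = softmax(theta)
        r_soft = r - lam * np.log(pi)                 # reward plus entropy contribution
        theta += eta * pi * (r_soft - pi @ r_soft)    # exact gradient ascent step
    return softmax(theta)

r = np.array([1.0, 0.9, 0.1])
print(softmax_pg(r, lam=0.0))    # plain PG: approaches the best arm only sublinearly
print(softmax_pg(r, lam=0.05))   # entropy bonus: converges linearly, to a softened optimum
```

With the entropy bonus the iterates head to the regularized optimum (a softmax of the rewards scaled by the regularization weight) rather than the greedy arm, which is the usual trade-off when entropy is added to the objective.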