Homotopic Policy Mirror Descent: Policy Convergence, Implicit
Regularization, and Improved Sample Complexity
- URL: http://arxiv.org/abs/2201.09457v3
- Date: Thu, 27 Jan 2022 17:51:12 GMT
- Title: Homotopic Policy Mirror Descent: Policy Convergence, Implicit
Regularization, and Improved Sample Complexity
- Authors: Yan Li, Tuo Zhao, Guanghui Lan
- Abstract summary: We propose the homotopic policy mirror descent (HPMD) method for solving discounted, infinite-horizon MDPs with finite state and action spaces. We report three properties that seem to be new in the literature of policy gradient methods.
- Score: 40.2022466644885
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We propose the homotopic policy mirror descent (HPMD) method for solving
discounted, infinite-horizon MDPs with finite state and action spaces, and study
its policy convergence. We report three properties that seem to be new in the
literature of policy gradient methods: (1) The policy first converges linearly,
then superlinearly with order $\gamma^{-2}$, to the set of optimal policies
after $\mathcal{O}(\log(1/\Delta^*))$ iterations, where $\Delta^*$ is
defined via a gap quantity associated with the optimal state-action value
function; (2) HPMD also exhibits last-iterate convergence, with the limiting
policy corresponding exactly to the optimal policy of maximal entropy at
every state. No regularization is added to the optimization objective, so
the second observation arises solely as an algorithmic property of the
homotopic policy gradient method; (3) For the stochastic HPMD method, we
further demonstrate a sample complexity better than $\mathcal{O}(|\mathcal{S}| |\mathcal{A}| /
\epsilon^2)$ for small optimality gap $\epsilon$, assuming a generative
model for policy evaluation.
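The abstract does not spell out the HPMD update itself. As a rough, hedged illustration of the idea it names, the sketch below implements a policy-mirror-descent-style update on a tabular MDP with a regularization weight that is driven to zero over iterations (the "homotopy"), using a KL mirror map and exact policy evaluation. The step size, the regularization schedule, and the choice of entropy regularization toward the uniform policy are illustrative assumptions, not the paper's exact prescriptions.

```python
import numpy as np

# Minimal sketch of a homotopic policy-mirror-descent-style iteration
# (illustrative assumptions, not the paper's exact HPMD specification).

def policy_evaluation(P, R, pi, gamma):
    """Exact evaluation of Q^pi for a tabular MDP.
    P: (S, A, S) transition tensor, R: (S, A) rewards, pi: (S, A) policy."""
    S, A = R.shape
    # State-action transition matrix: P_pi[(s,a),(s',a')] = P(s'|s,a) * pi(a'|s').
    P_pi = np.einsum("sap,pb->sapb", P, pi).reshape(S * A, S * A)
    Q = np.linalg.solve(np.eye(S * A) - gamma * P_pi, R.reshape(S * A))
    return Q.reshape(S, A)

def hpmd_sketch(P, R, gamma, iters=200):
    S, A = R.shape
    pi = np.full((S, A), 1.0 / A)          # start from the uniform policy
    for k in range(iters):
        Q = policy_evaluation(P, R, pi, gamma)
        eta = 1.0 / (1.0 - gamma)          # step size: an assumption for illustration
        tau = 1.0 / (k + 1) ** 2           # diminishing regularization weight (the homotopy)
        # Closed-form KL mirror-descent step with entropy regularization toward
        # the uniform policy; the uniform reference only shifts logits by a constant.
        logits = (np.log(pi) + eta * Q) / (1.0 + eta * tau)
        logits -= logits.max(axis=1, keepdims=True)
        pi = np.exp(logits)
        pi /= pi.sum(axis=1, keepdims=True)
    return pi
```

Because the regularization weight vanishes, no regularization term remains in the objective being solved; in this sketch it only shapes the trajectory of iterates, which is the mechanism the abstract credits for last-iterate convergence to the maximal-entropy optimal policy.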