Last-Iterate Convergent Policy Gradient Primal-Dual Methods for
Constrained MDPs
- URL: http://arxiv.org/abs/2306.11700v2
- Date: Wed, 17 Jan 2024 04:52:39 GMT
- Title: Last-Iterate Convergent Policy Gradient Primal-Dual Methods for
Constrained MDPs
- Authors: Dongsheng Ding and Chen-Yu Wei and Kaiqing Zhang and Alejandro Ribeiro
- Abstract summary: We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
- Score: 107.28031292946774
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We study the problem of computing an optimal policy of an infinite-horizon
discounted constrained Markov decision process (constrained MDP). Despite the
popularity of Lagrangian-based policy search methods in practice, the
oscillation of policy iterates in these methods has not been fully understood,
leading to issues such as constraint violation and sensitivity to
hyperparameters. To fill this gap, we employ the Lagrangian method to cast a
constrained MDP into a constrained saddle-point problem in which max/min
players correspond to primal/dual variables, respectively, and develop two
single-time-scale policy-based primal-dual algorithms with non-asymptotic
convergence of their policy iterates to an optimal constrained policy.
Specifically, we first propose a regularized policy gradient primal-dual
(RPG-PD) method that simultaneously updates the policy via an entropy-regularized
policy gradient step and the dual variable via a quadratic-regularized gradient
ascent step. We prove that the policy primal-dual iterates of RPG-PD
converge to a regularized saddle point with a sublinear rate, while the policy
iterates converge sublinearly to an optimal constrained policy. We further
instantiate RPG-PD in large state or action spaces by including function
approximation in policy parametrization, and establish similar sublinear
last-iterate policy convergence. Second, we propose an optimistic policy
gradient primal-dual (OPG-PD) method that employs the optimistic gradient
method to simultaneously update the primal and dual variables. We prove that the
policy primal-dual iterates of OPG-PD converge to a saddle point that contains
an optimal constrained policy, with a linear rate. To the best of our
knowledge, this work appears to be the first non-asymptotic policy last-iterate
convergence result for single-time-scale algorithms in constrained MDPs.
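
Code sketch (not from the paper): below is a minimal, self-contained Python sketch of a single-time-scale regularized primal-dual loop in the spirit of RPG-PD, run on a hypothetical toy tabular constrained MDP. The closed-form entropy-regularized NPG-style policy update, the quadratic dual regularizer, the sign conventions, the constraint direction V_u >= b, and all names and parameter values (`soft_q`, `value`, `eta`, `tau`, `b`, the random MDP instance) are illustrative assumptions, not the paper's exact algorithm, analysis, or hyperparameter choices.

```python
import numpy as np

# Toy tabular constrained MDP (hypothetical instance): maximize V_r subject to V_u >= b.
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s']: transition probabilities
r = rng.uniform(size=(S, A))                 # reward to maximize
u = rng.uniform(size=(S, A))                 # utility defining the constraint
b = 0.4 / (1.0 - gamma)                      # constraint threshold (assumption)
rho = np.ones(S) / S                         # initial-state distribution

def soft_q(pi, payoff, tau):
    """Entropy-regularized Q-function of `payoff` under `pi`, computed by
    iterating the soft Bellman expectation operator (a gamma-contraction)."""
    Q = np.zeros((S, A))
    for _ in range(500):
        V = (pi * (Q - tau * np.log(pi + 1e-12))).sum(axis=1)
        Q = payoff + gamma * P @ V
    return Q

def value(pi, payoff):
    """Unregularized discounted value of `payoff` under `pi` from rho."""
    return rho @ (pi * soft_q(pi, payoff, tau=0.0)).sum(axis=1)

# Single-loop regularized primal-dual iteration (a sketch, not the paper's exact RPG-PD).
tau, eta, lam = 0.05, 0.1, 0.0
pi = np.ones((S, A)) / A
for t in range(300):
    # Primal step: closed-form entropy-regularized NPG-style update on the
    # Lagrangian payoff r + lam * u:  pi_new ∝ pi^(1 - eta*tau) * exp(eta * Q).
    Q = soft_q(pi, r + lam * u, tau)
    logits = (1.0 - eta * tau) * np.log(pi + 1e-12) + eta * Q
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)

    # Dual step: projected gradient step with a quadratic regularizer; lam grows
    # when the constraint V_u >= b is violated and shrinks toward 0 otherwise.
    lam = max(0.0, (1.0 - eta * tau) * lam + eta * (b - value(pi, u)))

print(f"V_r = {value(pi, r):.3f}  V_u = {value(pi, u):.3f}  (target >= {b:.3f})  lam = {lam:.3f}")
```

An OPG-PD-style variant would instead apply optimistic (extrapolated) gradient updates to both the policy and the dual variable in the same single loop; the sketch above only illustrates the regularized flavor.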