Policy Gradient in Robust MDPs with Global Convergence Guarantee
- URL: http://arxiv.org/abs/2212.10439v2
- Date: Wed, 7 Jun 2023 12:20:17 GMT
- Title: Policy Gradient in Robust MDPs with Global Convergence Guarantee
- Authors: Qiuhao Wang, Chin Pang Ho, Marek Petrik
- Abstract summary: Robust Markov decision processes (RMDPs) provide a promising framework for computing reliable policies in the face of model errors.
This paper proposes a new Double-Loop Robust Policy Gradient (DRPG), the first generic policy gradient method for RMDPs.
In contrast with prior robust policy gradient algorithms, DRPG monotonically reduces approximation errors to guarantee convergence to a globally optimal policy.
- Score: 13.40471012593073
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Robust Markov decision processes (RMDPs) provide a promising framework for
computing reliable policies in the face of model errors. Many successful
reinforcement learning algorithms build on variations of policy-gradient
methods, but adapting these methods to RMDPs has been challenging. As a result,
the applicability of RMDPs to large, practical domains remains limited. This
paper proposes a new Double-Loop Robust Policy Gradient (DRPG), the first
generic policy gradient method for RMDPs. In contrast with prior robust policy
gradient algorithms, DRPG monotonically reduces approximation errors to
guarantee convergence to a globally optimal policy in tabular RMDPs. We
introduce a novel parametric transition kernel and solve the inner loop robust
policy via a gradient-based method. Finally, our numerical results demonstrate
the utility of our new algorithm and confirm its global convergence properties.
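To make the double-loop idea above concrete, here is a minimal tabular sketch of a scheme in this spirit: an inner loop that takes projected gradient steps on the transition kernel toward the worst case inside an L2 ball around a nominal model, and an outer loop that takes softmax policy-gradient steps against the worst-case kernel found. The ambiguity set, step sizes, iteration counts, and the heuristic ball projection are all assumptions made for illustration; this is not the paper's DRPG algorithm or its parametric transition kernel construction.
```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9
mu = np.full(S, 1.0 / S)                              # start-state distribution (assumed uniform)
P_nominal = rng.dirichlet(np.ones(S), size=(S, A))    # nominal kernel, shape (S, A, S)
R = rng.uniform(size=(S, A))                          # reward table
radius = 0.1                                          # ambiguity-set radius (assumed)


def project_simplex(v):
    """Euclidean projection of each slice along the last axis onto the probability simplex."""
    u = np.sort(v, axis=-1)[..., ::-1]
    css = np.cumsum(u, axis=-1) - 1.0
    idx = np.arange(1, v.shape[-1] + 1)
    rho = (u - css / idx > 0).sum(axis=-1, keepdims=True)
    tau = np.take_along_axis(css, rho - 1, axis=-1) / rho
    return np.maximum(v - tau, 0.0)


def evaluate(pi, P):
    """Exact policy evaluation plus the discounted state-occupancy measure."""
    P_pi = np.einsum("sa,sax->sx", pi, P)
    r_pi = np.einsum("sa,sa->s", pi, R)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu)
    return V, d


def softmax(theta):
    z = theta - theta.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


theta = np.zeros((S, A))                              # softmax policy parameters
for outer in range(200):
    pi = softmax(theta)
    # ---- inner loop: drive the kernel toward the worst case for the current policy ----
    P = P_nominal.copy()
    for inner in range(50):
        V, d = evaluate(pi, P)
        # dJ/dP(s'|s,a) = gamma * d(s) * pi(a|s) * V(s'); the adversary descends it
        grad_P = gamma * d[:, None, None] * pi[:, :, None] * V[None, None, :]
        P = project_simplex(P - 0.5 * grad_P)
        # heuristic pull-back into an L2 ball around the nominal kernel (assumed set;
        # not an exact projection onto the intersection with the simplex)
        diff = P - P_nominal
        nrm = np.linalg.norm(diff)
        if nrm > radius:
            P = project_simplex(P_nominal + radius * diff / nrm)
    # ---- outer loop: softmax policy-gradient ascent against the worst-case kernel ----
    V, d = evaluate(pi, P)
    Q = R + gamma * np.einsum("sax,x->sa", P, V)
    adv = Q - (pi * Q).sum(axis=1, keepdims=True)
    theta += 2.0 * d[:, None] * pi * adv              # fixed step size (assumed)

pi = softmax(theta)
print("approximate robust return:", float(mu @ evaluate(pi, P)[0]))
```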
Related papers
- Policy Gradient for Robust Markov Decision Processes [16.281897051782863]
This paper introduces a novel policy gradient method, Double-Loop Robust Policy Mirror Descent (DRPMD), for solving robust Markov decision processes (MDPs).
DRPMD employs a general mirror descent update rule for the policy optimization with adaptive tolerance per iteration, guaranteeing convergence to a globally optimal policy.
We provide a comprehensive analysis of DRPMD, including new convergence results under both direct and softmax parameterizations, and provide novel insights into the inner problem solution through Transition Mirror Ascent (TMA).
arXiv Detail & Related papers (2024-10-29T15:16:02Z) - Near-Optimal Policy Identification in Robust Constrained Markov Decision Processes via Epigraph Form [26.01796404477275]
This paper presents the first algorithm capable of identifying a near-optimal policy in a robust constrained MDP (RCMDP).
An optimal policy minimizes cumulative cost while satisfying constraints in the worst-case scenario across a set of environments.
arXiv Detail & Related papers (2024-08-29T06:37:16Z) - Deterministic Policy Gradient Primal-Dual Methods for Continuous-Space Constrained MDPs [82.34567890576423]
We develop a deterministic policy gradient primal-dual (D-PGPD) method to find an optimal deterministic policy with non-asymptotic convergence.
We prove that the primal-dual iterates of D-PGPD converge at a sub-linear rate to an optimal regularized primal-dual pair.
To the best of our knowledge, this appears to be the first work that proposes a deterministic policy search method for continuous-space constrained MDPs.
arXiv Detail & Related papers (2024-08-19T14:11:04Z) - Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z) - On the Global Convergence of Policy Gradient in Average Reward Markov Decision Processes [50.68789924454235]
We present the first finite-time global convergence analysis of policy gradient in the context of average-reward Markov decision processes (MDPs).
Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of $O\left(\frac{1}{T}\right)$, which translates to $O\left(\log(T)\right)$ regret, where $T$ represents the number of iterations.
arXiv Detail & Related papers (2024-03-11T15:25:03Z) - Last-Iterate Convergent Policy Gradient Primal-Dual Methods for
Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted Markov decision process (constrained MDP)
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z) - Multi-Objective Policy Gradients with Topological Constraints [108.10241442630289]
We present a new policy gradient algorithm for TMDPs via a simple extension of the proximal policy optimization (PPO) algorithm.
We demonstrate this on a real-world multiple-objective navigation problem with an arbitrary ordering of objectives both in simulation and on a real robot.
arXiv Detail & Related papers (2022-09-15T07:22:58Z) - Anchor-Changing Regularized Natural Policy Gradient for Multi-Objective Reinforcement Learning [17.916366827429034]
We study policy optimization for Markov decision processes (MDPs) with multiple reward value functions.
We propose an Anchor-changing Regularized Natural Policy Gradient framework, which can incorporate ideas from well-performing first-order methods.
arXiv Detail & Related papers (2022-06-10T21:09:44Z) - Stochastic first-order methods for average-reward Markov decision processes [10.023632561462712]
We study average-reward Markov decision processes (AMDPs) and develop novel first-order methods with strong theoretical guarantees for both policy optimization and policy evaluation.
By combining the policy evaluation and policy optimization parts, we establish sample complexity results for solving AMDPs under both generative and Markovian noise models.
arXiv Detail & Related papers (2022-05-11T23:02:46Z) - Softmax Policy Gradient Methods Can Take Exponential Time to Converge [60.98700344526674]
The softmax policy gradient (PG) method is arguably one of the de facto implementations of policy optimization in modern reinforcement learning.
We demonstrate that softmax PG methods can take exponential time -- in terms of $|\mathcal{S}|$ and $\frac{1}{1-\gamma}$ -- to converge.
arXiv Detail & Related papers (2021-02-22T18:56:26Z) - Fast Global Convergence of Natural Policy Gradient Methods with Entropy Regularization [44.24881971917951]
Natural policy gradient (NPG) methods are among the most widely used policy optimization algorithms.
We develop convergence guarantees for entropy-regularized NPG methods under softmax parameterization.
Our results accommodate a wide range of learning rates, and shed light upon the role of entropy regularization in enabling fast convergence.
arXiv Detail & Related papers (2020-07-13T17:58:41Z)
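As an illustration of the last entry above (entropy-regularized NPG under softmax parameterization), here is a small tabular sketch using the multiplicative form of that update, pi_{t+1}(a|s) proportional to pi_t(a|s)^{1 - eta*tau/(1-gamma)} * exp(eta * Q_tau(s,a) / (1-gamma)). The MDP instance, step size eta, and regularization strength tau are assumptions made for the example; the exact update constants and convergence guarantees are the ones stated in the cited paper.
```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma, tau, eta = 5, 3, 0.9, 0.1, 0.5
P = rng.dirichlet(np.ones(S), size=(S, A))    # transition kernel, shape (S, A, S)
R = rng.uniform(size=(S, A))                  # reward table


def soft_q(pi):
    """Exact entropy-regularized evaluation: solve for V_tau, then form Q_tau."""
    r_pi = np.einsum("sa,sa->s", pi, R - tau * np.log(pi))
    P_pi = np.einsum("sa,sax->sx", pi, P)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return R + gamma * np.einsum("sax,x->sa", P, V)


pi = np.full((S, A), 1.0 / A)                 # start from the uniform policy
for t in range(100):
    Q = soft_q(pi)
    # multiplicative update: pi_{t+1} proportional to
    #   pi_t^{1 - eta*tau/(1-gamma)} * exp(eta * Q_tau / (1-gamma))
    logits = (1.0 - eta * tau / (1.0 - gamma)) * np.log(pi) + eta * Q / (1.0 - gamma)
    logits -= logits.max(axis=1, keepdims=True)
    pi = np.exp(logits)
    pi /= pi.sum(axis=1, keepdims=True)

# With eta = (1 - gamma) / tau the same update reduces to soft policy iteration,
# i.e. pi(a|s) proportional to exp(Q_tau(s,a) / tau).
print(np.round(pi, 3))
```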