Landscape of Policy Optimization for Finite Horizon MDPs with General State and Action
- URL: http://arxiv.org/abs/2409.17138v1
- Date: Wed, 25 Sep 2024 17:56:02 GMT
- Title: Landscape of Policy Optimization for Finite Horizon MDPs with General State and Action
- Authors: Xin Chen, Yifan Hu, Minda Zhao,
- Abstract summary: We develop a framework for a class of Markov Decision Processes with general state and spaces.
We show that gradient methods converge to the globally optimal policy with a nonasymptomatic condition.
Our result establishes first complexity for multi-period inventory systems.
- Score: 10.219627570276689
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Policy gradient methods are widely used in reinforcement learning. Yet, the nonconvexity of policy optimization imposes significant challenges in understanding the global convergence of policy gradient methods. For a class of finite-horizon Markov Decision Processes (MDPs) with general state and action spaces, we develop a framework that provides a set of easily verifiable assumptions to ensure the Kurdyka-Lojasiewicz (KL) condition of the policy optimization. Leveraging the KL condition, policy gradient methods converge to the globally optimal policy with a non-asymptomatic rate despite nonconvexity. Our results find applications in various control and operations models, including entropy-regularized tabular MDPs, Linear Quadratic Regulator (LQR) problems, stochastic inventory models, and stochastic cash balance problems, for which we show an $\epsilon$-optimal policy can be obtained using a sample size in $\tilde{\mathcal{O}}(\epsilon^{-1})$ and polynomial in terms of the planning horizon by stochastic policy gradient methods. Our result establishes the first sample complexity for multi-period inventory systems with Markov-modulated demands and stochastic cash balance problems in the literature.
Related papers
- Near-Optimal Policy Identification in Robust Constrained Markov Decision Processes via Epigraph Form [26.01796404477275]
This paper presents the first algorithm capable of identifying a near-optimal policy in a robust constrained MDP (RCMDP)
An optimal policy minimizes cumulative cost while satisfying constraints in the worst-case scenario across a set of environments.
arXiv Detail & Related papers (2024-08-29T06:37:16Z) - Deterministic Policy Gradient Primal-Dual Methods for Continuous-Space Constrained MDPs [82.34567890576423]
We develop a deterministic policy gradient primal-dual method to find an optimal deterministic policy with non-asymptotic convergence.
We prove that the primal-dual iterates of D-PGPD converge at a sub-linear rate to an optimal regularized primal-dual pair.
To the best of our knowledge, this appears to be the first work that proposes a deterministic policy search method for continuous-space constrained MDPs.
arXiv Detail & Related papers (2024-08-19T14:11:04Z) - Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-ite convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z) - Beyond Stationarity: Convergence Analysis of Stochastic Softmax Policy Gradient Methods [0.40964539027092917]
Markov Decision Processes (MDPs) are a formal framework for modeling and solving sequential decision-making problems.
In practice all parameters are often trained simultaneously, ignoring the inherent structure suggested by dynamic programming.
This paper introduces a combination of dynamic programming and policy gradient called dynamic policy gradient, where the parameters are trained backwards in time.
arXiv Detail & Related papers (2023-10-04T09:21:01Z) - Last-Iterate Convergent Policy Gradient Primal-Dual Methods for
Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted Markov decision process (constrained MDP)
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z) - Policy Gradient for Rectangular Robust Markov Decision Processes [62.397882389472564]
We introduce robust policy gradient (RPG), a policy-based method that efficiently solves rectangular robust Markov decision processes (MDPs)
Our resulting RPG can be estimated from data with the same time complexity as its non-robust equivalent.
arXiv Detail & Related papers (2023-01-31T12:40:50Z) - Multi-Objective Policy Gradients with Topological Constraints [108.10241442630289]
We present a new algorithm for a policy gradient in TMDPs by a simple extension of the proximal policy optimization (PPO) algorithm.
We demonstrate this on a real-world multiple-objective navigation problem with an arbitrary ordering of objectives both in simulation and on a real robot.
arXiv Detail & Related papers (2022-09-15T07:22:58Z) - Anchor-Changing Regularized Natural Policy Gradient for Multi-Objective
Reinforcement Learning [17.916366827429034]
We study policy optimization for Markov decision processes (MDPs) with multiple reward value functions.
We propose an Anchor-changing Regularized Natural Policy Gradient framework, which can incorporate ideas from well-performing first-order methods.
arXiv Detail & Related papers (2022-06-10T21:09:44Z) - MPC-based Reinforcement Learning for Economic Problems with Application
to Battery Storage [0.0]
We focus on policy approximations based on Model Predictive Control (MPC)
We observe that the policy gradient method can struggle to produce meaningful steps in the policy parameters when the policy has a (nearly) bang-bang structure.
We propose a homotopy strategy based on the interior-point method, providing a relaxation of the policy during the learning.
arXiv Detail & Related papers (2021-04-06T10:37:14Z) - Softmax Policy Gradient Methods Can Take Exponential Time to Converge [60.98700344526674]
The softmax policy gradient (PG) method is arguably one of the de facto implementations of policy optimization in modern reinforcement learning.
We demonstrate that softmax PG methods can take exponential time -- in terms of $mathcalS|$ and $frac11-gamma$ -- to converge.
arXiv Detail & Related papers (2021-02-22T18:56:26Z) - Policy Gradient Methods for the Noisy Linear Quadratic Regulator over a
Finite Horizon [3.867363075280544]
We explore reinforcement learning methods for finding the optimal policy in the linear quadratic regulator (LQR) problem.
We produce a global linear convergence guarantee for the setting of finite time horizon and state dynamics under weak assumptions.
We show results for the case where we assume a model for the underlying dynamics and where we apply the method to the data directly.
arXiv Detail & Related papers (2020-11-20T09:51:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.