On the Global Convergence of Policy Gradient in Average Reward Markov
Decision Processes
- URL: http://arxiv.org/abs/2403.06806v1
- Date: Mon, 11 Mar 2024 15:25:03 GMT
- Title: On the Global Convergence of Policy Gradient in Average Reward Markov
Decision Processes
- Authors: Navdeep Kumar, Yashaswini Murthy, Itai Shufaro, Kfir Y. Levy, R.
Srikant and Shie Mannor
- Abstract summary: We present the first finite time global convergence analysis of policy gradient in the context of average reward Markov decision processes (MDPs).
Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of $O\left(\frac{1}{T}\right),$ which translates to $O\left(\log(T)\right)$ regret, where $T$ represents the number of iterations.
- Score: 50.68789924454235
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present the first finite time global convergence analysis of policy
gradient in the context of infinite horizon average reward Markov decision
processes (MDPs). Specifically, we focus on ergodic tabular MDPs with finite
state and action spaces. Our analysis shows that the policy gradient iterates
converge to the optimal policy at a sublinear rate of
$O\left({\frac{1}{T}}\right),$ which translates to $O\left({\log(T)}\right)$
regret, where $T$ represents the number of iterations. Prior work on
performance bounds for discounted reward MDPs cannot be extended to average
reward MDPs because the bounds grow proportional to the fifth power of the
effective horizon. Thus, our primary contribution is in proving that the policy
gradient algorithm converges for average-reward MDPs and in obtaining
finite-time performance guarantees. In contrast to the existing discounted
reward performance bounds, our performance bounds have an explicit dependence
on constants that capture the complexity of the underlying MDP. Motivated by
this observation, we reexamine and improve the existing performance bounds for
discounted reward MDPs. We also present simulations to empirically evaluate the
performance of the average reward policy gradient algorithm.
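The abstract does not include pseudocode, so the following is a minimal, model-based sketch of the setting it analyzes: tabular softmax policy gradient on a small ergodic average-reward MDP, with exact policy evaluation via the stationary distribution and the Poisson (bias) equation. The random MDP, the step size `eta`, the iteration count, and the helper names (`softmax_policy`, `evaluate`) are illustrative assumptions, not the authors' algorithm or constants.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small random ergodic tabular MDP (illustrative; not taken from the paper).
n_states, n_actions = 5, 3
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)       # transition kernel P(s' | s, a)
r = rng.random((n_states, n_actions))   # reward r(s, a)

def softmax_policy(theta):
    """Tabular softmax policy pi(a | s) from logits theta[s, a]."""
    z = theta - theta.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def evaluate(pi):
    """Exact stationary distribution mu, average reward rho, and Q-values under pi."""
    P_pi = np.einsum("sa,saq->sq", pi, P)   # state transition matrix under pi
    r_pi = np.einsum("sa,sa->s", pi, r)     # expected one-step reward under pi
    # Stationary distribution: mu^T P_pi = mu^T with sum(mu) = 1 (least squares).
    A = np.vstack([P_pi.T - np.eye(n_states), np.ones(n_states)])
    b = np.zeros(n_states + 1)
    b[-1] = 1.0
    mu = np.linalg.lstsq(A, b, rcond=None)[0]
    rho = mu @ r_pi
    # Bias (relative value) h from the Poisson equation, normalized so mu @ h = 0:
    # (I - P_pi + 1 mu^T) h = r_pi - rho.
    h = np.linalg.solve(
        np.eye(n_states) - P_pi + np.outer(np.ones(n_states), mu), r_pi - rho
    )
    Q = r - rho + np.einsum("saq,q->sa", P, h)
    return mu, rho, Q

theta = np.zeros((n_states, n_actions))  # softmax logits
eta = 0.5                                # step size (illustrative choice)
for _ in range(2000):
    pi = softmax_policy(theta)
    mu, rho, Q = evaluate(pi)
    adv = Q - (pi * Q).sum(axis=1, keepdims=True)   # advantage A(s, a)
    theta += eta * mu[:, None] * pi * adv           # exact policy gradient step

pi = softmax_policy(theta)
print("average reward of the learned policy:", evaluate(pi)[1])
```

The regret statement in the abstract follows directly from the iterate rate: if the suboptimality after $t$ iterations is bounded by $C/t$, then the cumulative regret over $T$ iterations is at most $C \sum_{t=1}^{T} 1/t \le C(1 + \log T) = O(\log(T))$.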
Related papers
- Deterministic Policy Gradient Primal-Dual Methods for Continuous-Space Constrained MDPs [82.34567890576423]
We develop a deterministic policy gradient primal-dual method to find an optimal deterministic policy with non-asymptotic convergence.
We prove that the primal-dual iterates of D-PGPD converge at a sub-linear rate to an optimal regularized primal-dual pair.
To the best of our knowledge, this appears to be the first work that proposes a deterministic policy search method for continuous-space constrained MDPs.
arXiv Detail & Related papers (2024-08-19T14:11:04Z) - Learning General Parameterized Policies for Infinite Horizon Average Reward Constrained MDPs via Primal-Dual Policy Gradient Algorithm [34.593772931446125]
We propose a primal-dual-based policy gradient algorithm that adeptly manages the constraints while ensuring a low regret guarantee toward achieving a globally optimal policy.
Our proposed algorithm achieves $\tilde{\mathcal{O}}(T^{4/5})$ objective regret and $\tilde{\mathcal{O}}(T^{4/5})$ constraint violation bounds.
arXiv Detail & Related papers (2024-02-03T05:35:58Z) - Score-Aware Policy-Gradient Methods and Performance Guarantees using Local Lyapunov Conditions: Applications to Product-Form Stochastic Networks and Queueing Systems [1.747623282473278]
We introduce a policy-gradient method for model-based reinforcement learning (RL) that exploits a type of stationary distribution commonly obtained from Markov decision processes (MDPs) in networks.
Specifically, when the stationary distribution of the MDP is parametrized by the policy parameters, we can improve existing policy-gradient methods for average-reward estimation.
arXiv Detail & Related papers (2023-12-05T14:44:58Z) - Regret Analysis of Policy Gradient Algorithm for Infinite Horizon
Average Reward Markov Decision Processes [38.879933964474326]
We consider an infinite horizon average reward Markov Decision Process (MDP).
We propose a policy gradient-based algorithm and show its global convergence property.
We prove that the proposed algorithm has $\tilde{\mathcal{O}}(T^{3/4})$ regret.
arXiv Detail & Related papers (2023-09-05T03:22:46Z) - Last-Iterate Convergent Policy Gradient Primal-Dual Methods for
Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP).
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z) - Performance Bounds for Policy-Based Average Reward Reinforcement
Learning Algorithms [11.013390624382259]
Many policy-based reinforcement learning (RL) algorithms can be viewed as instantiations of approximate policy iteration (PI).
In applications where the average reward objective is the meaningful performance metric, discounted reward formulations are often used with the discount factor being close to $1,$ which is equivalent to making the expected horizon very large.
In this paper, we obtain the first finite-time error bounds for average-reward MDPs and show that the error goes to zero in the limit as the policy evaluation and policy improvement errors go to zero.
arXiv Detail & Related papers (2023-02-02T22:37:47Z) - Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time
Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z) - Convergence of policy gradient for entropy regularized MDPs with neural
network approximation in the mean-field regime [0.0]
We study the global convergence of policy gradient for infinite-horizon, continuous state and action space, entropy-regularized Markov decision processes (MDPs).
Our results rely on the careful analysis of a non-linear Fokker--Planck--Kolmogorov equation.
arXiv Detail & Related papers (2022-01-18T20:17:16Z) - Robust and Adaptive Temporal-Difference Learning Using An Ensemble of
Gaussian Processes [70.80716221080118]
The paper takes a generative perspective on policy evaluation via temporal-difference (TD) learning.
The OS-GPTD approach is developed to estimate the value function for a given policy by observing a sequence of state-reward pairs.
To alleviate the limited expressiveness associated with a single fixed kernel, a weighted ensemble (E) of GP priors is employed to yield an alternative scheme.
arXiv Detail & Related papers (2021-12-01T23:15:09Z) - Softmax Policy Gradient Methods Can Take Exponential Time to Converge [60.98700344526674]
The softmax policy gradient (PG) method is arguably one of the de facto implementations of policy optimization in modern reinforcement learning.
We demonstrate that softmax PG methods can take exponential time -- in terms of $|\mathcal{S}|$ and $\frac{1}{1-\gamma}$ -- to converge.
arXiv Detail & Related papers (2021-02-22T18:56:26Z)