Regret Analysis of Policy Gradient Algorithm for Infinite Horizon
Average Reward Markov Decision Processes
- URL: http://arxiv.org/abs/2309.01922v3
- Date: Fri, 2 Feb 2024 19:37:09 GMT
- Title: Regret Analysis of Policy Gradient Algorithm for Infinite Horizon
Average Reward Markov Decision Processes
- Authors: Qinbo Bai, Washim Uddin Mondal, Vaneet Aggarwal
- Abstract summary: We consider an infinite horizon average reward Markov Decision Process (MDP).
We propose a policy gradient-based algorithm and show its global convergence property.
We prove that the proposed algorithm has $\tilde{\mathcal{O}}(T^{3/4})$ regret.
- Score: 38.879933964474326
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we consider an infinite horizon average reward Markov Decision
Process (MDP). Distinguishing itself from existing works within this context,
our approach harnesses the power of the general policy gradient-based
algorithm, liberating it from the constraints of assuming a linear MDP
structure. We propose a policy gradient-based algorithm and show its global
convergence property. We then prove that the proposed algorithm has
$\tilde{\mathcal{O}}({T}^{3/4})$ regret. Remarkably, this paper marks a
pioneering effort by presenting the first exploration into regret-bound
computation for the general parameterized policy gradient algorithm in the
context of average reward scenarios.
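Since the abstract describes the method only at a high level, the following is a minimal, hedged sketch of what a policy-gradient step with a general parameterization can look like in the average-reward setting. This is not the authors' algorithm: the random tabular MDP, the softmax-linear parameterization, the feature map, the step size, and the rollout horizon are hypothetical choices made purely for illustration.

```python
# Illustrative sketch only (NOT the paper's algorithm): one REINFORCE-style
# policy-gradient update for an average-reward MDP, using a hypothetical
# random tabular MDP and a softmax-linear policy parameterization.
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, feat_dim = 5, 3, 8
# Random transition kernel P[s, a] (a distribution over next states) and rewards R[s, a].
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))
phi = rng.normal(size=(n_states, n_actions, feat_dim))  # hypothetical feature map
theta = np.zeros(feat_dim)                              # general policy parameters

def policy(theta, s):
    """Softmax policy over the linear scores phi(s, .) @ theta."""
    logits = phi[s] @ theta
    logits -= logits.max()                              # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def grad_log_pi(theta, s, a):
    """Score function grad_theta log pi_theta(a | s) for the softmax-linear policy."""
    probs = policy(theta, s)
    return phi[s, a] - probs @ phi[s]

def pg_step(theta, horizon=2000, lr=0.05):
    """Roll out one trajectory and take a single policy-gradient ascent step."""
    s = rng.integers(n_states)
    states, actions, rewards = [], [], []
    for _ in range(horizon):
        a = rng.choice(n_actions, p=policy(theta, s))
        states.append(s); actions.append(a); rewards.append(R[s, a])
        s = rng.choice(n_states, p=P[s, a])
    avg_reward = float(np.mean(rewards))                # estimate of the gain J(theta)
    # Differential returns: suffix sums of (reward - average reward).
    diff_returns = np.cumsum((np.array(rewards) - avg_reward)[::-1])[::-1]
    grad_est = np.zeros_like(theta)
    for t, (s_t, a_t) in enumerate(zip(states, actions)):
        grad_est += grad_log_pi(theta, s_t, a_t) * diff_returns[t]
    return theta + lr * grad_est / horizon, avg_reward

theta, j_hat = pg_step(theta)
print(f"estimated average reward after one update: {j_hat:.3f}")
```

The differential-return weighting reflects that, in the average-reward setting, the gradient targets the gain $J(\theta)$ rather than a discounted return; the paper's actual algorithm and its $\tilde{\mathcal{O}}(T^{3/4})$ regret analysis involve details not captured by this sketch.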
Related papers
- On the Global Convergence of Policy Gradient in Average Reward Markov
Decision Processes [50.68789924454235]
We present the first finite time global convergence analysis of policy gradient in the context of average reward Markov decision processes (MDPs).
Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of $O\left(\frac{1}{T}\right)$, which translates to $O\left(\log(T)\right)$ regret, where $T$ represents the number of iterations (a sketch of this translation is given after the list below).
arXiv Detail & Related papers (2024-03-11T15:25:03Z)
- Learning General Parameterized Policies for Infinite Horizon Average Reward Constrained MDPs via Primal-Dual Policy Gradient Algorithm [34.593772931446125]
We propose a primal-dual-based policy gradient algorithm that adeptly manages the constraints while ensuring a low regret guarantee toward achieving a globally optimal policy.
Our proposed algorithm achieves $\tilde{\mathcal{O}}(T^{4/5})$ objective regret and $\tilde{\mathcal{O}}(T^{4/5})$ constraint violation bounds.
arXiv Detail & Related papers (2024-02-03T05:35:58Z)
- A Policy Gradient Method for Confounded POMDPs [7.75007282943125]
We propose a policy gradient method for confounded partially observable Markov decision processes (POMDPs) with continuous state and observation spaces in the offline setting.
We first establish a novel identification result to non-parametrically estimate any history-dependent policy gradient under POMDPs using the offline data.
arXiv Detail & Related papers (2023-05-26T16:48:05Z)
- Achieving Zero Constraint Violation for Constrained Reinforcement Learning via Conservative Natural Policy Gradient Primal-Dual Algorithm [42.83837408373223]
We consider the problem of a constrained Markov decision process (CMDP) in continuous state-action spaces.
We propose a novel Conservative Natural Policy Gradient Primal-Dual Algorithm (C-NPG-PD) to achieve zero constraint violation.
arXiv Detail & Related papers (2022-06-12T22:31:43Z)
- On the Linear convergence of Natural Policy Gradient Algorithm [5.027714423258537]
Recent interest in Reinforcement Learning has motivated the study of methods inspired by optimization.
Among these is the Natural Policy Gradient, which is a mirror descent variant for MDPs.
We present improved finite-time convergence bounds and show that this algorithm has a geometric convergence rate.
arXiv Detail & Related papers (2021-05-04T11:26:12Z)
- Softmax Policy Gradient Methods Can Take Exponential Time to Converge [60.98700344526674]
The softmax policy gradient (PG) method is arguably one of the de facto implementations of policy optimization in modern reinforcement learning.
We demonstrate that softmax PG methods can take exponential time -- in terms of $|\mathcal{S}|$ and $\frac{1}{1-\gamma}$ -- to converge.
arXiv Detail & Related papers (2021-02-22T18:56:26Z)
- Average-Reward Off-Policy Policy Evaluation with Function Approximation [66.67075551933438]
We consider off-policy policy evaluation with function approximation in average-reward MDPs.
Bootstrapping is necessary and, along with off-policy learning and function approximation (FA), results in the deadly triad.
We propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting.
arXiv Detail & Related papers (2021-01-08T00:43:04Z)
- Policy Gradient for Continuing Tasks in Non-stationary Markov Decision Processes [112.38662246621969]
Reinforcement learning considers the problem of finding policies that maximize an expected cumulative reward in a Markov decision process with unknown transition probabilities.
A major drawback of policy gradient-type algorithms is that they are limited to episodic tasks unless stationarity assumptions are imposed.
We compute unbiased navigation gradients of the value function, which we use as ascent directions to update the policy.
arXiv Detail & Related papers (2020-10-16T15:15:42Z)
- Statistically Efficient Off-Policy Policy Gradients [80.42316902296832]
We consider the statistically efficient estimation of policy gradients from off-policy data.
We propose a meta-algorithm that achieves the lower bound without any parametric assumptions.
We establish guarantees on the rate at which we approach a stationary point when we take steps in the direction of our new estimated policy gradient.
arXiv Detail & Related papers (2020-02-10T18:41:25Z)
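For the "On the Global Convergence of Policy Gradient in Average Reward Markov Decision Processes" entry above, the translation from an $O(1/T)$ convergence rate to $O(\log(T))$ regret follows from a standard summation argument. The sketch below assumes the per-iterate suboptimality decays as $C/t$ for some constant $C$; the constant is hypothetical and not taken from that paper.

```latex
% Sketch under the assumption J^* - J(\theta_t) \le C/t for all t = 1, ..., T.
\[
  \mathrm{Reg}(T)
  \;=\; \sum_{t=1}^{T} \bigl(J^{*} - J(\theta_t)\bigr)
  \;\le\; \sum_{t=1}^{T} \frac{C}{t}
  \;\le\; C\,(1 + \log T)
  \;=\; O\!\bigl(\log(T)\bigr).
\]
```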