Provable Policy Gradient Methods for Average-Reward Markov Potential Games
- URL: http://arxiv.org/abs/2403.05738v1
- Date: Sat, 9 Mar 2024 00:20:33 GMT
- Title: Provable Policy Gradient Methods for Average-Reward Markov Potential Games
- Authors: Min Cheng, Ruida Zhou, P. R. Kumar and Chao Tian
- Abstract summary: We study Markov potential games under the infinite horizon average reward criterion.
We prove that algorithms based on independent policy gradient and on independent natural policy gradient both converge globally to a Nash equilibrium under the average-reward criterion.
- Score: 15.136494705127564
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We study Markov potential games under the infinite horizon average reward
criterion. Most previous studies have been for discounted rewards. We prove
that algorithms based on independent policy gradient and on independent
natural policy gradient both converge globally to a Nash equilibrium for the average
reward criterion. To set the stage for gradient-based methods, we first
establish that the average reward is a smooth function of policies and provide
sensitivity bounds for the differential value functions, under certain
conditions on ergodicity and the second largest eigenvalue of the underlying
Markov decision process (MDP). We prove that three algorithms, policy gradient,
proximal-Q, and natural policy gradient (NPG), converge to an $\epsilon$-Nash
equilibrium with time complexity $O(\frac{1}{\epsilon^2})$, given a
gradient/differential Q function oracle. When policy gradients have to be
estimated, we propose an algorithm with
$\tilde{O}(\frac{1}{\min_{s,a}\pi(a|s)\delta})$ sample complexity to achieve
$\delta$ approximation error w.r.t. the $\ell_2$ norm. Equipped with the
estimator, we derive the first sample complexity analysis for a policy gradient
ascent algorithm, featuring a sample complexity of $\tilde{O}(1/\epsilon^5)$.
Simulation studies are presented.
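To make the setup above concrete, below is a minimal, illustrative sketch (not the authors' implementation) of independent policy gradient ascent in a tiny average-reward game. It uses a randomly generated identical-interest reward, a simple special case of a Markov potential game; softmax policies for two players; all-positive transitions so the ergodicity assumption holds; and central finite differences as a stand-in for the gradient/differential Q-function oracle assumed in the abstract. The state/action sizes, step size, and iteration count are arbitrary illustrative choices.
```python
"""
Sketch: independent policy gradient ascent for a tiny average-reward
identical-interest game (a special case of a Markov potential game).
All constants below are illustrative; the finite-difference gradient
is a stand-in for the paper's gradient/differential-Q oracle.
"""
import numpy as np

rng = np.random.default_rng(0)

S, A = 3, 2                              # states, actions per player
# Shared (potential) reward r(s, a1, a2) and ergodic transitions
# P(s' | s, a1, a2) with all-positive entries, so every joint policy
# induces a Markov chain with a unique stationary distribution.
R = rng.uniform(0.0, 1.0, size=(S, A, A))
P = rng.uniform(0.1, 1.0, size=(S, A, A, S))
P /= P.sum(axis=-1, keepdims=True)

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def average_reward(theta1, theta2):
    """J(pi) = sum_s mu_pi(s) * E_{a1,a2 ~ pi(.|s)}[ r(s, a1, a2) ]."""
    pi1, pi2 = softmax(theta1), softmax(theta2)        # each (S, A)
    joint = pi1[:, :, None] * pi2[:, None, :]          # (S, A, A)
    P_pi = np.einsum('sab,sabt->st', joint, P)         # state chain
    r_pi = np.einsum('sab,sab->s', joint, R)           # expected reward
    # Stationary distribution = left eigenvector of P_pi for eigenvalue 1.
    eigvals, eigvecs = np.linalg.eig(P_pi.T)
    mu = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
    mu = np.abs(mu) / np.abs(mu).sum()
    return float(mu @ r_pi)

def own_gradient(theta_i, theta_other, player, eps=1e-5):
    """Central finite differences w.r.t. player i's own logits only."""
    grad = np.zeros_like(theta_i)
    for idx in np.ndindex(theta_i.shape):
        for sign in (+1.0, -1.0):
            th = theta_i.copy()
            th[idx] += sign * eps
            args = (th, theta_other) if player == 1 else (theta_other, th)
            grad[idx] += sign * average_reward(*args)
    return grad / (2.0 * eps)

# Independent policy gradient ascent: each player updates its own
# logits using only the gradient of its own average reward.
theta1, theta2, eta = np.zeros((S, A)), np.zeros((S, A)), 1.0
for t in range(200):
    g1 = own_gradient(theta1, theta2, player=1)
    g2 = own_gradient(theta2, theta1, player=2)
    theta1 += eta * g1
    theta2 += eta * g2
    if t % 50 == 0:
        print(f"iter {t:3d}  J = {average_reward(theta1, theta2):.4f}")
print(f"final     J = {average_reward(theta1, theta2):.4f}")
```
Because both players share the potential here, the independent updates jointly perform gradient ascent on $J$, and with a sufficiently small step size the average reward improves monotonically; this potential-improvement argument is the mechanism behind the oracle-based $O(\frac{1}{\epsilon^2})$ guarantee stated above.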
Related papers
- Convergence of a model-free entropy-regularized inverse reinforcement learning algorithm [6.481009996429766]
Inverse reinforcement learning (IRL) aims to recover a reward function under which the expert's policy is optimal.
This work proposes a model-free algorithm to solve the entropy-regularized IRL problem.
arXiv Detail & Related papers (2024-03-25T14:54:42Z)
- Improved Sample Complexity Analysis of Natural Policy Gradient Algorithm with General Parameterization for Infinite Horizon Discounted Reward Markov Decision Processes [41.61653528766776]
We propose the Accelerated Natural Policy Gradient (ANPG) algorithm, which utilizes an accelerated gradient descent procedure to obtain the natural policy gradient.
ANPG achieves $\mathcal{O}(\epsilon^{-2})$ sample complexity and $\mathcal{O}(\epsilon^{-1})$ iteration complexity.
In the class of Hessian-free and IS-free algorithms, ANPG beats the best-known sample complexity by a factor of $\mathcal{O}(\epsilon^{-\frac{1}{2}})$ and simultaneously matches their state-of-the-art iteration complexity.
arXiv Detail & Related papers (2023-10-18T03:00:15Z)
- Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs [60.40452803295326]
We propose a new reward-free algorithm for learning linear mixture Markov decision processes (MDPs).
At the core of our algorithm is uncertainty-weighted value-targeted regression with exploration-driven pseudo-reward.
We show that our algorithm only needs to explore $\tilde{O}(d^2\varepsilon^{-2})$ episodes to find an $\varepsilon$-optimal policy.
arXiv Detail & Related papers (2023-03-17T17:53:28Z)
- Stochastic Policy Gradient Methods: Improved Sample Complexity for Fisher-non-degenerate Policies [19.779044926914704]
We develop improved global convergence guarantees for a general class of Fisher-non-degenerate parameterized policies.
In this work, we propose a Normalized Policy Gradient method with Implicit Gradient Transport (N-PG-IGT) and derive a $\tilde{\mathcal{O}}(\varepsilon^{-2.5})$ sample complexity for this method.
We further improve this complexity to $\tilde{\mathcal{O}}(\varepsilon^{-2})$ by considering a Hessian-Aided Recursive Policy Gradient algorithm.
arXiv Detail & Related papers (2023-02-03T13:50:23Z)
- Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov decision processes in the fixed confidence setting under a generative model.
The lower bound, obtained as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z)
- Homotopic Policy Mirror Descent: Policy Convergence, Implicit Regularization, and Improved Sample Complexity [40.2022466644885]
We propose the homotopic policy mirror descent (HPMD) method for solving discounted, infinite-horizon MDPs with finite state and action spaces.
We report three properties that seem to be new in the literature of policy gradient methods.
arXiv Detail & Related papers (2022-01-24T04:54:58Z)
- A New Framework for Variance-Reduced Hamiltonian Monte Carlo [88.84622104944503]
We propose a new framework of variance-reduced Hamiltonian Monte Carlo (HMC) methods for sampling from an $L$-smooth and $m$-strongly log-concave distribution.
We show that HMC methods based on unbiased gradient estimators, including SAGA and SVRG, achieve the highest gradient efficiency with small batch sizes.
Experimental results on both synthetic and real-world benchmark data show that our new framework significantly outperforms full gradient and stochastic gradient HMC approaches.
arXiv Detail & Related papers (2021-02-09T02:44:24Z)
- Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction [63.41789556777387]
Asynchronous Q-learning aims to learn the optimal action-value function (or Q-function) of a Markov decision process (MDP).
We show that the number of samples needed to yield an entrywise $\varepsilon$-accurate estimate of the Q-function is at most on the order of $\frac{1}{\mu_{\min}(1-\gamma)^5\varepsilon^2} + \frac{t_{\mathrm{mix}}}{\mu_{\min}(1-\gamma)}$ up to some logarithmic factor.
arXiv Detail & Related papers (2020-06-04T17:51:00Z)
- On the Global Convergence Rates of Softmax Policy Gradient Methods [45.1868906130788]
We show that with the true gradient, policy gradient with a softmax parametrization converges at a $O(1/t)$ rate.
Second, we analyze entropy regularized policy gradient and show that it enjoys a significantly faster linear convergence rate.
Third, we explain how entropy regularization improves policy optimization, even with the true gradient.
arXiv Detail & Related papers (2020-05-13T16:01:39Z)
- Reward-Free Exploration for Reinforcement Learning [82.3300753751066]
We propose a new "reward-free RL" framework to isolate the challenges of exploration.
We give an efficient algorithm that conducts $\tilde{\mathcal{O}}(S^2A\,\mathrm{poly}(H)/\epsilon^2)$ episodes of exploration.
We also give a nearly-matching $\Omega(S^2AH^2/\epsilon^2)$ lower bound, demonstrating the near-optimality of our algorithm in this setting.
arXiv Detail & Related papers (2020-02-07T14:03:38Z)
- Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets [71.05306664267832]
Adaptive algorithms perform gradient updates using the history of gradients and are ubiquitous in training deep neural networks.
In this paper, we analyze a variant of the Optimistic Adagrad (OAdagrad) algorithm for nonconvex-nonconcave min-max problems.
Our experiments show that the advantage of adaptive gradient algorithms over non-adaptive ones in GAN training can be observed empirically.
arXiv Detail & Related papers (2019-12-26T22:10:10Z)