Convergence and Price of Anarchy Guarantees of the Softmax Policy
Gradient in Markov Potential Games
- URL: http://arxiv.org/abs/2206.07642v1
- Date: Wed, 15 Jun 2022 16:41:06 GMT
- Title: Convergence and Price of Anarchy Guarantees of the Softmax Policy
Gradient in Markov Potential Games
- Authors: Dingyang Chen, Qi Zhang, Thinh T. Doan
- Abstract summary: We study the performance of policy gradient methods for Markov potential games (MPGs), a subclass of Markov games.
We extend the notions of price of anarchy (POA) and smoothness from normal-form games to MPGs.
To our knowledge, this is the first POA bound for solving MPGs.
- Score: 7.878934648314757
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the performance of policy gradient methods for the subclass of
Markov games known as Markov potential games (MPGs), which extends the notion
of normal-form potential games to the stateful setting and includes the
important special case of the fully cooperative setting where the agents share
an identical reward function. Our focus in this paper is to study the
convergence of the policy gradient method for solving MPGs under softmax policy
parameterization, both tabular and parameterized with general function
approximators such as neural networks. We first show the asymptotic convergence
of this method to a Nash equilibrium of MPGs for tabular softmax policies.
Second, we derive the finite-time performance of the policy gradient in two
settings: 1) using the log-barrier regularization, and 2) using the natural
policy gradient under the best-response dynamics (NPG-BR). Finally, extending
the notion of price of anarchy (POA) and smoothness in normal-form games, we
introduce the POA for MPGs and provide a POA bound for NPG-BR. To our
knowledge, this is the first POA bound for solving MPGs. To support our
theoretical results, we empirically compare the convergence rates and POA of
policy gradient variants for both tabular and neural softmax policies.
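To make the tabular softmax setting concrete, here is a minimal sketch (not the authors' implementation) of independent softmax policy gradient on a small, randomly generated fully cooperative Markov game, the identical-reward special case of an MPG mentioned in the abstract. The game size, reward tensor, transition kernel, step size, and iteration count are all illustrative assumptions; the log-barrier and NPG-BR variants from the paper are not shown.

```python
import numpy as np

# A tiny 2-agent, 2-state, 2-action identical-interest Markov game.
# Fully cooperative games are a special case of Markov potential games,
# so independent softmax policy gradient should reach a stationary point.
nS, nA = 2, 2          # states, actions per agent (illustrative)
gamma = 0.9
rng = np.random.default_rng(0)

R = rng.uniform(0.0, 1.0, size=(nS, nA, nA))    # shared reward r(s, a1, a2)
P = rng.uniform(size=(nS, nA, nA, nS))          # transitions P(s' | s, a1, a2)
P /= P.sum(axis=-1, keepdims=True)
rho = np.full(nS, 1.0 / nS)                     # initial state distribution

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def evaluate(pi1, pi2):
    """Exact V, Q, and discounted state visitation for the joint product policy."""
    joint = pi1[:, :, None] * pi2[:, None, :]              # [s, a1, a2]
    r_pi = (joint * R).sum(axis=(1, 2))                    # [s]
    P_pi = (joint[..., None] * P).sum(axis=(1, 2))         # [s, s']
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)   # Bellman equation
    Q = R + gamma * P @ V                                  # [s, a1, a2]
    d = (1 - gamma) * np.linalg.solve(np.eye(nS) - gamma * P_pi.T, rho)
    return V, Q, d

def pg_step(theta1, theta2, lr=0.1):
    """One exact independent softmax policy gradient step for both agents."""
    pi1, pi2 = softmax(theta1), softmax(theta2)
    V, Q, d = evaluate(pi1, pi2)
    # Marginalized Q-values: average out the other agent's action.
    Q1 = (Q * pi2[:, None, :]).sum(axis=2)                 # [s, a1]
    Q2 = (Q * pi1[:, :, None]).sum(axis=1)                 # [s, a2]
    # Tabular softmax gradient: d(s) * pi(a|s) * advantage / (1 - gamma).
    g1 = d[:, None] * pi1 * (Q1 - V[:, None]) / (1 - gamma)
    g2 = d[:, None] * pi2 * (Q2 - V[:, None]) / (1 - gamma)
    return theta1 + lr * g1, theta2 + lr * g2, float(rho @ V)

theta1 = np.zeros((nS, nA))
theta2 = np.zeros((nS, nA))
for t in range(5000):
    theta1, theta2, J = pg_step(theta1, theta2)
print("final shared return J(pi):", J)
```

The step size here is a practical choice; the theory for softmax policy gradient typically requires a much smaller, discount-dependent step size for its convergence guarantees.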
Related papers
- Structure Matters: Dynamic Policy Gradient [1.747623282473278]
We introduce a framework called dynamic policy gradient (DynPG)
DynPG directly integrates dynamic programming with (any) policy gradient method.
Our findings contrast recent lower bound examples for vanilla policy gradient.
arXiv Detail & Related papers (2024-11-07T17:51:55Z) - Deterministic Policy Gradient Primal-Dual Methods for Continuous-Space Constrained MDPs [82.34567890576423]
We develop a deterministic policy gradient primal-dual method to find an optimal deterministic policy with non-asymptotic convergence.
We prove that the primal-dual iterates of D-PGPD converge at a sub-linear rate to an optimal regularized primal-dual pair.
To the best of our knowledge, this appears to be the first work that proposes a deterministic policy search method for continuous-space constrained MDPs.
arXiv Detail & Related papers (2024-08-19T14:11:04Z) - Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z) - Last-Iterate Convergent Policy Gradient Primal-Dual Methods for
Constrained MDPs [107.28031292946774]
We study the problem of computing an optimal policy of an infinite-horizon discounted constrained Markov decision process (constrained MDP)
We develop two single-time-scale policy-based primal-dual algorithms with non-asymptotic convergence of their policy iterates to an optimal constrained policy.
To the best of our knowledge, this work appears to be the first non-asymptotic policy last-iterate convergence result for single-time-scale algorithms in constrained MDPs.
arXiv Detail & Related papers (2023-06-20T17:27:31Z) - Matryoshka Policy Gradient for Entropy-Regularized RL: Convergence and Global Optimality [0.5261718469769449]
A novel Policy Gradient (PG) algorithm, called Matryoshka Policy Gradient (MPG), is introduced and studied.
We prove uniqueness and characterize the optimal policy of the entropy regularized objective, together with global convergence of MPG.
As a proof of concept, we numerically evaluate MPG on standard test benchmarks.
arXiv Detail & Related papers (2023-03-22T17:56:18Z) - The Role of Baselines in Policy Gradient Optimization [83.42050606055822]
We show that the state value baseline allows on-policy natural policy gradient (NPG) to converge to a globally optimal policy at an $O(1/t)$ rate.
We find that the primary effect of the value baseline is to reduce the aggressiveness of the updates rather than their variance.
arXiv Detail & Related papers (2023-01-16T06:28:00Z) - Anchor-Changing Regularized Natural Policy Gradient for Multi-Objective
Reinforcement Learning [17.916366827429034]
We study policy optimization for Markov decision processes (MDPs) with multiple reward value functions.
We propose an Anchor-changing Regularized Natural Policy Gradient framework, which can incorporate ideas from well-performing first-order methods.
arXiv Detail & Related papers (2022-06-10T21:09:44Z) - Convergence and sample complexity of natural policy gradient primal-dual methods for constrained MDPs [21.347689976296834]
We employ the natural policy gradient method to solve the discounted optimal control problem for constrained MDPs.
We also provide convergence and finite-sample guarantees for two sample-based NPG-PD algorithms.
arXiv Detail & Related papers (2022-06-06T04:28:04Z) - Softmax Policy Gradient Methods Can Take Exponential Time to Converge [60.98700344526674]
The softmax policy gradient (PG) method is arguably one of the de facto implementations of policy optimization in modern reinforcement learning.
We demonstrate that softmax PG methods can take exponential time -- in terms of $|\mathcal{S}|$ and $\frac{1}{1-\gamma}$ -- to converge.
arXiv Detail & Related papers (2021-02-22T18:56:26Z) - Zeroth-order Deterministic Policy Gradient [116.87117204825105]
We introduce Zeroth-order Deterministic Policy Gradient (ZDPG)
ZDPG approximates policy-reward gradients via two-point evaluations of the $Q$-function.
New finite sample complexity bounds for ZDPG improve upon existing results by up to two orders of magnitude.
arXiv Detail & Related papers (2020-06-12T16:52:29Z)