Symmetric (Optimistic) Natural Policy Gradient for Multi-agent Learning with Parameter Convergence
- URL: http://arxiv.org/abs/2210.12812v2
- Date: Mon, 20 Mar 2023 13:56:49 GMT
- Title: Symmetric (Optimistic) Natural Policy Gradient for Multi-agent Learning with Parameter Convergence
- Authors: Sarath Pattathil, Kaiqing Zhang, Asuman Ozdaglar
- Abstract summary: We investigate the global convergence of natural policy gradient (NPG) algorithms in multi-agent learning.
We propose NPG variants with global last-iterate parameter convergence guarantees for several standard multi-agent learning scenarios.
- Score: 18.412945308419033
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-agent interactions are increasingly important in the context of
reinforcement learning, and the theoretical foundations of policy gradient
methods have attracted surging research interest. We investigate the global
convergence of natural policy gradient (NPG) algorithms in multi-agent
learning. We first show that vanilla NPG may not have parameter convergence,
i.e., the convergence of the vector that parameterizes the policy, even when
the costs are regularized (which enabled strong convergence guarantees in the
policy space in the literature). This non-convergence of parameters leads to
stability issues in learning, which becomes especially relevant in the function
approximation setting, where we can only operate on low-dimensional parameters,
instead of the high-dimensional policy. We then propose variants of the NPG
algorithm, for several standard multi-agent learning scenarios: two-player
zero-sum matrix and Markov games, and multi-player monotone games, with global
last-iterate parameter convergence guarantees. We also generalize the results
to certain function approximation settings. Note that in our algorithms, the
agents take symmetric roles. Our results might also be of independent interest
for solving nonconvex-nonconcave minimax optimization problems with certain
structures. Simulations are also provided to corroborate our theoretical
findings.
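To make the symmetric, simultaneous update structure described above concrete, the following is a minimal sketch, not the paper's exact algorithm: it assumes softmax policies over logits, a matching-pennies payoff matrix, an entropy-regularized cost, and hand-picked step size eta and regularization weight tau. The multiplicative-weights-style logit update and the extragradient-style "optimistic" extrapolation (as well as the helper names softmax, npg_step, run) are illustrative stand-ins.

# Illustrative sketch (assumptions as stated above): symmetric, simultaneous
# natural-policy-gradient-style updates for an entropy-regularized two-player
# zero-sum matrix game, plus an optimistic (extragradient-style) variant.
import numpy as np

A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])    # row player x minimizes x^T A y, column player y maximizes
eta, tau, T = 0.1, 0.05, 500   # step size, entropy weight, iterations (illustrative values)

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def npg_step(theta_x, theta_y, y_for_x, x_for_y):
    # One multiplicative-weights-style NPG step in logit (parameter) space.
    # Both players update simultaneously (symmetric roles); the opponent
    # policies y_for_x / x_for_y may be the current or extrapolated iterates.
    new_x = (1 - eta * tau) * theta_x - eta * (A @ y_for_x)
    new_y = (1 - eta * tau) * theta_y + eta * (A.T @ x_for_y)
    return new_x, new_y

def run(optimistic):
    theta_x, theta_y = np.array([1.0, -1.0]), np.array([-0.5, 0.5])
    for _ in range(T):
        px, py = softmax(theta_x), softmax(theta_y)
        if optimistic:
            # Extrapolation step, then update against the predicted opponent play.
            hx, hy = npg_step(theta_x, theta_y, py, px)
            theta_x, theta_y = npg_step(theta_x, theta_y, softmax(hy), softmax(hx))
        else:
            theta_x, theta_y = npg_step(theta_x, theta_y, py, px)
    return theta_x, theta_y, softmax(theta_x), softmax(theta_y)

for opt in (False, True):
    tx, ty, px, py = run(opt)
    print("optimistic" if opt else "vanilla   ",
          "logits_x:", np.round(tx, 3), "policy_x:", np.round(px, 3))

Printing the final logits and policies for both runs gives a rough sense of how the two update rules behave in parameter space; the paper's formal guarantees concern last-iterate convergence of the parameters themselves, which this toy script does not attempt to reproduce.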
Related papers
- Learning Optimal Deterministic Policies with Stochastic Policy Gradients [62.81324245896716]
Policy gradient (PG) methods are successful approaches to deal with continuous reinforcement learning (RL) problems.
In common practice, stochastic (hyper)policies are learned only to deploy their deterministic version.
We show how to tune the exploration level used for learning to optimize the trade-off between the sample complexity and the performance of the deployed deterministic policy.
arXiv Detail & Related papers (2024-05-03T16:45:15Z)
- High-probability sample complexities for policy evaluation with linear function approximation [88.87036653258977]
We investigate the sample complexities required to guarantee a predefined estimation error of the best linear coefficients for two widely-used policy evaluation algorithms.
We establish the first sample complexity bound with high-probability convergence guarantee that attains the optimal dependence on the tolerance level.
arXiv Detail & Related papers (2023-05-30T12:58:39Z)
- Local Optimization Achieves Global Optimality in Multi-Agent Reinforcement Learning [139.53668999720605]
We present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO.
We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate.
arXiv Detail & Related papers (2023-05-08T16:20:03Z)
- Smoothing Policy Iteration for Zero-sum Markov Games [9.158672246275348]
We propose the smoothing policy iteration (SPI) algorithm to solve zero-sum Markov games (MGs) approximately.
Specifically, the adversarial policy serves as the weight function to enable efficient sampling over action spaces.
We also propose a model-based algorithm called Smooth adversarial Actor-critic (SaAC) by extending SPI with function approximation.
arXiv Detail & Related papers (2022-12-03T14:39:06Z)
- Faster Last-iterate Convergence of Policy Optimization in Zero-Sum Markov Games [63.60117916422867]
This paper focuses on the most basic setting of competitive multi-agent RL, namely two-player zero-sum Markov games.
We propose a single-loop policy optimization method with symmetric updates from both agents, where the policy is updated via the entropy-regularized optimistic multiplicative weights update (OMWU) method.
Our convergence results improve upon the best known complexities, and lead to a better understanding of policy optimization in competitive Markov games.
arXiv Detail & Related papers (2022-10-03T16:05:43Z)
- Convergence and sample complexity of natural policy gradient primal-dual methods for constrained MDPs [21.347689976296834]
We employ the natural policy gradient (NPG) method to solve discounted constrained Markov decision process (MDP) problems.
We also provide convergence and finite-sample guarantees for two sample-based NPG-PD algorithms.
arXiv Detail & Related papers (2022-06-06T04:28:04Z)
- Dimensionality Reduction and Prioritized Exploration for Policy Search [29.310742141970394]
Black-box policy optimization is a class of reinforcement learning algorithms that explores and updates the policies at the parameter level.
We present a novel method to prioritize the exploration of effective parameters and cope with full covariance matrix updates.
Our algorithm learns faster than recent approaches and requires fewer samples to achieve state-of-the-art results.
arXiv Detail & Related papers (2022-03-09T15:17:09Z)
- On the Hidden Biases of Policy Mirror Ascent in Continuous Action Spaces [23.186300629667134]
We study the convergence of policy gradient algorithms under heavy-tailed parameterizations.
Our main theoretical contribution is establishing that this scheme converges with constant step and batch sizes.
arXiv Detail & Related papers (2022-01-28T18:54:30Z)
- A general sample complexity analysis of vanilla policy gradient [101.16957584135767]
Policy gradient (PG) is one of the most popular methods for solving reinforcement learning (RL) problems.
We provide a general sample complexity analysis of the "vanilla" PG method.
arXiv Detail & Related papers (2021-07-23T19:38:17Z)
- Computational Performance of Deep Reinforcement Learning to find Nash Equilibria [0.0]
We use a deep reinforcement learning algorithm to learn Nash equilibria in a setting where firms compete in prices.
Although the algorithm is model-free, it utilizes a large set of parameters in its various steps.
We find parameter choices that can reach convergence rates of up to 99%.
arXiv Detail & Related papers (2021-04-26T22:14:17Z)
- Global Convergence of Policy Gradient for Linear-Quadratic Mean-Field Control/Game in Continuous Time [109.06623773924737]
We study the policy gradient method for the linear-quadratic mean-field control and game.
We show that it converges to the optimal solution at a linear rate, which is verified by a synthetic simulation.
arXiv Detail & Related papers (2020-08-16T06:34:11Z)