Provably Learning Nash Policies in Constrained Markov Potential Games
- URL: http://arxiv.org/abs/2306.07749v1
- Date: Tue, 13 Jun 2023 13:08:31 GMT
- Title: Provably Learning Nash Policies in Constrained Markov Potential Games
- Authors: Pragnya Alatur, Giorgia Ramponi, Niao He, Andreas Krause
- Abstract summary: Multi-agent reinforcement learning (MARL) addresses sequential decision-making problems with multiple agents.
Constrained Markov Games (CMGs) are a natural formalism for safe MARL problems, though generally intractable.
- Score: 90.87573337770293
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-agent reinforcement learning (MARL) addresses sequential
decision-making problems with multiple agents, where each agent optimizes its
own objective. In many real-world instances, the agents may not only want to
optimize their objectives, but also ensure safe behavior. For example, in
traffic routing, each car (agent) aims to reach its destination quickly
(objective) while avoiding collisions (safety). Constrained Markov Games (CMGs)
are a natural formalism for safe MARL problems, though generally intractable.
In this work, we introduce and study Constrained Markov Potential Games
(CMPGs), an important class of CMGs. We first show that a Nash policy for CMPGs
can be found via constrained optimization. A tempting approach is to solve it
with Lagrangian-based primal-dual methods. However, as we show, and in contrast
to the single-agent setting, CMPGs do not satisfy strong duality, rendering
such approaches inapplicable and potentially unsafe. To solve the CMPG problem,
we propose our algorithm Coordinate-Ascent for CMPGs (CA-CMPG), which provably
converges to a Nash policy in tabular, finite-horizon CMPGs. Furthermore, we
provide the first sample complexity bounds for learning Nash policies in
unknown CMPGs, which, under additional assumptions, guarantee safe exploration.
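As a rough illustration (not from the paper), the sketch below mimics the coordinate-ascent idea on a toy single-state, two-agent constrained potential game: each agent in turn switches to its best action with respect to the shared potential, restricted to actions that respect its own cost budget. This is only an assumption-laden caricature of CA-CMPG, which operates on tabular, finite-horizon CMPGs with policies rather than single actions; the game instance and all names here are invented for illustration.

```python
# Toy illustration of the coordinate-ascent idea behind CA-CMPG. NOTE: this is
# not the authors' algorithm (which handles tabular, finite-horizon CMPGs);
# it is a single-state, two-agent constrained potential game made up here.

ACTIONS = [0, 1, 2]   # each agent's finite action set
BUDGET = 1.0          # per-agent cost budget (constraint threshold)

def potential(a):
    # Common potential: any agent's gain from a unilateral deviation equals
    # the change in this function, as in a (Markov) potential game.
    return -(a[0] - 2) ** 2 - (a[1] - 2) ** 2 + 0.5 * a[0] * a[1]

def cost(i, a):
    # Agent i's constraint cost; profile a is feasible for i if cost <= BUDGET.
    return 0.6 * a[i]

def best_feasible_response(i, a):
    # Best action for agent i w.r.t. the potential, the other agent fixed,
    # restricted to actions that keep agent i's own cost within budget.
    feasible = [ai for ai in ACTIONS
                if cost(i, a[:i] + (ai,) + a[i + 1:]) <= BUDGET]
    return max(feasible, key=lambda ai: potential(a[:i] + (ai,) + a[i + 1:]))

def coordinate_ascent(a, max_rounds=50):
    # Cycle over agents, letting each switch to its best feasible response.
    # In a finite potential game this terminates at a profile where no agent
    # can improve unilaterally while staying feasible, i.e. a constrained
    # Nash point of the toy game.
    for _ in range(max_rounds):
        changed = False
        for i in range(len(a)):
            ai = best_feasible_response(i, a)
            if ai != a[i]:
                a = a[:i] + (ai,) + a[i + 1:]
                changed = True
        if not changed:
            break
    return a

if __name__ == "__main__":
    print("constrained Nash profile:", coordinate_ascent((0, 0)))  # -> (1, 1)
```

In the paper's setting the per-agent subproblem is a constrained MDP rather than a one-shot action choice and must be solved with a safe single-agent RL subroutine; the toy above only mirrors the outer coordination loop.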
Related papers
- Last-Iterate Global Convergence of Policy Gradients for Constrained Reinforcement Learning [62.81324245896717]
We introduce an exploration-agnostic algorithm, called C-PG, which exhibits global last-iterate convergence guarantees under (weak) gradient domination assumptions.
We numerically validate our algorithms on constrained control problems, and compare them with state-of-the-art baselines.
arXiv Detail & Related papers (2024-07-15T14:54:57Z) - Safe Multi-Agent Reinforcement Learning with Bilevel Optimization in Autonomous Driving [3.5293763645151404]
We propose a safe MARL method grounded in a Stackelberg model with bi-level optimization.
We develop two practical algorithms, namely Constrained Stackelberg Q-learning (CSQ) and Constrained Stackelberg Multi-Agent Deep Deterministic Policy Gradient (CS-MADDPG).
Our algorithms, CSQ and CS-MADDPG, outperform several strong MARL baselines, such as Bi-AC, MACPO, and MAPPO-L, regarding reward and safety performance.
arXiv Detail & Related papers (2024-05-28T14:15:18Z) - Uniformly Safe RL with Objective Suppression for Multi-Constraint Safety-Critical Applications [73.58451824894568]
The widely adopted CMDP model constrains the risks in expectation, which makes room for dangerous behaviors in long-tail states.
In safety-critical domains, such behaviors could lead to disastrous outcomes.
We propose Objective Suppression, a novel method that adaptively suppresses the task reward maximizing objectives according to a safety critic.
arXiv Detail & Related papers (2024-02-23T23:22:06Z) - What is the Solution for State-Adversarial Multi-Agent Reinforcement Learning? [22.863241480702012]
Policies learned through Deep Reinforcement Learning (DRL) are susceptible to adversarial state perturbation attacks.
We propose a State-Adversarial Markov Game (SAMG) and make the first attempt to investigate different solution concepts of MARL under state uncertainties.
arXiv Detail & Related papers (2022-12-06T01:57:33Z) - Near-Optimal Multi-Agent Learning for Safe Coverage Control [76.99020416197631]
In multi-agent coverage control problems, agents navigate their environment to reach locations that maximize the coverage of some density.
In this paper, we aim to efficiently learn the density to approximately solve the coverage problem while preserving the agents' safety.
We give first of its kind results: near optimal coverage in finite time while provably guaranteeing safety.
arXiv Detail & Related papers (2022-10-12T16:33:34Z) - Revisiting Some Common Practices in Cooperative Multi-Agent Reinforcement Learning [11.91425153754564]
We show that in environments with a highly multi-modal reward landscape, value decomposition and parameter sharing can be problematic and lead to undesired outcomes.
In contrast, policy gradient (PG) methods with individual policies provably converge to an optimal solution in these cases.
We present practical suggestions on implementing multi-agent PG algorithms for either high rewards or diverse emergent behaviors.
arXiv Detail & Related papers (2022-06-15T13:03:05Z) - Learning to Coordinate in Multi-Agent Systems: A Coordinated Actor-Critic Algorithm and Finite-Time Guarantees [43.10380224532313]
We study the emergence of coordinated behavior by autonomous agents using an actor-critic (AC) algorithm.
We propose and analyze a class of coordinated actor-critic algorithms (CAC) in which individually parametrized policies have a shared part and a personalized part.
This work provides the first finite-sample guarantee for decentralized AC algorithms with partially personalized policies.
arXiv Detail & Related papers (2021-10-11T20:26:16Z) - Multi-Agent Constrained Policy Optimisation [17.772811770726296]
We formulate the safe MARL problem as a constrained Markov game and solve it with policy optimisation methods.
Our solutions -- Multi-Agent Constrained Policy optimisation (MACPO) and MAPPO-Lagrangian -- leverage the theories from both constrained policy optimisation and multi-agent trust region learning.
We develop the benchmark suite of Safe Multi-Agent MuJoCo that involves a variety of MARL baselines.
arXiv Detail & Related papers (2021-10-06T14:17:09Z) - Softmax Policy Gradient Methods Can Take Exponential Time to Converge [60.98700344526674]
The softmax policy gradient (PG) method is arguably one of the de facto implementations of policy optimization in modern reinforcement learning.
We demonstrate that softmax PG methods can take exponential time -- in terms of $|\mathcal{S}|$ and $\frac{1}{1-\gamma}$ -- to converge.
arXiv Detail & Related papers (2021-02-22T18:56:26Z) - FACMAC: Factored Multi-Agent Centralised Policy Gradients [103.30380537282517]
We propose FACtored Multi-Agent Centralised policy gradients (FACMAC), a new method for cooperative multi-agent reinforcement learning in both discrete and continuous action spaces.
We evaluate FACMAC on variants of the multi-agent particle environments, a novel multi-agent MuJoCo benchmark, and a challenging set of StarCraft II micromanagement tasks.
arXiv Detail & Related papers (2020-03-14T21:29:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.