Scalable Multi-Agent Reinforcement Learning with General Utilities
- URL: http://arxiv.org/abs/2302.07938v2
- Date: Sun, 27 Aug 2023 00:08:01 GMT
- Title: Scalable Multi-Agent Reinforcement Learning with General Utilities
- Authors: Donghao Ying, Yuhao Ding, Alec Koppel, Javad Lavaei
- Abstract summary: We study the scalable multi-agent reinforcement learning (MARL) with general utilities.
The objective is to find a localized policy that maximizes the average of the team's local utility functions without the full observability of each agent in the team.
This is the first result in the literature on multi-agent RL with general utilities that does not require the full observability.
- Score: 30.960413388976438
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: We study the scalable multi-agent reinforcement learning (MARL) with general
utilities, defined as nonlinear functions of the team's long-term state-action
occupancy measure. The objective is to find a localized policy that maximizes
the average of the team's local utility functions without the full
observability of each agent in the team. By exploiting the spatial correlation
decay property of the network structure, we propose a scalable distributed
policy gradient algorithm with shadow reward and localized policy that consists
of three steps: (1) shadow reward estimation, (2) truncated shadow Q-function
estimation, and (3) truncated policy gradient estimation and policy update. Our
algorithm converges, with high probability, to $\epsilon$-stationarity with
$\widetilde{\mathcal{O}}(\epsilon^{-2})$ samples up to some approximation error
that decreases exponentially in the communication radius. This is the first
result in the literature on multi-agent RL with general utilities that does not
require the full observability.
Related papers
- Efficient Reinforcement Learning for Global Decision Making in the Presence of Local Agents at Scale [5.3526997662068085]
We study reinforcement learning for global decision-making in the presence of local agents.
In this setting, scalability has been a long-standing challenge due to the size of the state space.
We show that this learned policy converges to the optimal policy in the order of $tildeO (1/sqrtk+epsilon_k,m)$ as the number of sub-sampled agents increases.
arXiv Detail & Related papers (2024-03-01T01:49:57Z) - Scalable Primal-Dual Actor-Critic Method for Safe Multi-Agent RL with
General Utilities [12.104551746465932]
We investigate safe multi-agent reinforcement learning, where agents seek to collectively maximize an aggregate sum of local objectives while satisfying their own safety constraints.
Our algorithm converges to a first-order stationary point (FOSP) at the rate of $mathcalOleft(T-2/3right)$.
In the sample-based setting, we demonstrate that, with high probability, our algorithm requires $widetildemathcalOleft(epsilon-3.5right)$ samples to achieve an $epsilon$-FOSP.
arXiv Detail & Related papers (2023-05-27T20:08:35Z) - Local Optimization Achieves Global Optimality in Multi-Agent
Reinforcement Learning [139.53668999720605]
We present a multi-agent PPO algorithm in which the local policy of each agent is updated similarly to vanilla PPO.
We prove that with standard regularity conditions on the Markov game and problem-dependent quantities, our algorithm converges to the globally optimal policy at a sublinear rate.
arXiv Detail & Related papers (2023-05-08T16:20:03Z) - Global Convergence of Localized Policy Iteration in Networked
Multi-Agent Reinforcement Learning [25.747559058350557]
We study a multi-agent reinforcement learning (MARL) problem where the agents interact over a given network.
The goal of the agents is to cooperatively maximize the average of their entropy-regularized long-term rewards.
To overcome the curse of dimensionality and to reduce communication, we propose a Localized Policy Iteration (LPI) that provably learns a near-globally-optimal policy using only local information.
arXiv Detail & Related papers (2022-11-30T15:58:00Z) - Provably Efficient Offline Reinforcement Learning with Trajectory-Wise
Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED)
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value based on the learned proxy reward.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDP with trajectory-wise reward.
arXiv Detail & Related papers (2022-06-13T19:11:22Z) - MDPGT: Momentum-based Decentralized Policy Gradient Tracking [29.22173174168708]
We propose a momentum-based decentralized policy gradient tracking (MDPGT) for multi-agent reinforcement learning.
MDPGT achieves the best available sample complexity of $mathcalO(N-1epsilon-3)$ for converging to an $epsilon-stationary point of the global average of $N$ local performance functions.
This outperforms the state-of-the-art sample complexity in decentralized model-free reinforcement learning.
arXiv Detail & Related papers (2021-12-06T06:55:51Z) - Online Sub-Sampling for Reinforcement Learning with General Function
Approximation [111.01990889581243]
In this paper, we establish an efficient online sub-sampling framework that measures the information gain of data points collected by an RL algorithm.
For a value-based method with complexity-bounded function class, we show that the policy only needs to be updated for $proptooperatornamepolylog(K)$ times.
In contrast to existing approaches that update the policy for at least $Omega(K)$ times, our approach drastically reduces the number of optimization calls in solving for a policy.
arXiv Detail & Related papers (2021-06-14T07:36:25Z) - Policy Mirror Descent for Regularized Reinforcement Learning: A
Generalized Framework with Linear Convergence [60.20076757208645]
This paper proposes a general policy mirror descent (GPMD) algorithm for solving regularized RL.
We demonstrate that our algorithm converges linearly over an entire range learning rates, in a dimension-free fashion, to the global solution.
arXiv Detail & Related papers (2021-05-24T02:21:34Z) - Breaking the Deadly Triad with a Target Network [80.82586530205776]
The deadly triad refers to the instability of a reinforcement learning algorithm when it employs off-policy learning, function approximation, and bootstrapping simultaneously.
We provide the first convergent linear $Q$-learning algorithms under nonrestrictive and changing behavior policies without bi-level optimization.
arXiv Detail & Related papers (2021-01-21T21:50:10Z) - Cooperative Multi-Agent Reinforcement Learning with Partial Observations [16.895704973433382]
We propose a distributed zeroth-order policy optimization method for Multi-Agent Reinforcement Learning (MARL)
It allows the agents to compute the local policy gradients needed to update their local policy functions using local estimates of the global accumulated rewards.
We show that the proposed distributed zeroth-order policy optimization method with constant stepsize converges to the neighborhood of a policy that is a stationary point of the global objective function.
arXiv Detail & Related papers (2020-06-18T19:36:22Z) - Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present an provably efficient online policy optimization algorithm for CMDP with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.