Scalable Primal-Dual Actor-Critic Method for Safe Multi-Agent RL with
General Utilities
- URL: http://arxiv.org/abs/2305.17568v1
- Date: Sat, 27 May 2023 20:08:35 GMT
- Title: Scalable Primal-Dual Actor-Critic Method for Safe Multi-Agent RL with
General Utilities
- Authors: Donghao Ying, Yunkai Zhang, Yuhao Ding, Alec Koppel, Javad Lavaei
- Abstract summary: We investigate safe multi-agent reinforcement learning, where agents seek to collectively maximize an aggregate sum of local objectives while satisfying their own safety constraints.
Our algorithm converges to a first-order stationary point (FOSP) at the rate of $\mathcal{O}\left(T^{-2/3}\right)$.
In the sample-based setting, we demonstrate that, with high probability, our algorithm requires $\widetilde{\mathcal{O}}\left(\epsilon^{-3.5}\right)$ samples to achieve an $\epsilon$-FOSP.
- Score: 12.104551746465932
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: We investigate safe multi-agent reinforcement learning, where agents seek to
collectively maximize an aggregate sum of local objectives while satisfying
their own safety constraints. The objective and constraints are described by
{\it general utilities}, i.e., nonlinear functions of the long-term
state-action occupancy measure, which encompass broader decision-making goals
such as risk, exploration, or imitations. The exponential growth of the
state-action space size with the number of agents presents challenges for
global observability, further exacerbated by the global coupling arising from
agents' safety constraints. To tackle this issue, we propose a primal-dual
method utilizing shadow reward and $\kappa$-hop neighbor truncation under a
form of correlation decay property, where $\kappa$ is the communication radius.
In the exact setting, our algorithm converges to a first-order stationary point
(FOSP) at the rate of $\mathcal{O}\left(T^{-2/3}\right)$. In the sample-based
setting, we demonstrate that, with high probability, our algorithm requires
$\widetilde{\mathcal{O}}\left(\epsilon^{-3.5}\right)$ samples to achieve an
$\epsilon$-FOSP with an approximation error of $\mathcal{O}(\phi_0^{2\kappa})$,
where $\phi_0\in (0,1)$. Finally, we demonstrate the effectiveness of our model
through extensive numerical experiments.
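To make the method description above concrete, below is a minimal Python sketch of one iteration of a primal-dual update with $\kappa$-hop truncation. It is a sketch under stated assumptions, not the authors' implementation: the dictionary-based agent layout, the callable `truncated_shadow_gradient` (standing in for the shadow-reward-based policy-gradient estimate), the callable `constraint_value`, and the step sizes are all illustrative. (Recall from the abstract that a general utility is a nonlinear function of the occupancy measure, e.g., an entropy term used to encourage exploration; the shadow reward is built from the gradient of that utility.)
```python
import numpy as np

def kappa_hop_neighborhood(agent, adjacency, kappa):
    """Indices of agents reachable from `agent` within kappa hops
    (kappa is the communication radius from the abstract)."""
    reached = {agent}
    frontier = {agent}
    for _ in range(kappa):
        frontier = {j for i in frontier
                    for j in np.flatnonzero(adjacency[i])} - reached
        reached |= frontier
    return sorted(reached)

def primal_dual_step(theta, lam, adjacency, kappa,
                     truncated_shadow_gradient, constraint_value,
                     eta_theta=1e-2, eta_lam=1e-2):
    """One illustrative primal-dual iteration, sketched per agent.

    theta: dict agent -> local policy parameters (numpy array)
    lam:   dict agent -> dual variable for that agent's safety constraint
    truncated_shadow_gradient(i, theta, lam, nbhd): estimate of agent i's
        policy gradient of the Lagrangian, built from the shadow reward
        and restricted to the kappa-hop neighborhood `nbhd`
    constraint_value(i, theta): current value of agent i's constraint,
        with the convention that values >= 0 are feasible
    """
    for i in theta:
        nbhd = kappa_hop_neighborhood(i, adjacency, kappa)
        grad_i = truncated_shadow_gradient(i, theta, lam, nbhd)
        theta[i] = theta[i] + eta_theta * grad_i          # primal ascent
        # dual descent, projected onto the nonnegative orthant
        lam[i] = max(0.0, lam[i] - eta_lam * constraint_value(i, theta))
    return theta, lam
```
The point the sketch is meant to convey is scalability: each agent's update uses only information from its $\kappa$-hop neighborhood, which the correlation-decay property justifies, and the $\mathcal{O}(\phi_0^{2\kappa})$ term in the abstract is the approximation error paid for this truncation.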
Related papers
- Cooperative Multi-Agent Constrained Stochastic Linear Bandits [2.099922236065961]
A network of $N$ agents communicate locally to minimize their collective regret while keeping their expected cost under a specified threshold $\tau$.
We propose a safe distributed upper-confidence-bound algorithm, called MA-OPLB, and establish a high-probability bound on its $T$-round regret (an illustrative sketch of a constrained-UCB selection step appears after this list).
We show that our regret bound is of order $\mathcal{O}\left(\frac{d}{\tau-c_0}\,\frac{\log(NT)^2}{\sqrt{N}}\,\sqrt{T\log(1/|\lambda|)}\right)$.
arXiv Detail & Related papers (2024-10-22T19:34:53Z) - Cooperative Thresholded Lasso for Sparse Linear Bandit [6.52540785559241]
We present a novel approach to address the multi-agent sparse contextual linear bandit problem.
It is the first algorithm to tackle row-wise distributed data in sparse linear bandits.
It is widely applicable to high-dimensional multi-agent problems where efficient feature extraction is critical for minimizing regret.
arXiv Detail & Related papers (2023-05-30T16:05:44Z) - Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs [60.40452803295326]
We propose a new reward-free algorithm for learning linear mixture Markov decision processes (MDPs).
At the core of our algorithm is uncertainty-weighted value-targeted regression with exploration-driven pseudo-reward.
We show that our algorithm only needs to explore $\tilde{O}(d^2\varepsilon^{-2})$ episodes to find an $\varepsilon$-optimal policy.
arXiv Detail & Related papers (2023-03-17T17:53:28Z) - Scalable Multi-Agent Reinforcement Learning with General Utilities [30.960413388976438]
We study scalable multi-agent reinforcement learning (MARL) with general utilities.
The objective is to find a localized policy that maximizes the average of the team's local utility functions without requiring each agent to have full observability of the team.
This is the first result in the literature on multi-agent RL with general utilities that does not require full observability.
arXiv Detail & Related papers (2023-02-15T20:47:43Z) - On Reward-Free RL with Kernel and Neural Function Approximations:
Single-Agent MDP and Markov Game [140.19656665344917]
We study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function.
We tackle this problem in the context of function approximation, leveraging powerful function approximators.
We establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.
arXiv Detail & Related papers (2021-10-19T07:26:33Z) - Reward-Free Model-Based Reinforcement Learning with Linear Function
Approximation [92.99933928528797]
We study model-based reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs).
In the planning phase, the agent is given a specific reward function and uses samples collected from the exploration phase to learn a good policy.
We show that to obtain an $\epsilon$-optimal policy for an arbitrary reward function, UCRL-RFE needs to sample at most $\tilde{O}(H^4 d(H+d)\epsilon^{-2})$ episodes.
arXiv Detail & Related papers (2021-10-12T23:03:58Z) - Gap-Dependent Unsupervised Exploration for Reinforcement Learning [40.990467706237396]
We present an efficient algorithm for task-agnostic reinforcement learning.
The algorithm takes only $\widetilde{\mathcal{O}}\left(1/\epsilon \cdot (H^3SA/\rho + H^4S^2A)\right)$ episodes of exploration.
We show that, information-theoretically, this bound is nearly tight for $\rho \geq \Theta(1/(HS))$ and $H>1$.
arXiv Detail & Related papers (2021-08-11T20:42:46Z) - Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal
Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $\tilde{O}(|S||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error.
We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z) - Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDPs with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z) - Reward-Free Exploration for Reinforcement Learning [82.3300753751066]
We propose a new "reward-free RL" framework to isolate the challenges of exploration.
We give an efficient algorithm that conducts $\tilde{\mathcal{O}}(S^2A\,\mathrm{poly}(H)/\epsilon^2)$ episodes of exploration.
We also give a nearly-matching $\Omega(S^2AH^2/\epsilon^2)$ lower bound, demonstrating the near-optimality of our algorithm in this setting.
arXiv Detail & Related papers (2020-02-07T14:03:38Z)
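As noted in the Cooperative Multi-Agent Constrained Stochastic Linear Bandits entry above, here is a generic, hedged sketch of a constrained upper-confidence-bound selection step for a single agent. It is not MA-OPLB itself: the confidence radius `beta`, the cost convention, and the fallback rule are illustrative assumptions, and the paper's distributed algorithm additionally shares statistics across the agent network.
```python
import numpy as np

def constrained_ucb_action(features, theta_reward, theta_cost,
                           gram_inv, beta, tau):
    """Pick an arm optimistically for reward among arms whose pessimistic
    (upper-confidence) cost estimate stays below the threshold tau.

    features:     (K, d) array, one feature vector per arm
    theta_reward: (d,) ridge estimate of the reward parameter
    theta_cost:   (d,) ridge estimate of the cost parameter
    gram_inv:     (d, d) inverse regularized Gram matrix V^{-1}
    beta:         confidence radius
    tau:          cost threshold
    """
    # ||x_k||_{V^{-1}} for every arm k
    widths = np.sqrt(np.einsum("kd,de,ke->k", features, gram_inv, features))
    ucb_reward = features @ theta_reward + beta * widths   # optimistic reward
    ucb_cost = features @ theta_cost + beta * widths       # pessimistic cost
    feasible = np.flatnonzero(ucb_cost <= tau)
    if feasible.size == 0:
        # illustrative fallback: play the arm with the smallest pessimistic cost
        return int(np.argmin(ucb_cost))
    return int(feasible[np.argmax(ucb_reward[feasible])])

# Tiny usage example with random data (shapes only; not from the paper)
rng = np.random.default_rng(0)
K, d = 5, 3
arm = constrained_ucb_action(
    features=rng.normal(size=(K, d)),
    theta_reward=rng.normal(size=d),
    theta_cost=rng.normal(size=d),
    gram_inv=np.eye(d),
    beta=0.5,
    tau=0.2,
)
print("selected arm:", arm)
```
The design choice illustrated is the optimistic-in-reward, pessimistic-in-cost split common to safe bandit methods: the same confidence width is added to the reward estimate (optimism) and to the cost estimate (pessimism before the threshold check).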
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality or accuracy of this information and is not responsible for any consequences arising from its use.