MARL with General Utilities via Decentralized Shadow Reward Actor-Critic
- URL: http://arxiv.org/abs/2106.00543v1
- Date: Sat, 29 May 2021 19:05:48 GMT
- Title: MARL with General Utilities via Decentralized Shadow Reward Actor-Critic
- Authors: Junyu Zhang, Amrit Singh Bedi, Mengdi Wang, and Alec Koppel
- Abstract summary: Decentralized Shadow Reward Actor-Critic (DSAC).
Agents alternate between policy evaluation (critic), weighted averaging with neighbors (information mixing), and local gradient updates for their policy parameters (actor).
Experiments demonstrate the merits of goals beyond the cumulative return in cooperative MARL.
- Score: 43.249231735737865
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We posit a new mechanism for cooperation in multi-agent reinforcement
learning (MARL) based upon any nonlinear function of the team's long-term
state-action occupancy measure, i.e., a \emph{general utility}. This subsumes
the cumulative return but also allows one to incorporate risk-sensitivity,
exploration, and priors. We derive the Decentralized Shadow Reward Actor-Critic (DSAC), in which agents alternate between policy
evaluation (critic), weighted averaging with neighbors (information mixing),
and local gradient updates for their policy parameters (actor). DSAC augments
the classic critic step by requiring agents to (i) estimate their local
occupancy measure in order to (ii) estimate the derivative of the local utility
with respect to their occupancy measure, i.e., the "shadow reward". DSAC
converges to $\epsilon$-stationarity in $\mathcal{O}(1/\epsilon^{2.5})$ steps, or in $\mathcal{O}(1/\epsilon^{2})$ steps with more communication, with high probability. We further establish the non-existence of spurious stationary points for this problem; that is, DSAC finds the globally optimal policy.
Experiments demonstrate the merits of goals beyond the cumulative return in
cooperative MARL.
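To make the critic, information-mixing, and actor steps concrete, here is a minimal, hedged sketch of one DSAC-style iteration for tabular softmax policies. The environment sampler, the finite-difference shadow-reward estimate, the REINFORCE-style actor update, and all sizes and step sizes are illustrative assumptions rather than the paper's exact estimators.

```python
# Minimal sketch of one DSAC-style iteration (illustrative, not the paper's exact method).
import numpy as np

S, A, N, GAMMA = 5, 3, 4, 0.95        # states, actions, agents, discount (assumed sizes)

def softmax_policy(theta):
    """Tabular softmax policy from (S, A) logits."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def occupancy_estimate(trajectories):
    """Monte-Carlo estimate of the discounted state-action occupancy measure."""
    d = np.zeros((S, A))
    for traj in trajectories:
        for t, (s, a) in enumerate(traj):
            d[s, a] += GAMMA ** t
    return (1.0 - GAMMA) * d / max(len(trajectories), 1)

def shadow_reward(f_i, d_i, eps=1e-4):
    """'Shadow reward': derivative of the local utility f_i with respect to the
    occupancy measure, estimated here by finite differences for simplicity."""
    r = np.zeros_like(d_i)
    for s in range(S):
        for a in range(A):
            d_plus = d_i.copy()
            d_plus[s, a] += eps
            r[s, a] = (f_i(d_plus) - f_i(d_i)) / eps
    return r

def reinforce_grad(trajectories, pi, reward):
    """Score-function policy gradient for a tabular softmax policy, treating the
    (mixed) shadow reward as the per-step reward signal."""
    g = np.zeros((S, A))
    for traj in trajectories:
        ret, returns = 0.0, np.zeros(len(traj))
        for t in reversed(range(len(traj))):        # discounted return-to-go
            s, a = traj[t]
            ret = reward[s, a] + GAMMA * ret
            returns[t] = ret
        for t, (s, a) in enumerate(traj):
            score = -pi[s]                           # d log pi(a|s) / d theta[s, :]
            score[a] += 1.0
            g[s] += (GAMMA ** t) * returns[t] * score
    return g / max(len(trajectories), 1)

def dsac_iteration(thetas, local_utils, W, sample, lr=0.1):
    """thetas: per-agent (S, A) logits; local_utils: callables f_i(d) -> float;
    W: (N, N) doubly stochastic mixing matrix; sample(pi) -> list of [(s, a), ...]."""
    pis = [softmax_policy(th) for th in thetas]
    trajs = [sample(pi) for pi in pis]
    # Critic: estimate local occupancy measures, then local shadow rewards.
    rewards = [shadow_reward(local_utils[i], occupancy_estimate(trajs[i]))
               for i in range(N)]
    # Information mixing: weighted averaging of shadow rewards with neighbors.
    mixed = [sum(W[i, j] * rewards[j] for j in range(N)) for i in range(N)]
    # Actor: local policy-gradient ascent using the mixed shadow reward.
    return [thetas[i] + lr * reinforce_grad(trajs[i], pis[i], mixed[i])
            for i in range(N)]
```

For instance, a linear utility $f_i(d) = \langle r_i, d\rangle$ recovers the usual cumulative return (its shadow reward is just $r_i$), while adding an entropy term such as $-\sum_{s,a} d(s,a)\log d(s,a)$ yields an exploration-promoting objective.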
Related papers
- Mean-Field Sampling for Cooperative Multi-Agent Reinforcement Learning [4.899818550820576]
We propose a new algorithm for multi-agent reinforcement learning.
We show that the learned policy converges to the optimal policy at a rate of order $\tilde{O}(1/\sqrt{k})$, where $k$ is the number of subsampled agents.
arXiv Detail & Related papers (2024-12-01T03:45:17Z) - Near-Optimal Dynamic Regret for Adversarial Linear Mixture MDPs [63.47351876442425]
We study episodic linear mixture MDPs with unknown transitions and adversarial rewards under full-information feedback.
We propose a novel algorithm that combines the benefits of two popular methods: occupancy-measure-based and policy-based.
Our algorithm enjoys an $\widetilde{\mathcal{O}}(d\sqrt{H^3 K} + \sqrt{HK(H + \bar{P}_K)})$ dynamic regret, where $d$ is the feature dimension.
arXiv Detail & Related papers (2024-11-05T13:55:52Z) - Cooperative Multi-Agent Constrained Stochastic Linear Bandits [2.099922236065961]
A network of $N$ agents communicate locally to minimize their collective regret while keeping their expected cost under a specified threshold $\tau$.
We propose a safe distributed upper confidence bound algorithm, called \textit{MA-OPLB}, and establish a high-probability bound on its $T$-round regret.
We show that our regret bound is of order $\mathcal{O}\left(\frac{d}{\tau - c_0}\,\frac{\log(NT)^2}{\sqrt{N}}\,\sqrt{\frac{T}{\log(1/|\lambda|)}}\right)$.
arXiv Detail & Related papers (2024-10-22T19:34:53Z) - Scalable Primal-Dual Actor-Critic Method for Safe Multi-Agent RL with
General Utilities [12.104551746465932]
We investigate safe multi-agent reinforcement learning, where agents seek to collectively maximize an aggregate sum of local objectives while satisfying their own safety constraints.
Our algorithm converges to a first-order stationary point (FOSP) at the rate of $\mathcal{O}\left(T^{-2/3}\right)$.
In the sample-based setting, we demonstrate that, with high probability, our algorithm requires $\widetilde{\mathcal{O}}\left(\epsilon^{-3.5}\right)$ samples to achieve an $\epsilon$-FOSP.
arXiv Detail & Related papers (2023-05-27T20:08:35Z) - Policy Mirror Descent Inherently Explores Action Space [10.772560347950053]
We establish for the first time an $\tilde{\mathcal{O}}(1/\epsilon^2)$ sample complexity for online policy gradient methods without any exploration strategies.
New policy gradient methods can prevent repeatedly committing to potentially high-risk actions when searching for optimal policies.
arXiv Detail & Related papers (2023-03-08T05:19:08Z) - Scalable Multi-Agent Reinforcement Learning with General Utilities [30.960413388976438]
We study scalable multi-agent reinforcement learning (MARL) with general utilities.
The objective is to find a localized policy that maximizes the average of the team's local utility functions without requiring full observability of each agent in the team.
This is the first result in the literature on multi-agent RL with general utilities that does not require full observability.
arXiv Detail & Related papers (2023-02-15T20:47:43Z) - Provably Efficient Offline Reinforcement Learning with Trajectory-Wise
Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy reward.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDP with trajectory-wise reward.
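As a rough, hedged illustration of the least-squares reward-redistribution idea above (the linear feature map, ridge regularization, and data layout are assumptions, not PARTED's exact construction): each trajectory contributes one equation saying that its per-step proxy rewards should sum to the observed trajectory return.

```python
# Hypothetical sketch of least-squares reward redistribution: fit per-step proxy
# rewards phi(s, a) @ theta whose per-trajectory sums match the observed returns.
import numpy as np

def redistribute_rewards(trajectory_features, trajectory_returns, reg=1e-3):
    """trajectory_features: list of (T_i, d) arrays of per-step features phi(s, a).
    trajectory_returns: list of scalar trajectory-wise returns R_i."""
    X = np.stack([feats.sum(axis=0) for feats in trajectory_features])  # (n, d)
    y = np.asarray(trajectory_returns, dtype=float)                     # (n,)
    # Ridge least squares: minimize sum_i (R_i - X_i @ theta)^2 + reg * ||theta||^2.
    theta = np.linalg.solve(X.T @ X + reg * np.eye(X.shape[1]), X.T @ y)
    return theta   # the proxy reward at (s, a) is phi(s, a) @ theta
```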
arXiv Detail & Related papers (2022-06-13T19:11:22Z) - Anti-Concentrated Confidence Bonuses for Scalable Exploration [57.91943847134011]
Intrinsic rewards play a central role in handling the exploration-exploitation trade-off.
We introduce anti-concentrated confidence bounds for efficiently approximating the elliptical bonus.
We develop a practical variant for deep reinforcement learning that is competitive with contemporary intrinsic rewards on Atari benchmarks.
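For context, here is a minimal sketch of the standard elliptical bonus that such methods approximate (the feature interface and constants are assumptions); maintaining the full inverse covariance is what becomes costly at scale and motivates cheaper approximations.

```python
# Sketch of the classic elliptical exploration bonus b(x) = beta * sqrt(x^T Sigma^{-1} x),
# with Sigma = lam * I + sum of outer products of visited feature vectors (assumed setup).
import numpy as np

class EllipticalBonus:
    def __init__(self, dim, lam=1.0, beta=1.0):
        self.sigma_inv = np.eye(dim) / lam
        self.beta = beta

    def update(self, x):
        # Sherman-Morrison rank-one update of Sigma^{-1} after observing feature x.
        sx = self.sigma_inv @ x
        self.sigma_inv -= np.outer(sx, sx) / (1.0 + x @ sx)

    def bonus(self, x):
        return self.beta * np.sqrt(x @ self.sigma_inv @ x)
```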
arXiv Detail & Related papers (2021-10-21T15:25:15Z) - Reward-Free Model-Based Reinforcement Learning with Linear Function
Approximation [92.99933928528797]
We study model-based reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs).
In the planning phase, the agent is given a specific reward function and uses samples collected from the exploration phase to learn a good policy.
We show that to obtain an $\epsilon$-optimal policy for an arbitrary reward function, UCRL-RFE needs to sample at most $\tilde{O}(H^4 d(H + d)\epsilon^{-2})$ episodes.
arXiv Detail & Related papers (2021-10-12T23:03:58Z) - Provably Efficient Safe Exploration via Primal-Dual Policy Optimization [105.7510838453122]
We study the Safe Reinforcement Learning (SRL) problem using the Constrained Markov Decision Process (CMDP) formulation.
We present a provably efficient online policy optimization algorithm for CMDPs with safe exploration in the function approximation setting.
arXiv Detail & Related papers (2020-03-01T17:47:03Z) - Reward-Free Exploration for Reinforcement Learning [82.3300753751066]
We propose a new "reward-free RL" framework to isolate the challenges of exploration.
We give an efficient algorithm that conducts $\tilde{\mathcal{O}}(S^2 A\,\mathrm{poly}(H)/\epsilon^2)$ episodes of exploration.
We also give a nearly matching $\Omega(S^2 A H^2/\epsilon^2)$ lower bound, demonstrating the near-optimality of our algorithm in this setting.
arXiv Detail & Related papers (2020-02-07T14:03:38Z)