Descent-Guided Policy Gradient for Scalable Cooperative Multi-Agent Learning
- URL: http://arxiv.org/abs/2602.20078v1
- Date: Mon, 23 Feb 2026 17:45:08 GMT
- Title: Descent-Guided Policy Gradient for Scalable Cooperative Multi-Agent Learning
- Authors: Shan Yang, Yang Liu
- Abstract summary: Descent-Guided Policy Gradient (DG-PG) is a framework that constructs noise-free per-agent guidance gradients. We prove that DG-PG reduces gradient variance from $\Theta(N)$ to $\mathcal{O}(1)$, preserves the equilibria of the cooperative game, and achieves agent-independent sample complexity.
- Score: 14.185814237633958
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scaling cooperative multi-agent reinforcement learning (MARL) is fundamentally limited by cross-agent noise: when agents share a common reward, the actions of all $N$ agents jointly determine each agent's learning signal, so cross-agent noise grows with $N$. In the policy gradient setting, per-agent gradient estimate variance scales as $\Theta(N)$, yielding sample complexity $\mathcal{O}(N/\varepsilon)$. We observe that many domains -- cloud computing, transportation, power systems -- have differentiable analytical models that prescribe efficient system states. In this work, we propose Descent-Guided Policy Gradient (DG-PG), a framework that constructs noise-free per-agent guidance gradients from these analytical models, decoupling each agent's gradient from the actions of all others. We prove that DG-PG reduces gradient variance from $\Theta(N)$ to $\mathcal{O}(1)$, preserves the equilibria of the cooperative game, and achieves agent-independent sample complexity $\mathcal{O}(1/\varepsilon)$. On a heterogeneous cloud scheduling task with up to 200 agents, DG-PG converges within 10 episodes at every tested scale -- from $N=5$ to $N=200$ -- directly confirming the predicted scale-invariant complexity, while MAPPO and IPPO fail to converge under identical architectures.
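The contrast the abstract draws is easiest to see on a toy shared-reward objective. The sketch below is a minimal illustration, not the paper's construction: the quadratic team reward, the Gaussian policies, and the hand-written analytical model are assumptions introduced here. It compares the per-agent score-function (REINFORCE) gradient, whose variance grows roughly linearly with $N$ because the shared return mixes every agent's action noise, against a decoupled guidance gradient taken through a differentiable system model, which is noise-free for a fixed policy.

```python
# Toy comparison, not the paper's method: per-agent REINFORCE gradients under a
# shared team reward versus a noise-free "guidance" gradient obtained by
# differentiating an assumed analytical system model w.r.t. each agent's own action.
import numpy as np

rng = np.random.default_rng(0)

def reinforce_gradients(theta, sigma=0.1, batch=2000):
    """Per-agent score-function gradient estimates under a shared team reward."""
    eps = rng.standard_normal((batch, theta.shape[0]))
    actions = theta + sigma * eps              # Gaussian policies with means theta_i
    R = -np.sum(actions ** 2, axis=1)          # team reward depends on all N actions
    baseline = R.mean()                        # standard variance-reduction baseline
    score = (actions - theta) / sigma ** 2     # grad_theta_i log N(a_i; theta_i, sigma^2)
    return (R - baseline)[:, None] * score     # (batch, N) per-agent gradient estimates

def guidance_gradient(theta):
    """Decoupled gradient through the toy analytical model: d(-a_i^2)/da_i at a_i = theta_i."""
    return -2.0 * theta

for N in (5, 50, 200):
    theta = np.ones(N)
    g_pg = reinforce_gradients(theta)
    g_dg = guidance_gradient(theta)            # same vector every time: zero estimator variance
    print(f"N={N:3d}  REINFORCE var (agent 0): {g_pg[:, 0].var():7.1f}   "
          f"guidance grad (agent 0): {g_dg[0]:+.1f} (variance 0)")
```

The printed REINFORCE variance grows roughly linearly with $N$, while the guidance gradient is the same vector on every evaluation, mirroring the $\Theta(N)$ versus $\mathcal{O}(1)$ contrast claimed above.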
Related papers
- Learning Approximate Nash Equilibria in Cooperative Multi-Agent Reinforcement Learning via Mean-Field Subsampling [3.396870608435494]
We study a cooperative Markov game with a global agent and $n$ homogeneous local agents in a communication-constrained regime. We prove that these approximate best-response dynamics converge to an $\widetilde{O}(1/\sqrt{k})$-approximate Nash Equilibrium.
arXiv Detail & Related papers (2026-03-04T06:14:24Z) - Graphon Mean-Field Subsampling for Cooperative Heterogeneous Multi-Agent Reinforcement Learning [19.98996237281175]
We introduce $\texttt{GMFS}$, a $\textbf{G}$raphon $\textbf{M}$ean-$\textbf{F}$ield $\textbf{S}$ubsampling framework for scalable cooperative MARL with heterogeneous agent interactions. By subsampling agents according to interaction strength, we approximate the graphon-weighted mean-field and learn a policy with reduced sample complexity. We verify our theory with numerical simulations in robotic coordination, showing that $\texttt{GMFS}$ achieves near-optimal performance.
arXiv Detail & Related papers (2026-02-18T05:34:07Z) - Phase Transition for Budgeted Multi-Agent Synergy [41.486076708302456]
Multi-agent systems can improve reliability, yet under a fixed inference budget they often help, saturate, or even collapse. We develop a minimal and calibratable theory that predicts these regimes from three binding constraints of modern agent stacks.
arXiv Detail & Related papers (2026-01-24T05:32:50Z) - Towards a Science of Scaling Agent Systems [79.64446272302287]
We formalize a definition for agent evaluation and characterize scaling laws as the interplay between agent quantity, coordination structure, model capability, and task properties. We derive a predictive model using coordination metrics, with cross-validated $R^2$, enabling prediction on unseen task domains. We identify, among other effects, (1) a tool-coordination trade-off: under fixed computational budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead, and (2) a capability saturation: coordination yields diminishing or negative returns once single-agent baselines exceed 45%.
arXiv Detail & Related papers (2025-12-09T06:52:21Z) - Near-Optimal Online Learning for Multi-Agent Submodular Coordination: Tight Approximation and Communication Efficiency [52.60557300927007]
We present an $\textbf{MA-OSMA}$ algorithm to transform the discrete submodular problem into a continuous optimization. We also introduce a projection-free $\textbf{MA-OSEA}$ algorithm, which effectively utilizes the KL divergence by mixing a uniform distribution. Our algorithms significantly improve on the $(\frac{1}{1+c})$-approximation provided by the state-of-the-art OSG algorithm.
arXiv Detail & Related papers (2025-02-07T15:57:56Z) - Mean-Field Sampling for Cooperative Multi-Agent Reinforcement Learning [8.400105595501158]
We propose a new $\texttt{SUBSAMPLE-MFQ}$ ($\textbf{Subsample}$-$\textbf{M}$ean-$\textbf{F}$ield-$\textbf{Q}$-learning) algorithm and a decentralized randomized policy for a system with $n$ agents. We prove that this learned policy converges to the optimal policy at a rate of $\widetilde{O}(1/\sqrt{k})$ as the number of subsampled agents $k$ increases (a minimal subsampling sketch follows the related-papers list below).
arXiv Detail & Related papers (2024-12-01T03:45:17Z) - Asynchronous Federated Reinforcement Learning with Policy Gradient Updates: Algorithm Design and Convergence Analysis [41.75366066380951]
We propose a novel asynchronous federated reinforcement learning framework termed AFedPG, which constructs a global model through collaboration among $N$ agents. We analyze the theoretical global convergence bound of AFedPG, and characterize the advantage of the proposed algorithm in terms of both sample complexity and time complexity. We empirically verify the improved performance of AFedPG in four widely used MuJoCo environments with varying numbers of agents.
arXiv Detail & Related papers (2024-04-09T04:21:13Z) - Refined Sample Complexity for Markov Games with Independent Linear Function Approximation [49.5660193419984]
Markov Games (MGs) are an important model for Multi-Agent Reinforcement Learning (MARL).
This paper first refines the AVLPR framework of Wang et al. (2023), with the insight of designing a pessimistic estimate of the sub-optimality gap.
We give the first algorithm that tackles the curse of multi-agents, attains the optimal $O(T^{-1/2})$ convergence rate, and avoids $\text{poly}(A_{\max})$ dependency simultaneously.
arXiv Detail & Related papers (2024-02-11T01:51:15Z) - Adaptive, Doubly Optimal No-Regret Learning in Strongly Monotone and Exp-Concave Games with Gradient Feedback [75.29048190099523]
Online gradient descent (OGD) is well known to be doubly optimal under strong convexity or monotonicity assumptions.
In this paper, we design a fully adaptive OGD algorithm, $\textsf{AdaOGD}$, that does not require a priori knowledge of these parameters.
arXiv Detail & Related papers (2023-10-21T18:38:13Z) - Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs [60.40452803295326]
We propose a new reward-free algorithm for learning linear Markov decision processes (MDPs).
At the core of our algorithm is uncertainty-weighted value-targeted regression with exploration-driven pseudo-reward.
We show that our algorithm only needs to explore $\widetilde{O}(d^2\varepsilon^{-2})$ episodes to find an $\varepsilon$-optimal policy.
arXiv Detail & Related papers (2023-03-17T17:53:28Z) - Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $\widetilde{O}(|S||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error.
We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z)
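Several of the entries above (the mean-field subsampling, $\texttt{GMFS}$, and $\texttt{SUBSAMPLE-MFQ}$ papers) share one device: instead of conditioning each update on all $n$ local agents, sample $k$ of them, possibly in proportion to interaction strength, and act on the resulting empirical mean-field. The snippet below is only a rough illustration of that estimation step under assumptions made here (synthetic states and uniform-random interaction weights); it does not reproduce any of those papers' actual algorithms.

```python
# Rough illustration of interaction-weighted mean-field subsampling; the data
# and weighting scheme are made up for this example.
import numpy as np

rng = np.random.default_rng(1)

def subsampled_mean_field(states, weights, k):
    """Estimate an interaction-weighted mean-field from k sampled agents.

    states  : (n, d) array of local-agent states
    weights : (n,) nonnegative interaction strengths (e.g., graphon values)
    k       : number of agents to subsample
    """
    p = weights / weights.sum()
    idx = rng.choice(len(states), size=k, replace=True, p=p)
    return states[idx].mean(axis=0)   # unbiased estimate of the weighted mean-field

n, d, k = 1000, 4, 32
states = rng.standard_normal((n, d))
weights = rng.uniform(0.1, 1.0, size=n)
exact = (weights[:, None] * states).sum(axis=0) / weights.sum()
approx = subsampled_mean_field(states, weights, k)
# The estimation error shrinks on the order of 1/sqrt(k), matching the
# O(1/sqrt(k)) rates quoted for the subsampling methods above.
print("k =", k, " ||exact - approx|| =", round(float(np.linalg.norm(exact - approx)), 3))
```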
This list is automatically generated from the titles and abstracts of the papers in this site.