Descent-Guided Policy Gradient for Scalable Cooperative Multi-Agent Learning
- URL: http://arxiv.org/abs/2602.20078v1
- Date: Mon, 23 Feb 2026 17:45:08 GMT
- Title: Descent-Guided Policy Gradient for Scalable Cooperative Multi-Agent Learning
- Authors: Shan Yang, Yang Liu
- Abstract summary: Descent-Guided Policy Gradient (DG-PG) is a framework that constructs noise-free per-agent guidance gradients. We prove that DG-PG reduces gradient variance from $\Theta(N)$ to $\mathcal{O}(1)$, preserves the equilibria of the cooperative game, and achieves agent-independent sample complexity.
- Score: 14.185814237633958
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scaling cooperative multi-agent reinforcement learning (MARL) is fundamentally limited by cross-agent noise: when agents share a common reward, the actions of all $N$ agents jointly determine each agent's learning signal, so cross-agent noise grows with $N$. In the policy gradient setting, per-agent gradient estimate variance scales as $\Theta(N)$, yielding sample complexity $\mathcal{O}(N/\varepsilon)$. We observe that many domains -- cloud computing, transportation, power systems -- have differentiable analytical models that prescribe efficient system states. In this work, we propose Descent-Guided Policy Gradient (DG-PG), a framework that constructs noise-free per-agent guidance gradients from these analytical models, decoupling each agent's gradient from the actions of all others. We prove that DG-PG reduces gradient variance from $\Theta(N)$ to $\mathcal{O}(1)$, preserves the equilibria of the cooperative game, and achieves agent-independent sample complexity $\mathcal{O}(1/\varepsilon)$. On a heterogeneous cloud scheduling task with up to 200 agents, DG-PG converges within 10 episodes at every tested scale -- from $N=5$ to $N=200$ -- directly confirming the predicted scale-invariant complexity, while MAPPO and IPPO fail to converge under identical architectures.
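The contrast the abstract draws is easiest to see on a toy shared-reward objective. The sketch below is a minimal illustration, not the paper's construction: the quadratic team reward, the Gaussian policies, and the hand-written analytical model are assumptions introduced here. It compares the per-agent score-function (REINFORCE) gradient, whose variance grows roughly linearly with $N$ because the shared return mixes every agent's action noise, against a decoupled guidance gradient taken through a differentiable system model, which is noise-free for a fixed policy.

```python
# Toy comparison, not the paper's method: per-agent REINFORCE gradients under a
# shared team reward versus a noise-free "guidance" gradient obtained by
# differentiating an assumed analytical system model w.r.t. each agent's own action.
import numpy as np

rng = np.random.default_rng(0)

def reinforce_gradients(theta, sigma=0.1, batch=2000):
    """Per-agent score-function gradient estimates under a shared team reward."""
    eps = rng.standard_normal((batch, theta.shape[0]))
    actions = theta + sigma * eps              # Gaussian policies with means theta_i
    R = -np.sum(actions ** 2, axis=1)          # team reward depends on all N actions
    baseline = R.mean()                        # standard variance-reduction baseline
    score = (actions - theta) / sigma ** 2     # grad_theta_i log N(a_i; theta_i, sigma^2)
    return (R - baseline)[:, None] * score     # (batch, N) per-agent gradient estimates

def guidance_gradient(theta):
    """Decoupled gradient through the toy analytical model: d(-a_i^2)/da_i at a_i = theta_i."""
    return -2.0 * theta

for N in (5, 50, 200):
    theta = np.ones(N)
    g_pg = reinforce_gradients(theta)
    g_dg = guidance_gradient(theta)            # same vector every time: zero estimator variance
    print(f"N={N:3d}  REINFORCE var (agent 0): {g_pg[:, 0].var():7.1f}   "
          f"guidance grad (agent 0): {g_dg[0]:+.1f} (variance 0)")
```

The printed REINFORCE variance grows roughly linearly with $N$, while the guidance gradient is the same vector on every evaluation, mirroring the $\Theta(N)$ versus $\mathcal{O}(1)$ contrast claimed above.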
Related papers
- Learning Approximate Nash Equilibria in Cooperative Multi-Agent Reinforcement Learning via Mean-Field Subsampling [3.396870608435494]
We study a cooperative Markov game with a global agent and $n$ homogeneous local agents in a communication-constrained regime. We prove that these approximate best-response dynamics converge to an $\widetilde{O}(1/\sqrt{k})$-approximate Nash Equilibrium.
arXiv Detail & Related papers (2026-03-04T06:14:24Z) - Graphon Mean-Field Subsampling for Cooperative Heterogeneous Multi-Agent Reinforcement Learning [19.98996237281175]
We introduce $\texttt{GMFS}$, a $\textbf{G}$raphon $\textbf{M}$ean-$\textbf{F}$ield $\textbf{S}$ubsampling framework for scalable cooperative MARL with heterogeneous agent interactions. By subsampling agents according to interaction strength, we approximate the graphon-weighted mean-field and learn a policy with reduced sample complexity. We verify our theory with numerical simulations in robotic coordination, showing that $\texttt{GMFS}$ achieves near-optimal performance.
arXiv Detail & Related papers (2026-02-18T05:34:07Z) - Phase Transition for Budgeted Multi-Agent Synergy [41.486076708302456]
Multi-agent systems can improve reliability, yet under a fixed inference budget they often help, saturate, or even collapse. We develop a minimal and calibratable theory that predicts these regimes from three binding constraints of modern agent stacks.
arXiv Detail & Related papers (2026-01-24T05:32:50Z) - Towards a Science of Scaling Agent Systems [79.64446272302287]
We formalize a definition for agent evaluation and characterize scaling laws as the interplay between agent quantity, coordination structure, model capability, and task properties. We derive a predictive model using coordination metrics, with cross-validated $R^2$, enabling prediction on unseen task domains. We identify, among other effects, (1) a tool-coordination trade-off: under fixed computational budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead, and (2) a capability saturation: coordination yields diminishing or negative returns once single-agent baselines exceed 45%.
arXiv Detail & Related papers (2025-12-09T06:52:21Z) - Near-Optimal Online Learning for Multi-Agent Submodular Coordination: Tight Approximation and Communication Efficiency [52.60557300927007]
We present an $\textbf{MA-OSMA}$ algorithm to transform the discrete submodular problem into a continuous optimization. We also introduce a projection-free $\textbf{MA-OSEA}$ algorithm, which effectively utilizes the KL divergence by mixing a uniform distribution. Our algorithms significantly improve on the $(\frac{1}{1+c})$-approximation provided by the state-of-the-art OSG algorithm.
arXiv Detail & Related papers (2025-02-07T15:57:56Z) - Mean-Field Sampling for Cooperative Multi-Agent Reinforcement Learning [8.400105595501158]
We propose a new $\texttt{SUBSAMPLE-MFQ}$ ($\textbf{Subsample}$-$\textbf{M}$ean-$\textbf{F}$ield-$\textbf{Q}$-learning) algorithm and a decentralized randomized policy for a system with $n$ agents. We prove that this learned policy converges to the optimal policy at a rate of $\widetilde{O}(1/\sqrt{k})$ as the number of subsampled agents $k$ increases (a minimal subsampling sketch follows the related-papers list below).
arXiv Detail & Related papers (2024-12-01T03:45:17Z) - Asynchronous Federated Reinforcement Learning with Policy Gradient Updates: Algorithm Design and Convergence Analysis [41.75366066380951]
We propose a novel asynchronous federated reinforcement learning framework termed AFedPG, which constructs a global model through collaboration among $N$ agents. We analyze the theoretical global convergence bound of AFedPG, and characterize the advantage of the proposed algorithm in terms of both sample complexity and time complexity. We empirically verify the improved performance of AFedPG in four widely used MuJoCo environments with varying numbers of agents.
arXiv Detail & Related papers (2024-04-09T04:21:13Z) - Refined Sample Complexity for Markov Games with Independent Linear Function Approximation [49.5660193419984]
Markov Games (MGs) are an important model for Multi-Agent Reinforcement Learning (MARL).
This paper first refines the AVLPR framework of Wang et al. (2023), with the insight of designing a pessimistic estimate of the sub-optimality gap.
We give the first algorithm that tackles the curse of multi-agents, attains the optimal $O(T^{-1/2})$ convergence rate, and avoids $\text{poly}(A_{\max})$ dependency simultaneously.
arXiv Detail & Related papers (2024-02-11T01:51:15Z) - Adaptive, Doubly Optimal No-Regret Learning in Strongly Monotone and Exp-Concave Games with Gradient Feedback [75.29048190099523]
Online gradient descent (OGD) is well known to be doubly optimal under strong convexity or monotonicity assumptions.
In this paper, we design a fully adaptive OGD algorithm, $\textsf{AdaOGD}$, that does not require a priori knowledge of these parameters.
arXiv Detail & Related papers (2023-10-21T18:38:13Z) - Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs [60.40452803295326]
We propose a new reward-free algorithm for learning linear Markov decision processes (MDPs).
At the core of our algorithm is uncertainty-weighted value-targeted regression with exploration-driven pseudo-reward.
We show that our algorithm only needs to explore $\widetilde{O}(d^2\varepsilon^{-2})$ episodes to find an $\varepsilon$-optimal policy.
arXiv Detail & Related papers (2023-03-17T17:53:28Z) - Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity [67.02490430380415]
We show that model-based MARL achieves a sample complexity of $\widetilde{O}(|S||B|(1-\gamma)^{-3}\epsilon^{-2})$ for finding the Nash equilibrium (NE) value up to some $\epsilon$ error.
We also show that such a sample bound is minimax-optimal (up to logarithmic factors) if the algorithm is reward-agnostic, where the algorithm queries state transition samples without reward knowledge.
arXiv Detail & Related papers (2020-07-15T03:25:24Z)
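Several of the entries above (the mean-field subsampling, $\texttt{GMFS}$, and $\texttt{SUBSAMPLE-MFQ}$ papers) share one device: instead of conditioning each update on all $n$ local agents, sample $k$ of them, possibly in proportion to interaction strength, and act on the resulting empirical mean-field. The snippet below is only a rough illustration of that estimation step under assumptions made here (synthetic states and uniform-random interaction weights); it does not reproduce any of those papers' actual algorithms.

```python
# Rough illustration of interaction-weighted mean-field subsampling; the data
# and weighting scheme are made up for this example.
import numpy as np

rng = np.random.default_rng(1)

def subsampled_mean_field(states, weights, k):
    """Estimate an interaction-weighted mean-field from k sampled agents.

    states  : (n, d) array of local-agent states
    weights : (n,) nonnegative interaction strengths (e.g., graphon values)
    k       : number of agents to subsample
    """
    p = weights / weights.sum()
    idx = rng.choice(len(states), size=k, replace=True, p=p)
    return states[idx].mean(axis=0)   # unbiased estimate of the weighted mean-field

n, d, k = 1000, 4, 32
states = rng.standard_normal((n, d))
weights = rng.uniform(0.1, 1.0, size=n)
exact = (weights[:, None] * states).sum(axis=0) / weights.sum()
approx = subsampled_mean_field(states, weights, k)
# The estimation error shrinks on the order of 1/sqrt(k), matching the
# O(1/sqrt(k)) rates quoted for the subsampling methods above.
print("k =", k, " ||exact - approx|| =", round(float(np.linalg.norm(exact - approx)), 3))
```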
This list is automatically generated from the titles and abstracts of the papers in this site.