Graph-GRPO: Stabilizing Multi-Agent Topology Learning via Group Relative Policy Optimization
- URL: http://arxiv.org/abs/2603.02701v1
- Date: Tue, 03 Mar 2026 07:45:40 GMT
- Title: Graph-GRPO: Stabilizing Multi-Agent Topology Learning via Group Relative Policy Optimization
- Authors: Yueyang Cang, Xiaoteng Zhang, Erlu Zhao, Zehua Ji, Yuhang Liu, Yuchen He, Zhiyuan Ning, Chen Yijun, Wenge Que, Li Shi
- Abstract summary: We propose Graph-GRPO, a novel topology optimization framework that integrates Group Relative Policy Optimization. By normalizing rewards across the sampled group, our method effectively mitigates the noise derived from task difficulty variance and enables fine-grained credit assignment.
- Score: 7.961090665261694
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Optimizing communication topology is fundamental to the efficiency and effectiveness of Large Language Model (LLM)-based Multi-Agent Systems (MAS). While recent approaches utilize reinforcement learning to dynamically construct task-specific graphs, they typically rely on single-sample policy gradients with absolute rewards (e.g., binary correctness). This paradigm suffers from severe gradient variance and the credit assignment problem: simple queries yield non-informative positive rewards for suboptimal structures, while difficult queries often result in failures that provide no learning signal. To address these challenges, we propose Graph-GRPO, a novel topology optimization framework that integrates Group Relative Policy Optimization. Instead of evaluating a single topology in isolation, Graph-GRPO samples a group of diverse communication graphs for each query and computes the advantage of specific edges based on their relative performance within the group. By normalizing rewards across the sampled group, our method effectively mitigates the noise derived from task difficulty variance and enables fine-grained credit assignment. Extensive experiments on reasoning and code generation benchmarks demonstrate that Graph-GRPO significantly outperforms state-of-the-art baselines, achieving superior training stability and identifying critical communication pathways previously obscured by reward noise.
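The group-relative idea described in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that the topology policy is an independent Bernoulli distribution per directed edge, parameterized by a logit matrix; the abstract does not specify the parameterization, and the paper's actual edge-level advantage may differ. All function names here (`group_relative_advantages`, `sample_graph`, `edge_gradients`) are hypothetical.

```python
import math
import random
import statistics


def sigmoid(x):
    """Map an edge logit to an edge probability."""
    return 1.0 / (1.0 + math.exp(-x))


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def sample_graph(edge_logits, rng):
    """Sample a binary adjacency matrix (assumed independent Bernoulli edges)."""
    n = len(edge_logits)
    return [
        [
            1 if i != j and rng.random() < sigmoid(edge_logits[i][j]) else 0
            for j in range(n)
        ]
        for i in range(n)
    ]


def edge_gradients(edge_logits, graphs, rewards):
    """Policy-gradient estimate per edge logit for one query.

    Each sampled graph is weighted by its group-relative advantage, so a
    group in which every graph gets the same reward contributes nothing.
    """
    advantages = group_relative_advantages(rewards)
    n = len(edge_logits)
    grads = [[0.0] * n for _ in range(n)]
    for graph, adv in zip(graphs, advantages):
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue  # no self-loops in this sketch
                p = sigmoid(edge_logits[i][j])
                # d/d(logit) of log Bernoulli(e; p) is (e - p)
                grads[i][j] += adv * (graph[i][j] - p) / len(graphs)
    return grads


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [[0.0] * 3 for _ in range(3)]  # 3 agents, all edges at p = 0.5
    group = [sample_graph(logits, rng) for _ in range(4)]
    # Hypothetical binary rewards for the four sampled topologies.
    rewards = [1.0, 0.0, 1.0, 0.0]
    print(edge_gradients(logits, group, rewards))
```

In this sketch, an easy query on which every sampled graph succeeds yields all-zero advantages, so suboptimal structures are not reinforced; an edge that appears mostly in the higher-reward graphs of a mixed group receives a positive gradient.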
Related papers
- MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization [56.074760766965085]
Group-Relative Policy Optimization has emerged as an efficient paradigm for aligning Large Language Models (LLMs). We propose MAESTRO, which treats reward scalarization as a dynamic latent policy, leveraging the model's terminal hidden states as a semantic bottleneck. We formulate this as a contextual bandit problem within a bi-level optimization framework, where a lightweight Conductor network co-evolves with the policy by utilizing group-relative advantages as a meta-reward signal.
arXiv Detail & Related papers (2026-01-12T05:02:48Z)
- GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization [133.27496265096445]
We show how to apply Group Relative Policy Optimization under a multi-reward setting without examining its suitability. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method to resolve these issues. GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
arXiv Detail & Related papers (2026-01-08T18:59:24Z)
- DaGRPO: Rectifying Gradient Conflict in Reasoning via Distinctiveness-Aware Group Relative Policy Optimization [20.66452395111739]
We propose Distinctiveness-aware Group Relative Policy Optimization (DaGRPO). DaGRPO incorporates two core mechanisms: (1) Sequence-level Gradient Rectification, which utilizes fine-grained scoring to dynamically mask sample pairs with low distinctiveness; and (2) Off-policy Data Augmentation, which introduces high-quality anchors to recover training signals for challenging tasks. In-depth analysis confirms that DaGRPO effectively mitigates gradient explosion and accelerates the emergence of long-chain reasoning capabilities.
arXiv Detail & Related papers (2025-12-06T07:51:36Z)
- Repurposing Synthetic Data for Fine-grained Search Agent Supervision [81.95597592711688]
LLM-based search agents are increasingly trained on entity-centric synthetic data. Prevailing training methods discard this rich entity information, relying instead on sparse, outcome-based rewards. We introduce Entity-aware Group Relative Policy Optimization (E-GRPO), a novel framework that formulates a dense entity-aware reward function.
arXiv Detail & Related papers (2025-10-28T17:50:40Z)
- Graph Foundation Models: Bridging Language Model Paradigms and Graph Optimization [4.502753947356616]
We introduce the Graph Foundation Model (GFM), the first framework capable of solving all distance-based optimization problems on graph structures. GFM internalizes the graph's complex topological and neural rules, where the connectivity of the structure itself can be treated as the supervisory signal. Our work establishes a new paradigm of adapting the pretrain-transfer framework to graph optimization, opening the door for applying foundation model innovations to Operations Research.
arXiv Detail & Related papers (2025-09-29T04:05:48Z)
- COPO: Consistency-Aware Policy Optimization [17.328515578426227]
Reinforcement learning has significantly enhanced the reasoning capabilities of Large Language Models (LLMs) in complex problem-solving tasks. Recently, the introduction of DeepSeek R1 has inspired a surge of interest in leveraging rule-based rewards as a low-cost alternative for computing advantage functions and guiding policy optimization. We propose a consistency-aware policy optimization framework that introduces a structured global reward based on outcome consistency.
arXiv Detail & Related papers (2025-08-06T07:05:18Z)
- Aggregation-aware MLP: An Unsupervised Approach for Graph Message-passing [10.93155007218297]
"AMLP" is an unsupervised framework that shifts the paradigm from directly crafting aggregation functions to making adaptive aggregation. Our approach consists of two key steps: First, we utilize a graph reconstruction that facilitates high-order grouping effects, and second, we employ a single-layer network to encode varying degrees of heterophily.
arXiv Detail & Related papers (2025-07-27T04:52:55Z)
- DISCO Balances the Scales: Adaptive Domain- and Difficulty-Aware Reinforcement Learning on Imbalanced Data [65.09939942413651]
We propose a principled extension to GRPO that addresses inter-group imbalance with two key innovations. Domain-aware reward scaling counteracts frequency bias by reweighting optimization based on domain prevalence. Difficulty-aware reward scaling leverages prompt-level self-consistency to identify and prioritize uncertain prompts that offer greater learning value.
arXiv Detail & Related papers (2025-05-21T03:43:29Z)
- Adaptive Graph of Thoughts: Test-Time Adaptive Reasoning Unifying Chain, Tree, and Graph Structures [0.0]
We introduce Adaptive Graph of Thoughts (AGoT), a dynamic, graph-based inference framework. AGoT enhances Large Language Models (LLMs) reasoning solely at test time. We validate our approach on diverse benchmarks spanning multi-hop retrieval, scientific reasoning, and mathematical problem-solving.
arXiv Detail & Related papers (2025-02-07T16:54:19Z)
- In-context Demonstration Matters: On Prompt Optimization for Pseudo-Supervision Refinement [71.60563181678323]
Large language models (LLMs) have achieved great success across diverse tasks, and fine-tuning is sometimes needed to further enhance generation quality. To handle these challenges, a direct solution is to generate "high-confidence" data from unsupervised downstream tasks. We propose a novel approach, pseudo-supervised demonstrations aligned prompt optimization (PAPO) algorithm, which jointly refines both the prompt and the overall pseudo-supervision.
arXiv Detail & Related papers (2024-10-04T03:39:28Z)
- Optimal Propagation for Graph Neural Networks [51.08426265813481]
We propose a bi-level optimization approach for learning the optimal graph structure.
We also explore a low-rank approximation model for further reducing the time complexity.
arXiv Detail & Related papers (2022-05-06T03:37:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.