Related papers: Stronger Together: On-Policy Reinforcement Learning for Collaborative LLMs

Stronger Together: On-Policy Reinforcement Learning for Collaborative LLMs

URL: http://arxiv.org/abs/2510.11062v2
Date: Tue, 14 Oct 2025 05:12:45 GMT
Title: Stronger Together: On-Policy Reinforcement Learning for Collaborative LLMs
Authors: Yujie Zhao, Lanxiang Hu, Yang Wang, Minmin Hou, Hao Zhang, Ke Ding, Jishen Zhao,
Abstract summary: Multi-agent systems (MAS) and reinforcement learning (RL) are widely used to enhance the agentic capabilities of large language models (LLMs)<n>Applying on-policy RL to MAS remains underexplored and presents unique challenges.<n>We propose AT-GRPO, which includes (i) an agent- and turn-wise grouped RL algorithm tailored to MAS and (ii) a training system that supports both single- and multi-policy regimes.
Score: 20.084201133669534
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multi-agent systems (MAS) and reinforcement learning (RL) are widely used to enhance the agentic capabilities of large language models (LLMs). MAS improves task performance through role-based orchestration, while RL uses environmental rewards to learn stronger policies, such as GRPO-style optimization. However, applying on-policy RL to MAS remains underexplored and presents unique challenges. Algorithmically, standard GRPO grouping assumptions break down because prompts vary by role and by turn. System-wise, the training stack must support MAS-workflow rollouts and on-policy updates for both single-policy and multi-policy models. We propose AT-GRPO, which includes (i) an agent- and turn-wise grouped RL algorithm tailored to MAS and (ii) a training system that supports both single- and multi-policy regimes. Across game, planning, coding, and math tasks, AT-GRPO delivers substantial gains. On long-horizon planning, it increases accuracy from a 14.0 to 47.0 percent single-agent RL baseline to 96.0 to 99.5 percent. It also improves reasoning performance, with average gains of 3.87 to 7.62 percent on coding tasks and 9.0 to 17.93 percent on math. Code and environments are available at: https://github.com/pettingllms-ai/PettingLLMs.

Related papers

iGRPO: Self-Feedback-Driven LLM Reasoning [88.83313431248473]
Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions.<n>We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts.<n>Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models.
arXiv Detail & Related papers (2026-02-09T18:45:11Z)
Dr. MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems [20.971694319263353]
We propose Dr. MAS, a simple and stable RL training recipe for multi-agent LLM systems.<n>Dr. MAS uses agent-wise remedy: normalizing advantages per agent using each agent's own reward statistics.<n>Dr. MAS provides an end-to-end RL training framework for multi-agent LLM systems.
arXiv Detail & Related papers (2026-02-09T16:13:39Z)
RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System [52.3348044324205]
We propose RLAnything, a reinforcement learning framework that forges environment, policy, and reward models through closed-loop optimization.<n>Specifically, the policy is trained with integrated feedback from step-wise and outcome signals.<n>Our theory-motivated automatic environment adaptation improves training for both the reward and policy models.
arXiv Detail & Related papers (2026-02-02T18:59:04Z)
Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO [24.532870400949424]
Current training methods train a unified large language model for all agents in the system.<n>This may limit the performances due to different underlying distributions for different agents.<n>We propose M-GRPO, a hierarchical extension of Group Relative Policy Optimization for vertical Multi-agent systems.
arXiv Detail & Related papers (2025-11-17T12:06:30Z)
Multi-Agent Tool-Integrated Policy Optimization [67.12841355267678]
Large language models (LLMs) increasingly rely on multi-turn tool-integrated planning for knowledge-intensive and complex reasoning tasks.<n>Existing implementations typically rely on a single agent, but they suffer from limited context length and noisy tool responses.<n>No existing methods support effective reinforcement learning post-training of tool-integrated multi-agent frameworks.
arXiv Detail & Related papers (2025-10-06T10:44:04Z)
AgentRL: Scaling Agentic Reinforcement Learning with a Multi-Turn, Multi-Task Framework [76.96794548655292]
Large language models (LLMs) have sparked growing interest in building generalist agents that can learn through online interactions.<n>Applying reinforcement learning (RL) to train LLM agents in multi-turn, multi-task settings remains challenging due to lack of scalable infrastructure and stable training algorithms.<n>We present the AgentRL framework for scalable multi-turn, multi-task agentic RL training.
arXiv Detail & Related papers (2025-10-05T13:40:01Z)
From Supervision to Exploration: What Does Protein Language Model Learn During Reinforcement Learning? [76.288870982181]
Protein language models (PLMs) have advanced computational protein science through large-scale pretraining and scalable architectures.<n> reinforcement learning (RL) has broadened exploration and enabled precise multi-objective optimization in protein design.<n>We ask if RL improves sampling efficiency and, more importantly, if it reveals capabilities not captured by supervised learning.
arXiv Detail & Related papers (2025-10-02T01:31:10Z)
Heterogeneous Group-Based Reinforcement Learning for LLM-based Multi-Agent Systems [25.882461853973897]
We propose Multi-Agent Heterogeneous Group Policy Optimization (MHGPO), which guides policy updates by estimating relative reward advantages.<n>MHGPO eliminates the need for Critic networks, enhancing stability and reducing computational overhead.<n>We also introduce three group rollout sampling strategies that trade off between efficiency and effectiveness.
arXiv Detail & Related papers (2025-06-03T10:17:19Z)
Multiple Weaks Win Single Strong: Large Language Models Ensemble Weak Reinforcement Learning Agents into a Supreme One [28.264011412168347]
Model ensemble is a useful approach in reinforcement learning (RL) for training effective agents.<n>We propose LLM-Ens, a novel approach that enhances RL model ensemble with task-specific semantic understandings.
arXiv Detail & Related papers (2025-05-21T09:35:43Z)
SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks [110.20297293596005]
Large language model (LLM) agents need to perform multi-turn interactions in real-world tasks.<n>Existing multi-turn RL algorithms for optimizing LLM agents fail to perform effective credit assignment over multiple turns while leveraging the generalization capabilities of LLMs.<n>We propose a novel RL algorithm, SWEET-RL, that uses a carefully designed optimization objective to train a critic model with access to additional training-time information.<n>Our experiments demonstrate that SWEET-RL achieves a 6% absolute improvement in success and win rates on ColBench compared to other state-of-the-art multi-turn RL algorithms.
arXiv Detail & Related papers (2025-03-19T17:55:08Z)
MALT: Improving Reasoning with Multi-Agent LLM Training [67.76186488361685]
MALT (Multi-Agent LLM Training) is a novel post-training strategy that divides the reasoning process into generation, verification, and refinement steps.<n>On MATH, GSM8K, and CSQA, MALT surpasses the same baseline LLM with a relative improvement of 15.66%, 7.42%, and 9.40% respectively.
arXiv Detail & Related papers (2024-12-02T19:30:36Z)
ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL [80.10358123795946]
We develop a framework for building multi-turn RL algorithms for fine-tuning large language models. Our framework adopts a hierarchical RL approach and runs two RL algorithms in parallel. Empirically, we find that ArCHer significantly improves efficiency and performance on agent tasks.
arXiv Detail & Related papers (2024-02-29T18:45:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.