Group-in-Group Policy Optimization for LLM Agent Training
- URL: http://arxiv.org/abs/2505.10978v1
- Date: Fri, 16 May 2025 08:26:59 GMT
- Title: Group-in-Group Policy Optimization for LLM Agent Training
- Authors: Lang Feng, Zhenghai Xue, Tingcong Liu, Bo An
- Abstract summary: Group-in-Group Policy Optimization (GiGPO) is a novel RL algorithm that achieves fine-grained credit assignment for LLM agents. We evaluate GiGPO on two challenging agent benchmarks, ALFWorld and WebShop, using Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct.
- Score: 14.179593951503676
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in group-based reinforcement learning (RL) have driven frontier large language models (LLMs) in single-turn tasks like mathematical reasoning. However, their scalability to long-horizon LLM agent training remains limited. Unlike static tasks, agent-environment interactions unfold over many steps and often yield sparse or delayed rewards, making credit assignment across individual steps significantly more challenging. In this work, we propose Group-in-Group Policy Optimization (GiGPO), a novel RL algorithm that achieves fine-grained credit assignment for LLM agents while preserving the appealing properties of group-based RL: critic-free, low memory, and stable convergence. GiGPO introduces a two-level structure for estimating relative advantage: (i) at the episode level, GiGPO computes macro relative advantages based on groups of complete trajectories; (ii) at the step level, GiGPO introduces an anchor state grouping mechanism that retroactively constructs step-level groups by identifying repeated environment states across trajectories. Actions stemming from the same state are grouped together, enabling micro relative advantage estimation. This hierarchical structure effectively captures both global trajectory quality and local step effectiveness without relying on auxiliary models or additional rollouts. We evaluate GiGPO on two challenging agent benchmarks, ALFWorld and WebShop, using Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct. Crucially, GiGPO delivers fine-grained per-step credit signals and achieves performance gains of >12% on ALFWorld and >9% on WebShop over the GRPO baseline, all while maintaining the same GPU memory overhead, identical LLM rollouts, and incurring little to no additional time cost.
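To make the two-level scheme concrete, the sketch below shows one way the combined advantage could be computed from a group of rollouts. It is an illustration under stated assumptions rather than the paper's implementation: the function name `gigpo_advantages`, the hashable `state_key` used for anchor-state grouping, the undiscounted return-to-go, and the `step_weight` mixing coefficient are all hypothetical choices, and the paper may discount rewards and combine the two levels differently.

```python
import numpy as np
from collections import defaultdict

def gigpo_advantages(trajectories, step_weight=1.0):
    """Hypothetical sketch of GiGPO-style two-level advantage estimation.

    `trajectories` is a group of episodes rolled out from the same task;
    each episode is a list of (state_key, action, reward) steps, where
    `state_key` is a hashable encoding of the environment state.
    """
    # (i) Episode level: group-normalize total returns across the
    # trajectory group, as in GRPO (macro relative advantage).
    returns = np.array([sum(r for _, _, r in traj) for traj in trajectories])
    macro_adv = (returns - returns.mean()) / (returns.std() + 1e-8)

    # (ii) Step level: retroactively group steps that share the same
    # anchor state across trajectories, then normalize return-to-go
    # within each group (micro relative advantage).
    groups = defaultdict(list)  # state_key -> [(traj_idx, step_idx, return-to-go)]
    for i, traj in enumerate(trajectories):
        rtg, togo = 0.0, []
        for _, _, r in reversed(traj):  # undiscounted return-to-go
            rtg += r
            togo.append(rtg)
        togo.reverse()
        for t, (state_key, _, _) in enumerate(traj):
            groups[state_key].append((i, t, togo[t]))

    micro_adv = [np.zeros(len(traj)) for traj in trajectories]
    for members in groups.values():
        vals = np.array([g for _, _, g in members])
        if len(vals) > 1:  # a group of one carries no relative signal
            norm = (vals - vals.mean()) / (vals.std() + 1e-8)
            for (i, t, _), a in zip(members, norm):
                micro_adv[i][t] = a

    # Combine the trajectory-quality and step-effectiveness signals into
    # a per-step advantage for the policy-gradient update.
    return [macro_adv[i] + step_weight * micro_adv[i]
            for i in range(len(trajectories))]
```

The key property this sketch preserves is that the step-level groups are built retroactively from the same trajectory group used for the episode-level estimate, so no critic, auxiliary model, or extra rollouts are needed.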
Related papers
- Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs [77.22973302887435]
Group Relative Policy Optimization (GRPO) has proven to be an effective tool for post-training language models (LMs). We present mmGRPO, a simple multi-module extension of GRPO that groups LM calls by module across rollouts and handles variable-length and interrupted trajectories. We find that mmGRPO, composed with automatic prompt optimization, improves accuracy by 11% on average across classification, many-hop search, and privacy-preserving delegation tasks.
arXiv Detail & Related papers (2025-08-06T17:28:31Z) - Agentic Reinforced Policy Optimization [66.96989268893932]
Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks. Current RL algorithms inadequately balance the models' intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions. We propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents.
arXiv Detail & Related papers (2025-07-26T07:53:11Z) - GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning [106.98018881499362]
We introduce GEPA (Genetic-Pareto), a prompt optimizer that thoroughly incorporates natural language reflection to learn high-level rules from trial and error. GEPA samples system-level trajectories (e.g., reasoning, tool calls, and tool outputs) and reflects on them in natural language to diagnose problems. It can often turn even just a few rollouts into a large quality gain.
arXiv Detail & Related papers (2025-07-25T17:42:32Z) - Heterogeneous Group-Based Reinforcement Learning for LLM-based Multi-Agent Systems [25.882461853973897]
We propose Multi-Agent Heterogeneous Group Policy Optimization (MHGPO), which guides policy updates by estimating relative reward advantages. MHGPO eliminates the need for critic networks, enhancing stability and reducing computational overhead. We also introduce three group rollout sampling strategies that trade off between efficiency and effectiveness.
arXiv Detail & Related papers (2025-06-03T10:17:19Z) - Accelerating RL for LLM Reasoning with Optimal Advantage Regression [52.0792918455501]
We propose a novel two-stage policy optimization framework that directly approximates the optimal advantage function. A*-PO achieves competitive performance across a wide range of mathematical reasoning benchmarks. It reduces training time by up to 2x and peak memory usage by over 30% compared to PPO, GRPO, and REBEL.
arXiv Detail & Related papers (2025-05-27T03:58:50Z) - Token-Efficient RL for LLM Reasoning [0.02488650627593658]
We propose reinforcement learning strategies tailored for reasoning in large language models (LLMs) under strict memory and compute limits. Building on early policy gradient methods with baseline subtraction, we design critic-free methods that operate on a small, informative subset of output tokens. We show that our methods raise accuracy on the SVAMP benchmark from 46% to over 70% and perform strongly on multi-digit multiplication.
arXiv Detail & Related papers (2025-04-29T14:58:43Z) - Collab: Controlled Decoding using Mixture of Agents for LLM Alignment [90.6117569025754]
Reinforcement learning from human feedback has emerged as an effective technique to align large language models. Controlled Decoding provides a mechanism for aligning a model at inference time without retraining. We propose a mixture of agent-based decoding strategies leveraging existing off-the-shelf aligned LLM policies.
arXiv Detail & Related papers (2025-03-27T17:34:25Z) - Low-Rank Agent-Specific Adaptation (LoRASA) for Multi-Agent Policy Learning [3.333453555166201]
Multi-agent reinforcement learning (MARL) often relies on parameter sharing (PS) to scale efficiently. We propose Low-Rank Agent-Specific Adaptation (LoRASA), a novel approach that treats each agent's policy as a specialized "task" fine-tuned from a shared backbone. We evaluate LoRASA on challenging benchmarks including the StarCraft Multi-Agent Challenge (SMAC) and Multi-Agent MuJoCo (MAMuJoCo).
arXiv Detail & Related papers (2025-02-08T13:57:53Z) - SDPO: Segment-Level Direct Preference Optimization for Social Agents [56.970902914217156]
Social agents powered by large language models (LLMs) can simulate human social behaviors but fall short in handling complex social dialogues. We propose Segment-Level Direct Preference Optimization (SDPO), which dynamically selects key segments within interactions to optimize multi-turn agent behavior.
arXiv Detail & Related papers (2025-01-03T14:09:46Z) - VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment [66.80143024475635]
We propose VinePPO, a straightforward approach that computes unbiased Monte Carlo-based estimates of the value function for credit assignment.
We show that VinePPO consistently outperforms PPO and other RL-free baselines across the MATH and GSM8K datasets.
arXiv Detail & Related papers (2024-10-02T15:49:30Z) - REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL. We find that REBEL provides a unified approach to language modeling and image generation, with performance stronger than or similar to that of PPO and DPO.
arXiv Detail & Related papers (2024-04-25T17:20:45Z) - MASP: Scalable GNN-based Planning for Multi-Agent Navigation [18.70078556851899]
Multi-Agent Scalable Graph-based Planner (MASP) is a goal-conditioned hierarchical planner for navigation tasks. MASP employs a hierarchical framework to reduce space complexity by decomposing a large exploration space into multiple goal-conditioned subspaces. To support agent cooperation and adapt to varying team sizes, we model agents and goals as graphs to better capture their relationships.
arXiv Detail & Related papers (2023-12-05T06:05:04Z) - Maximize to Explore: One Objective Function Fusing Estimation, Planning, and Exploration [87.53543137162488]
We propose an easy-to-implement online reinforcement learning (online RL) framework called MEX.
MEX integrates estimation and planning components while automatically balancing exploration and exploitation.
It can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards.
arXiv Detail & Related papers (2023-05-29T17:25:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.