Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks
- URL: http://arxiv.org/abs/2602.22817v1
- Date: Thu, 26 Feb 2026 09:58:10 GMT
- Title: Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks
- Authors: Shuo He, Lang Feng, Qi Wei, Xin Cheng, Lei Feng, Bo An
- Abstract summary: Group-based reinforcement learning (RL) has advanced the capabilities of large language models on long-horizon agentic tasks. We find a key issue in estimating stepwise relative advantages, namely context inconsistency, where steps within the same group may differ in their historical contexts. We propose HGPO, which assigns each step to multiple hierarchical groups according to the consistency of historical contexts.
- Score: 23.119173310662365
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Group-based reinforcement learning (RL), such as GRPO, has advanced the capabilities of large language models on long-horizon agentic tasks. To enable more fine-grained policy updates, recent research has increasingly shifted toward stepwise group-based policy optimization, which treats each step in a rollout trajectory independently while using a memory module to retain historical context. However, we find a key issue in estimating stepwise relative advantages, namely context inconsistency, where steps within the same group may differ in their historical contexts. Empirically, we reveal that this issue can lead to severely biased advantage estimation, thereby degrading policy optimization significantly. To address this issue, we propose Hierarchy-of-Groups Policy Optimization (HGPO) for long-horizon agentic tasks. Specifically, within a group of rollout trajectories, HGPO assigns each step to multiple hierarchical groups according to the consistency of their historical contexts. Then, for each step, HGPO computes distinct advantages within each group and aggregates them with an adaptive weighting scheme. In this way, HGPO achieves a favorable bias-variance trade-off in stepwise advantage estimation without extra models or rollouts. Evaluations on two challenging agentic tasks, ALFWorld and WebShop, with Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct show that HGPO significantly outperforms existing agentic RL methods under the same computational constraints. Code is available at https://github.com/langfengQ/verl-agent/tree/master/recipe/hgpo.
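To make the aggregation step concrete, here is a minimal Python sketch of a two-level hierarchy-of-groups advantage estimate. It is not the authors' implementation: the grouping rule (exact context match at the fine level, same-depth pooling at the coarse level) and the size-based adaptive weights are illustrative assumptions; HGPO's actual hierarchy and weighting scheme are defined in the paper.

```python
import numpy as np

def hgpo_advantages(steps, eps=1e-8):
    """Illustrative two-level hierarchy-of-groups advantage estimation.

    steps: list of dicts, each with
      "ctx": tuple encoding the step's historical context
      "ret": the step's scalar return
    Level 0 groups steps with identical context (consistent but small);
    level 1 pools all steps at the same depth (large but inconsistent).
    """
    out = []
    for s in steps:
        groups = [
            [t["ret"] for t in steps if t["ctx"] == s["ctx"]],           # level 0
            [t["ret"] for t in steps if len(t["ctx"]) == len(s["ctx"])], # level 1
        ]
        advs, wts = [], []
        for peers in groups:
            mu, sd = np.mean(peers), np.std(peers)
            advs.append((s["ret"] - mu) / (sd + eps))
            wts.append(len(peers))  # stand-in for the adaptive weights
        w = np.asarray(wts, dtype=float) / sum(wts)
        out.append(float(np.dot(w, advs)))
    return out
```

The trade-off the abstract describes is visible here: the context-consistent group is unbiased but may contain only a few steps (high variance), while the pooled group is large but mixes histories (bias); the weighting arbitrates between the two.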
Related papers
- iGRPO: Self-Feedback-Driven LLM Reasoning [88.83313431248473]
Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models.
arXiv Detail & Related papers (2026-02-09T18:45:11Z)
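A hedged sketch of the two-stage, draft-conditioned sampling this summary describes. The `policy.generate` call and the conditioning template are placeholders, not the paper's actual prompt format:

```python
def igrpo_second_stage(policy, prompt, group_size=8):
    """Stage 1 drafts answers; stage 2 re-answers conditioned on the
    model's own draft. The returned group is then scored with ordinary
    group-relative (GRPO) advantages. Template text is hypothetical."""
    drafts = [policy.generate(prompt) for _ in range(group_size)]
    return [
        policy.generate(f"{prompt}\n\nYour earlier draft:\n{d}\n\nRefine it and answer:")
        for d in drafts
    ]
```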
- TL-GRPO: Turn-Level RL for Reasoning-Guided Iterative Optimization [97.18886232580131]
Large language models have demonstrated strong reasoning capabilities in complex tasks through tool integration. We propose Turn-Level GRPO, a lightweight RL algorithm that performs turn-level group sampling for fine-grained optimization.
arXiv Detail & Related papers (2026-01-23T06:21:33Z)
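A rough sketch of turn-level group sampling as summarized above. Every environment and policy call (`env.state`, `policy.sample_turn`, `env.score`, `env.step`) is a hypothetical stand-in, and the branch-continuation rule is an assumption:

```python
import numpy as np

def turn_level_rollout(env, policy, num_turns, group_size=4, eps=1e-8):
    """At each turn, branch group_size candidate responses from the same
    state; candidates share an identical history, so their rewards are
    directly comparable and yield a turn-level relative advantage."""
    updates = []
    for _ in range(num_turns):
        state = env.state()
        cands = [policy.sample_turn(state) for _ in range(group_size)]
        r = np.array([env.score(state, c) for c in cands])
        advs = (r - r.mean()) / (r.std() + eps)
        updates.append((state, cands, advs))
        env.step(cands[int(r.argmax())])  # continue along one branch
    return updates
```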
- Orchestrating Tokens and Sequences: Dynamic Hybrid Policy Optimization for RLVR [31.43482175098666]
Reinforcement Learning with Verifiable Rewards (RLVR) offers a promising framework for optimizing large language models in reasoning tasks. Existing RLVR algorithms operate at different granularities, each with complementary strengths and limitations. We propose Dynamic Hybrid Policy Optimization (DHPO) to bridge GRPO and GSPO within a single clipped surrogate objective.
arXiv Detail & Related papers (2026-01-09T07:57:40Z)
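One way to read "bridging GRPO and GSPO within a single clipped surrogate" is to mix token-level and sequence-level importance ratios. The sketch below does that with a fixed mixing weight `alpha`; DHPO's point is to set the mix dynamically, and its actual rule is not reproduced here:

```python
import torch

def hybrid_clipped_surrogate(logp_new, logp_old, adv, alpha=0.5, clip=0.2):
    """logp_new/logp_old: per-token log-probs of one response (1-D tensors);
    adv: the response's scalar advantage. Returns the surrogate to maximize."""
    token_ratio = (logp_new - logp_old).exp()        # GRPO-style, per token
    seq_ratio = (logp_new - logp_old).mean().exp()   # GSPO-style, length-normalized
    ratio = alpha * token_ratio + (1 - alpha) * seq_ratio
    return torch.minimum(ratio * adv,
                         ratio.clamp(1 - clip, 1 + clip) * adv).mean()
```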
- GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization [133.27496265096445]
Group Relative Policy Optimization has been applied in multi-reward settings without its suitability being examined. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method to resolve the resulting issues. GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
arXiv Detail & Related papers (2026-01-08T18:59:24Z)
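The decoupling idea is straightforward to illustrate: instead of summing the rewards and normalizing once (plain GRPO), normalize each reward stream within the group on its own and then combine. Equal combination weights are an assumption:

```python
import numpy as np

def decoupled_group_advantages(reward_matrix, weights=None, eps=1e-8):
    """reward_matrix: shape (group_size, num_rewards). Each reward column
    is group-normalized separately, so a high-variance reward cannot
    drown out the others after aggregation."""
    r = np.asarray(reward_matrix, dtype=float)
    per_reward = (r - r.mean(axis=0)) / (r.std(axis=0) + eps)
    if weights is None:
        weights = np.full(r.shape[1], 1.0 / r.shape[1])
    return per_reward @ weights  # one advantage per group member
```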
- Sample By Step, Optimize By Chunk: Chunk-Level GRPO For Text-to-Image Generation [29.015994347609936]
Group Relative Policy Optimization (GRPO) has shown strong potential for flow-matching-based text-to-image (T2I) generation. We argue that shifting the optimization paradigm from the step level to the chunk level can effectively alleviate the issues of step-level optimization. Chunk-GRPO is the first chunk-level GRPO-based approach for T2I generation.
arXiv Detail & Related papers (2025-10-24T15:50:36Z)
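A minimal sketch of the chunk-level idea: sampling stays step-by-step, but per-step log-probability terms are summed within chunks so ratios and advantages are computed per chunk rather than per step. Fixed-size chunk boundaries are an assumption; the paper's chunking strategy may differ:

```python
def chunk_logprobs(step_logps, chunk_size):
    """step_logps: per-step log-prob terms of one sampled denoising
    trajectory. Returns one summed log-prob per chunk, each of which
    feeds a chunk-level GRPO ratio."""
    return [sum(step_logps[i:i + chunk_size])
            for i in range(0, len(step_logps), chunk_size)]
```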
- Solving the Granularity Mismatch: Hierarchical Preference Learning for Long-Horizon LLM Agents [56.625878022978945]
Large Language Models (LLMs) as autonomous agents are increasingly tasked with solving complex, long-horizon problems. Direct Preference Optimization (DPO) provides a signal that is too coarse for precise credit assignment, while step-level DPO is often too myopic to capture the value of multi-step behaviors. We introduce Hierarchical Preference Learning (HPL), a hierarchical framework that optimizes LLM agents by leveraging preference signals at multiple, synergistic granularities.
arXiv Detail & Related papers (2025-09-26T08:43:39Z)
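A hedged sketch of preference learning at two synergistic granularities: a standard DPO term on whole trajectories plus averaged step-level terms. The combination weight `lam` and the choice of exactly two levels are assumptions, not HPL's published scheme:

```python
import torch
import torch.nn.functional as F

def dpo_term(logp_w, logp_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss for one chosen/rejected pair (tensor log-probs)."""
    return -F.logsigmoid(beta * ((logp_w - ref_w) - (logp_l - ref_l)))

def hierarchical_preference_loss(traj_pair, step_pairs, lam=0.5):
    # traj_pair / step_pairs are (logp_w, logp_l, ref_w, ref_l) tuples.
    traj_loss = dpo_term(*traj_pair)
    step_loss = torch.stack([dpo_term(*p) for p in step_pairs]).mean()
    return traj_loss + lam * step_loss
```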
- On the Theory and Practice of GRPO: A Trajectory-Corrected Approach with Fast Convergence [2.8165669455824696]
Group Relative Policy Optimization is a critic-free reinforcement learning algorithm. We show that the GRPO update rule estimates the policy gradient at the old policy rather than the current one. We propose a new algorithm: Trajectory-Level Importance-Corrected GRPO.
arXiv Detail & Related papers (2025-08-04T19:01:19Z)
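The correction can be sketched as a REINFORCE-style estimator whose score terms are weighted by a detached trajectory-level ratio, which re-centers the gradient estimate on the current policy even though the samples come from the old one. Clipping and variance-control details are omitted, and this is a reading of the summary rather than the paper's exact objective:

```python
import torch

def tic_grpo_objective(logp_new, logp_old, adv):
    """logp_new/logp_old: summed log-probs of whole trajectories (1-D
    tensors over a group); adv: group-relative advantages. The detached
    ratio pi_new/pi_old reweights old-policy samples so the gradient
    E[ratio * adv * grad logp_new] targets the current policy."""
    traj_ratio = (logp_new - logp_old).detach().exp()
    return (traj_ratio * adv * logp_new).mean()  # surrogate to maximize
```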
- DISCO Balances the Scales: Adaptive Domain- and Difficulty-Aware Reinforcement Learning on Imbalanced Data [65.09939942413651]
We propose a principled extension to GRPO that addresses inter-group imbalance with two key innovations. Domain-aware reward scaling counteracts frequency bias by reweighting optimization based on domain prevalence. Difficulty-aware reward scaling leverages prompt-level self-consistency to identify and prioritize uncertain prompts that offer greater learning value.
arXiv Detail & Related papers (2025-05-21T03:43:29Z)
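A toy version of the two scalings: an inverse domain-frequency weight, and a difficulty weight that peaks where the group's self-consistency is most uncertain (about half the rollouts correct). The exact scaling functions are assumptions, not DISCO's formulas:

```python
import numpy as np

def scaled_reward(reward, domain, domain_counts, group_correct):
    """domain_counts: dict mapping domain -> #prompts in the data;
    group_correct: per-rollout 0/1 correctness for this prompt's group."""
    total = sum(domain_counts.values())
    domain_w = total / (len(domain_counts) * domain_counts[domain])  # rare domains up-weighted
    p = float(np.mean(group_correct))          # self-consistency estimate
    difficulty_w = 1.0 - abs(2.0 * p - 1.0)    # peaks at p = 0.5
    return reward * domain_w * difficulty_w
```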
- Group-in-Group Policy Optimization for LLM Agent Training [17.243181792126563]
Group-in-Group Policy Optimization (GiGPO) is a novel RL algorithm that achieves fine-grained credit assignment for LLM agents. We evaluate GiGPO on challenging agent benchmarks, including ALFWorld and WebShop, as well as tool-integrated reasoning on search-augmented QA tasks.
arXiv Detail & Related papers (2025-05-16T08:26:59Z)
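GiGPO's fine-grained credit assignment is commonly described as two nested groups: the usual episode-level GRPO group, plus step-level groups formed from "anchor" states that recur across trajectories of the same group. A minimal sketch, with the combination weight `w` as an assumption:

```python
import numpy as np

def gigpo_advantages(trajectories, w=1.0, eps=1e-8):
    """trajectories: list of (episode_return, [(state_key, step_return), ...]).
    Returns per-step advantages combining episode- and step-level signals."""
    ep_rets = np.array([ep for ep, _ in trajectories])
    ep_adv = (ep_rets - ep_rets.mean()) / (ep_rets.std() + eps)

    buckets = {}  # step returns grouped by recurring (anchor) state
    for _, steps in trajectories:
        for key, ret in steps:
            buckets.setdefault(key, []).append(ret)

    out = []
    for i, (_, steps) in enumerate(trajectories):
        advs = []
        for key, ret in steps:
            b = np.asarray(buckets[key], dtype=float)
            advs.append(float(ep_adv[i]) + w * (ret - b.mean()) / (b.std() + eps))
        out.append(advs)
    return out
```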
- Multi-Task Off-Policy Learning from Bandit Feedback [54.96011624223482]
We propose a hierarchical off-policy optimization algorithm (HierOPO), which estimates the parameters of the hierarchical model and then acts pessimistically with respect to them.
We prove per-task bounds on the suboptimality of the learned policies, which show a clear improvement over not using the hierarchical model.
Our theoretical and empirical results show a clear advantage of using the hierarchy over solving each task independently.
arXiv Detail & Related papers (2022-12-09T08:26:27Z)
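A hedged sketch of hierarchical pessimism in the HierOPO spirit: per-task mean-reward estimates are shrunk toward a shared hyper-mean, and a lower-confidence-bound penalty is subtracted before acting. The Gaussian shrinkage and the LCB form are illustrative assumptions, not the paper's exact estimator:

```python
import numpy as np

def pessimistic_task_values(task_means, task_counts, sigma=1.0, tau=1.0, alpha=1.0):
    """task_means/task_counts: per-task empirical means and sample counts.
    Posterior shrinkage toward the cross-task mean implements the hierarchy;
    subtracting alpha * posterior-std implements pessimism."""
    m = np.asarray(task_means, dtype=float)
    n = np.asarray(task_counts, dtype=float)
    mu_hyper = m.mean()                              # shared hyper-mean
    prec = n / sigma**2 + 1.0 / tau**2               # posterior precision
    post_mean = (n / sigma**2 * m + mu_hyper / tau**2) / prec
    post_std = np.sqrt(1.0 / prec)
    return post_mean - alpha * post_std              # act greedily w.r.t. this
```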