Related papers: Solving the Granularity Mismatch: Hierarchical Preference Learning for Long-Horizon LLM Agents

Solving the Granularity Mismatch: Hierarchical Preference Learning for Long-Horizon LLM Agents

URL: http://arxiv.org/abs/2510.03253v1
Date: Fri, 26 Sep 2025 08:43:39 GMT
Title: Solving the Granularity Mismatch: Hierarchical Preference Learning for Long-Horizon LLM Agents
Authors: Heyang Gao, Zexu Sun, Erxue Min, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Xu Chen,
Abstract summary: Large Language Models (LLMs) as autonomous agents are increasingly tasked with solving complex, long-horizon problems.<n>Direct Preference Optimization (DPO) provides a signal that is too coarse for precise credit assignment, while step-level DPO is often too myopic to capture the value of multi-step behaviors.<n>We introduce Hierarchical Preference Learning (HPL), a hierarchical framework that optimize LLM agents by leveraging preference signals at multiple, synergistic granularities.
Score: 56.625878022978945
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) as autonomous agents are increasingly tasked with solving complex, long-horizon problems. Aligning these agents via preference-based offline methods like Direct Preference Optimization (DPO) is a promising direction, yet it faces a critical granularity mismatch. Trajectory-level DPO provides a signal that is too coarse for precise credit assignment, while step-level DPO is often too myopic to capture the value of multi-step behaviors. To resolve this challenge, we introduce Hierarchical Preference Learning (HPL), a hierarchical framework that optimizes LLM agents by leveraging preference signals at multiple, synergistic granularities. While HPL incorporates trajectory- and step-level DPO for global and local policy stability, its core innovation lies in group-level preference optimization guided by a dual-layer curriculum. Our approach first decomposes expert trajectories into semantically coherent action groups and then generates contrasting suboptimal groups to enable preference learning at a fine-grained, sub-task level. Then, instead of treating all preference pairs equally, HPL introduces a curriculum scheduler that organizes the learning process from simple to complex. This curriculum is structured along two axes: the group length, representing sub-task complexity, and the sample difficulty, defined by the reward gap between preferred and dispreferred action groups. Experiments on three challenging agent benchmarks show that HPL outperforms existing state-of-the-art methods. Our analyses demonstrate that the hierarchical DPO loss effectively integrates preference signals across multiple granularities, while the dual-layer curriculum is crucial for enabling the agent to solve a wide range of tasks, from simple behaviors to complex multi-step sequences.

Related papers

TSPO: Breaking the Double Homogenization Dilemma in Multi-turn Search Policy Optimization [32.17940023097263]
Multi-turn tool-integrated reasoning enables Large Language Models (LLMs) to solve complex tasks through iterative information retrieval.<n>Current reinforcement learning (RL) frameworks for search-augmented reasoning rely on sparse outcome-level rewards.<n>We propose Turn-level Stage-aware Policy Optimization (TSPO) to address this problem.
arXiv Detail & Related papers (2026-01-30T09:58:45Z)
IB-GRPO: Aligning LLM-based Learning Path Recommendation with Educational Objectives via Indicator-Based Group Relative Policy Optimization [20.87328464098245]
Learning Path Recommendation (LPR) aims to generate personalized sequences of learning items that maximize long-term learning effect.<n>LLMs offer rich semantic understanding for free-form recommendation, applying them to long-horizon LPR is challenging.<n>We propose IB-GRPO, an indicator-guided alignment approach for LLM-based LPR.
arXiv Detail & Related papers (2026-01-21T06:03:05Z)
Grounded Test-Time Adaptation for LLM Agents [75.62784644919803]
Large language model (LLM)-based agents struggle to generalize to novel and complex environments.<n>We propose two strategies for adapting LLM agents by leveraging environment-specific information available during deployment.
arXiv Detail & Related papers (2025-11-06T22:24:35Z)
Agentic Reinforced Policy Optimization [66.96989268893932]
Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks.<n>Current RL algorithms inadequately balance the models' intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions.<n>We propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents.
arXiv Detail & Related papers (2025-07-26T07:53:11Z)
Heterogeneous Group-Based Reinforcement Learning for LLM-based Multi-Agent Systems [25.882461853973897]
We propose Multi-Agent Heterogeneous Group Policy Optimization (MHGPO), which guides policy updates by estimating relative reward advantages.<n>MHGPO eliminates the need for Critic networks, enhancing stability and reducing computational overhead.<n>We also introduce three group rollout sampling strategies that trade off between efficiency and effectiveness.
arXiv Detail & Related papers (2025-06-03T10:17:19Z)
MisoDICE: Multi-Agent Imitation from Unlabeled Mixed-Quality Demonstrations [5.4482836906033585]
We study offline imitation learning (IL) in cooperative multi-agent settings, where demonstrations have unlabeled mixed quality.<n>Our proposed solution is structured in two stages: trajectory labeling and multi-agent imitation learning.<n>We introduce MisoDICE, a novel multi-agent IL algorithm that leverages these labels to learn robust policies.
arXiv Detail & Related papers (2025-05-24T08:43:42Z)
Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Reward Design [35.544075583073685]
We present the first systematic study of textitturn-level reward design for multi-turn RL algorithms and agent applications.<n>We conduct case studies on multi-turn reasoning-augmented search agents, where we carefully design two types of turn-level rewards: verifiable and LLM-as-judge.<n>Our experiments on multi-turn search tasks demonstrate that incorporating well-designed turn-level rewards enables RL algorithms to significantly outperform baseline methods with trajectory-level rewards.
arXiv Detail & Related papers (2025-05-17T04:09:46Z)
Group-in-Group Policy Optimization for LLM Agent Training [17.243181792126563]
Group-in-Group Policy Optimization (GiGPO) is a novel RL algorithm that achieves fine-grained credit assignment for LLM agents.<n>We evaluate GiGPO on two challenging agent benchmarks, ALFWorld and WebShop, using Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct.
arXiv Detail & Related papers (2025-05-16T08:26:59Z)
Collab: Controlled Decoding using Mixture of Agents for LLM Alignment [90.6117569025754]
Reinforcement learning from human feedback has emerged as an effective technique to align Large Language models.<n>Controlled Decoding provides a mechanism for aligning a model at inference time without retraining.<n>We propose a mixture of agent-based decoding strategies leveraging the existing off-the-shelf aligned LLM policies.
arXiv Detail & Related papers (2025-03-27T17:34:25Z)
SDPO: Segment-Level Direct Preference Optimization for Social Agents [56.970902914217156]
Social agents powered by large language models (LLMs) can simulate human social behaviors but fall short in handling complex social dialogues.<n>We propose Segment-Level Direct Preference Optimization (SDPO), which dynamically select key segments within interactions to optimize multi-turn agent behavior.
arXiv Detail & Related papers (2025-01-03T14:09:46Z)
Learning Reward and Policy Jointly from Demonstration and Preference Improves Alignment [58.049113055986375]
We develop a single stage approach named Alignment with Integrated Human Feedback (AIHF) to train reward models and the policy.<n>The proposed approach admits a suite of efficient algorithms, which can easily reduce to, and leverage, popular alignment algorithms.<n>We demonstrate the efficiency of the proposed solutions with extensive experiments involving alignment problems in LLMs and robotic control problems in MuJoCo.
arXiv Detail & Related papers (2024-06-11T01:20:53Z)
Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks. We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level. We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
arXiv Detail & Related papers (2024-02-09T07:45:26Z)

This list is automatically generated from the titles and abstracts of the papers in this site.