Related papers: WebAnchor: Anchoring Agent Planning to Stabilize Long-Horizon Web Reasoning

WebAnchor: Anchoring Agent Planning to Stabilize Long-Horizon Web Reasoning

URL: http://arxiv.org/abs/2601.03164v2
Date: Wed, 07 Jan 2026 02:00:39 GMT
Title: WebAnchor: Anchoring Agent Planning to Stabilize Long-Horizon Web Reasoning
Authors: Xinmiao Yu, Liwen Zhang, Xiaocheng Feng, Yong Jiang, Bing Qin, Pengjun Xie, Jingren Zhou,
Abstract summary: Large Language Model(LLM)-based agents have shown strong capabilities in web information seeking.<n>Plan anchor is where the first reasoning step disproportionately impacts downstream behavior in long-horizon web reasoning tasks.<n>We propose Anchor-GRPO, a two-stage RL framework that decouples planning and execution.
Score: 82.12501258760814
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Model(LLM)-based agents have shown strong capabilities in web information seeking, with reinforcement learning (RL) becoming a key optimization paradigm. However, planning remains a bottleneck, as existing methods struggle with long-horizon strategies. Our analysis reveals a critical phenomenon, plan anchor, where the first reasoning step disproportionately impacts downstream behavior in long-horizon web reasoning tasks. Current RL algorithms, fail to account for this by uniformly distributing rewards across the trajectory. To address this, we propose Anchor-GRPO, a two-stage RL framework that decouples planning and execution. In Stage 1, the agent optimizes its first-step planning using fine-grained rubrics derived from self-play experiences and human calibration. In Stage 2, execution is aligned with the initial plan through sparse rewards, ensuring stable and efficient tool usage. We evaluate Anchor-GRPO on four benchmarks: BrowseComp, BrowseComp-Zh, GAIA, and XBench-DeepSearch. Across models from 3B to 30B, Anchor-GRPO outperforms baseline GRPO and First-step GRPO, improving task success and tool efficiency. Notably, WebAnchor-30B achieves 46.0% pass@1 on BrowseComp and 76.4% on GAIA. Anchor-GRPO also demonstrates strong scalability, getting higher accuracy as model size and context length increase.

Related papers

iGRPO: Self-Feedback-Driven LLM Reasoning [88.83313431248473]
Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions.<n>We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts.<n>Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models.
arXiv Detail & Related papers (2026-02-09T18:45:11Z)
TL-GRPO: Turn-Level RL for Reasoning-Guided Iterative Optimization [97.18886232580131]
Large language models have demonstrated strong reasoning capabilities in complex tasks through tool integration.<n>We propose Turn-Level GRPO, a lightweight RL algorithm that performs turn-level group sampling for fine-grained optimization.
arXiv Detail & Related papers (2026-01-23T06:21:33Z)
TreeGRPO: Tree-Advantage GRPO for Online RL Post-Training of Diffusion Models [14.130608036489336]
Reinforcement learning (RL) post-training is crucial for aligning generative models with human preferences, but its prohibitive computational cost remains a major barrier to widespread adoption.<n>We introduce textbfTreeGRPO, a novel RL framework that dramatically improves training efficiency by recasting the denoising process as a search tree.
arXiv Detail & Related papers (2025-12-09T01:17:34Z)
Enhancing Agentic RL with Progressive Reward Shaping and Value-based Sampling Policy Optimization [13.475938754147625]
Large Language Models (LLMs) empowered with Tool-Integrated Reasoning (TIR) can iteratively plan, call external tools, and integrate returned information to solve complex, long-horizon reasoning tasks.<n>Agentic Reinforcement Learning (Agentic RL) optimize such models over full tool-interaction trajectories.<n>Two key challenges hinder effectiveness: (1) Sparse, non-instructive rewards, such as binary 0-1 verifiable signals, provide limited guidance for intermediate steps and slow convergence.<n>We propose two complementary techniques: Progressive Reward Shaping (PRS) and Value-based Sampling Policy Optimization (VSPO).
arXiv Detail & Related papers (2025-12-08T11:59:25Z)
Hi-Agent: Hierarchical Vision-Language Agents for Mobile Device Control [72.43808515668947]
We introduce Hi-Agent, a trainable hierarchical vision-language agent for mobile control.<n>Hi-Agent features a high-level reasoning model and a low-level action model that are jointly optimized.<n>Hi-Agent achieves a new State-Of-The-Art (SOTA) 87.9% task success rate on the Android-in-the-Wild (AitW) benchmark.
arXiv Detail & Related papers (2025-10-16T07:38:21Z)
In-the-Flow Agentic System Optimization for Effective Planning and Tool Use [73.72524040856052]
AgentFlow is a trainable, in-the-flow agentic framework that coordinates four modules (planner, executor, verifier, generator) through an evolving memory.<n>Flow-GRPO tackles long-horizon, sparse-reward credit assignment by converting multi-turn optimization into a sequence of tractable single-turn policy updates.<n>AgentFlow with a 7B-scale backbone outperforms top-performing baselines with average accuracy gains of 14.9% on search, 14.0% on agentic, 14.5% on mathematical, and 4.1% on scientific tasks.
arXiv Detail & Related papers (2025-10-07T05:32:44Z)
TGRPO :Fine-tuning Vision-Language-Action Model via Trajectory-wise Group Relative Policy Optimization [12.061547251822326]
Trajectory-based Group Relative Policy Optimization (TGRPO) is an online RL-based training framework for Visual-Language-Action (VLA) models.<n>We show that TGRPO achieves an average success rate of 80.7%, which is 4.2% higher than that of Supervised Fine-Tuning (SFT) and outperforms other representative RL-based post-training methods.
arXiv Detail & Related papers (2025-06-10T04:27:49Z)
Group-in-Group Policy Optimization for LLM Agent Training [17.243181792126563]
Group-in-Group Policy Optimization (GiGPO) is a novel RL algorithm that achieves fine-grained credit assignment for LLM agents.<n>We evaluate GiGPO on challenging agent benchmarks, including ALFWorld and WebShop, as well as tool-integrated reasoning on search-augmented QA tasks.
arXiv Detail & Related papers (2025-05-16T08:26:59Z)
DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal [55.13854171147104]
Large Language Models (LLMs) have revolutionized various domains, including natural language processing, data analysis, and software development.<n>We present Dynamic Action Re-Sampling (DARS), a novel inference time compute scaling approach for coding agents.<n>We evaluate our approach on SWE-Bench Lite benchmark, demonstrating that this scaling strategy achieves a pass@k score of 55% with Claude 3.5 Sonnet V2.
arXiv Detail & Related papers (2025-03-18T14:02:59Z)
Bigger, Regularized, Optimistic: scaling for compute and sample-efficient continuous control [1.1404490220482764]
BRO is a model-free algorithm to achieve near-optimal policies in the Dog and Humanoid tasks.<n>BRO achieves state-of-the-art results, significantly outperforming the leading model-based and model-free algorithms.<n>BRO is the first model-free algorithm to achieve near-optimal policies in the notoriously challenging Dog and Humanoid tasks.
arXiv Detail & Related papers (2024-05-25T09:53:25Z)
Maximize to Explore: One Objective Function Fusing Estimation, Planning, and Exploration [87.53543137162488]
We propose an easy-to-implement online reinforcement learning (online RL) framework called textttMEX. textttMEX integrates estimation and planning components while balancing exploration exploitation automatically. It can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards.
arXiv Detail & Related papers (2023-05-29T17:25:26Z)
Mastering the Unsupervised Reinforcement Learning Benchmark from Pixels [112.63440666617494]
Reinforcement learning algorithms can succeed but require large amounts of interactions between the agent and the environment. We propose a new method to solve it, using unsupervised model-based RL, for pre-training the agent. We show robust performance on the Real-Word RL benchmark, hinting at resiliency to environment perturbations during adaptation.
arXiv Detail & Related papers (2022-09-24T14:22:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.