IterResearch: Rethinking Long-Horizon Agents via Markovian State Reconstruction
- URL: http://arxiv.org/abs/2511.07327v1
- Date: Mon, 10 Nov 2025 17:30:08 GMT
- Title: IterResearch: Rethinking Long-Horizon Agents via Markovian State Reconstruction
- Authors: Guoxin Chen, Zile Qiao, Xuanzhong Chen, Donglei Yu, Haotian Xu, Wayne Xin Zhao, Ruihua Song, Wenbiao Yin, Huifeng Yin, Liwen Zhang, Kuan Li, Minpeng Liao, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou
- Abstract summary: IterResearch is a novel iterative deep-research paradigm that reformulates long-horizon research as a Markov Decision Process. It achieves substantial improvements over existing open-source agents, with an average gain of +14.5pp across six benchmarks, and serves as an effective prompting strategy, improving frontier models by up to 19.2pp over ReAct on long-horizon tasks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in deep-research agents have shown promise for autonomous knowledge construction through dynamic reasoning over external sources. However, existing approaches rely on a mono-contextual paradigm that accumulates all information in a single, expanding context window, leading to context suffocation and noise contamination that limit their effectiveness on long-horizon tasks. We introduce IterResearch, a novel iterative deep-research paradigm that reformulates long-horizon research as a Markov Decision Process with strategic workspace reconstruction. By maintaining an evolving report as memory and periodically synthesizing insights, our approach preserves consistent reasoning capacity across arbitrary exploration depths. We further develop Efficiency-Aware Policy Optimization (EAPO), a reinforcement learning framework that incentivizes efficient exploration through geometric reward discounting and enables stable distributed training via adaptive downsampling. Extensive experiments demonstrate that IterResearch achieves substantial improvements over existing open-source agents, with an average gain of +14.5pp across six benchmarks, and narrows the gap with frontier proprietary systems. Remarkably, our paradigm exhibits unprecedented interaction scaling, extending to 2048 interactions with dramatic performance gains (from 3.5% to 42.5%), and serves as an effective prompting strategy, improving frontier models by up to 19.2pp over ReAct on long-horizon tasks. These findings position IterResearch as a versatile solution for long-horizon reasoning, effective both as a trained agent and as a prompting paradigm for frontier models.
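To make the paradigm concrete, below is a minimal Python sketch of the iterative loop as the abstract describes it: each round the agent conditions on a reconstructed workspace (question, evolving report, latest observation) rather than the full interaction history, and an EAPO-style geometric discount rewards reaching the answer in fewer rounds. Every name here (Workspace, Decision, iter_research, eapo_return, the llm/summarize/tools callables, and the gamma value) is an illustrative assumption, not the authors' implementation.

```python
# Minimal, hypothetical sketch of an IterResearch-style loop, based only on
# the abstract above. All names and signatures are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Decision:
    final_answer: Optional[str] = None  # set when the agent decides to stop
    tool: str = ""                      # otherwise: which tool to call next
    args: str = ""                      # and with what arguments


@dataclass
class Workspace:
    """Markovian state: rebuilt every round instead of growing without bound."""
    question: str
    report: str = ""            # evolving report that serves as memory
    last_observation: str = ""  # only the most recent tool result is kept


def iter_research(question: str,
                  llm: Callable[[str], Decision],
                  summarize: Callable[[Workspace], str],
                  tools: Dict[str, Callable[[str], str]],
                  max_rounds: int = 8) -> str:
    ws = Workspace(question)
    for _ in range(max_rounds):
        # The model sees a bounded, reconstructed workspace rather than the
        # full accumulated history of the mono-contextual baseline.
        prompt = (f"Question: {ws.question}\n"
                  f"Report so far: {ws.report}\n"
                  f"Latest observation: {ws.last_observation}")
        decision = llm(prompt)
        if decision.final_answer is not None:
            return decision.final_answer
        ws.last_observation = tools[decision.tool](decision.args)
        # Periodically synthesize new evidence into the report (memory).
        ws.report = summarize(ws)
    return ws.report


def eapo_return(outcome_reward: float, rounds_used: int,
                gamma: float = 0.95) -> float:
    """One plausible reading of EAPO's geometric discounting: the same
    outcome earns less credit the more interaction rounds it consumed."""
    return outcome_reward * gamma ** rounds_used
```

Because the prompt is rebuilt from a fixed set of fields, its size stays roughly constant in the number of rounds, which is the property that plausibly enables the 2048-interaction scaling reported above.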
Related papers
- Think Longer to Explore Deeper: Learn to Explore In-Context via Length-Incentivized Reinforcement Learning [53.58654277639939]
In-context exploration is the intrinsic ability to generate, verify, and refine hypotheses within a single continuous context. We propose Length-Incentivized Exploration, which explicitly encourages models to explore more. Our method achieves an average improvement of 4.4% on in-domain tasks and a 2.7% gain on out-of-domain benchmarks.
arXiv Detail & Related papers (2026-02-12T09:24:32Z)
- Deep Researcher with Sequential Plan Reflection and Candidates Crossover (Deep Researcher Reflect Evolve) [0.0]
This paper introduces a novel Deep Researcher architecture designed to generate detailed research reports on complex PhD-level topics. Our system utilizes two key innovations: Sequential Research Plan Refinement via Reflection and a Candidates Crossover algorithm. Our architecture achieved an overall score of 46.21, demonstrating superior performance by surpassing leading deep research agents.
arXiv Detail & Related papers (2026-01-28T18:45:39Z)
- Step-DeepResearch Technical Report [90.50586290399683]
We introduce Step-DeepResearch, a cost-effective, end-to-end agent. We propose a Data Synthesis Strategy Based on Atomic Capabilities to reinforce planning and report writing. To bridge the evaluation gap in the Chinese domain, we establish ADR-Bench for realistic deep research scenarios.
arXiv Detail & Related papers (2025-12-23T16:32:27Z)
- Thinking Forward and Backward: Multi-Objective Reinforcement Learning for Retrieval-Augmented Reasoning [137.33138614095435]
Retrieval-augmented generation (RAG) has proven to be effective in mitigating hallucinations in large language models. Recent efforts have incorporated search-based interactions into RAG, enabling iterative reasoning with real-time retrieval. We propose Bi-RAR, a novel retrieval-augmented reasoning framework that evaluates each intermediate step jointly in both forward and backward directions.
arXiv Detail & Related papers (2025-11-12T08:29:39Z)
- Beyond Turn Limits: Training Deep Search Agents with Dynamic Context Window [88.85901839023803]
DeepMiner is a novel framework that elicits long-horizon search abilities by introducing high-difficulty training tasks and a dynamic context window. We develop DeepMiner-32B, which achieves substantial performance improvements across multiple search agent benchmarks.
arXiv Detail & Related papers (2025-10-09T14:31:39Z)
- Efficient On-Policy Reinforcement Learning via Exploration of Sparse Parameter Space [15.65017469378437]
Policy-gradient methods such as PPO update along a single gradient direction, leaving the rich local structure of the parameter space unexplored. Previous work has shown that the surrogate gradient is often poorly correlated with the true reward landscape. We introduce ExploRLer, a pluggable pipeline that seamlessly integrates with on-policy algorithms such as PPO and TRPO.
arXiv Detail & Related papers (2025-09-30T07:13:55Z)
- DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search [53.27052683356095]
We present DeepSearch, a framework that integrates Monte Carlo Tree Search directly into RLVR training. In contrast to existing methods that rely on tree search only at inference, DeepSearch embeds structured search into the training loop. Our contributions include: (1) a global frontier selection strategy that prioritizes promising nodes across the search tree, (2) selection with entropy-based guidance that identifies confident paths for supervision, and (3) adaptive replay buffer training with solution caching for efficiency.
arXiv Detail & Related papers (2025-09-29T20:00:29Z)
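The frontier-selection idea in points (1) and (2) of that summary can be illustrated with a small, hypothetical sketch: nodes from anywhere in the tree compete in one priority queue, scored by value with an entropy penalty. The Node fields, the scoring rule, and the entropy_weight parameter are our assumptions, not the paper's definitions.

```python
# Hypothetical sketch of global frontier selection with entropy-based
# guidance, based only on the summary above; the scoring rule is assumed.
import heapq
from dataclasses import dataclass, field
from typing import List


@dataclass(order=True)
class Node:
    priority: float                       # negated score; heapq pops smallest
    state: str = field(compare=False, default="")
    value: float = field(compare=False, default=0.0)
    entropy: float = field(compare=False, default=0.0)


def push_node(frontier: List[Node], state: str, value: float, entropy: float,
              entropy_weight: float = 0.1) -> None:
    # Prefer high-value nodes; penalize high entropy so that confident
    # (low-entropy) paths are selected for supervision first.
    score = value - entropy_weight * entropy
    heapq.heappush(frontier, Node(-score, state, value, entropy))


def select_frontier_node(frontier: List[Node]) -> Node:
    # Global selection: pop the best node across the whole tree,
    # rather than descending a single root-to-leaf path.
    return heapq.heappop(frontier)


frontier: List[Node] = []
push_node(frontier, "root/child-a", value=0.8, entropy=1.2)
push_node(frontier, "root/child-b", value=0.7, entropy=0.1)
best = select_frontier_node(frontier)  # child-b wins under this weighting
```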
- WebResearcher: Unleashing unbounded reasoning capability in Long-Horizon Agents [72.28593628378991]
WebResearcher is an iterative deep-research paradigm that reformulates deep research as a Markov Decision Process. WebResearcher achieves state-of-the-art performance, even surpassing frontier proprietary systems.
arXiv Detail & Related papers (2025-09-16T17:57:17Z)
- Deep Research: A Survey of Autonomous Research Agents [33.96146020332329]
The rapid advancement of large language models (LLMs) has driven the development of agentic systems capable of autonomously performing complex tasks. To overcome the limitations of such systems, the paradigm of deep research has been proposed, wherein agents actively engage in planning, retrieval, and synthesis to generate comprehensive and faithful analytical reports grounded in web-based evidence. We provide a systematic overview of the deep research pipeline, which comprises four core stages: planning, question developing, web exploration, and report generation.
arXiv Detail & Related papers (2025-08-18T09:26:14Z)
- Learning to Plan Optimistically: Uncertainty-Guided Deep Exploration via Latent Model Ensembles [73.15950858151594]
This paper presents Latent Optimistic Value Exploration (LOVE), a strategy that enables deep exploration through optimism in the face of uncertain long-term rewards.
We combine latent world models with value function estimation to predict infinite-horizon returns and recover associated uncertainty via ensembling.
We apply LOVE to visual robot control tasks in continuous action spaces and demonstrate on average more than 20% improved sample efficiency in comparison to state-of-the-art and other exploration objectives.
arXiv Detail & Related papers (2020-10-27T22:06:57Z)