Related papers: FS-Researcher: Test-Time Scaling for Long-Horizon Research Tasks with File-System-Based Agents

URL: http://arxiv.org/abs/2602.01566v1
Date: Mon, 02 Feb 2026 03:00:19 GMT
Title: FS-Researcher: Test-Time Scaling for Long-Horizon Research Tasks with File-System-Based Agents
Authors: Chiwei Zhu, Benfeng Xu, Mingxuan Du, Shaohan Wang, Xiaorui Wang, Zhendong Mao, Yongdong Zhang
Abstract summary: We introduce FS-Researcher, a file-system-based framework that scales deep research beyond the context window via a persistent workspace. A Context Builder agent browses the internet, writes structured notes, and archives raw sources into a hierarchical knowledge base that can grow far beyond context length. A Report Writer agent then composes the final report section by section, treating the knowledge base as the source of facts.
Score: 53.03492387564392
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Deep research is emerging as a representative long-horizon task for large language model (LLM) agents. However, long trajectories in deep research often exceed model context limits, compressing token budgets for both evidence collection and report writing, and preventing effective test-time scaling. We introduce FS-Researcher, a file-system-based, dual-agent framework that scales deep research beyond the context window via a persistent workspace. Specifically, a Context Builder agent acts as a librarian which browses the internet, writes structured notes, and archives raw sources into a hierarchical knowledge base that can grow far beyond context length. A Report Writer agent then composes the final report section by section, treating the knowledge base as the source of facts. In this framework, the file system serves as a durable external memory and a shared coordination medium across agents and sessions, enabling iterative refinement beyond the context window. Experiments on two open-ended benchmarks (DeepResearch Bench and DeepConsult) show that FS-Researcher achieves state-of-the-art report quality across different backbone models. Further analyses demonstrate a positive correlation between final report quality and the computation allocated to the Context Builder, validating effective test-time scaling under the file-system paradigm. The code and data are anonymously open-sourced at https://github.com/Ignoramus0817/FS-Researcher.
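As a rough illustration of the file-system idea described in the abstract, the sketch below stores notes in a directory tree and then assembles a report one section at a time by reading back only the notes for that section. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (save_note, load_topic, write_report, demo), the one-folder-per-topic layout, and the sample note text are all hypothetical, and no LLM or web browsing is involved.

```python
import tempfile
from pathlib import Path


def save_note(kb_root: Path, topic: str, name: str, text: str) -> Path:
    """Builder side (hypothetical): persist one note at kb_root/<topic>/<name>.md."""
    note_path = kb_root / topic / f"{name}.md"
    note_path.parent.mkdir(parents=True, exist_ok=True)
    note_path.write_text(text, encoding="utf-8")
    return note_path


def load_topic(kb_root: Path, topic: str) -> list[str]:
    """Writer side (hypothetical): read only the notes of one topic, in file-name order."""
    return [p.read_text(encoding="utf-8") for p in sorted((kb_root / topic).glob("*.md"))]


def write_report(kb_root: Path, outline: list[str]) -> str:
    """Compose the report section by section, so only one topic is loaded at a time."""
    sections = []
    for topic in outline:
        notes = load_topic(kb_root, topic)
        body = "\n".join(f"- {note}" for note in notes)
        sections.append(f"## {topic}\n{body}")
    return "\n\n".join(sections)


def demo() -> str:
    """Build a tiny knowledge base in a temporary directory and write a report from it."""
    with tempfile.TemporaryDirectory() as tmp:
        kb = Path(tmp) / "knowledge_base"
        save_note(kb, "background", "01_context_limits",
                  "Long trajectories can exceed context limits.")
        save_note(kb, "method", "01_workspace",
                  "A persistent workspace holds notes and sources.")
        return write_report(kb, ["background", "method"])


if __name__ == "__main__":
    print(demo())
```

Because the notes live on disk rather than in the prompt, the knowledge base in this sketch can grow without changing how much text is read for any single section.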

Related papers

AgentCPM-Report: Interleaving Drafting and Deepening for Open-Ended Deep Research [85.51475655916026]
AgentCPM-Report is a lightweight yet high-performing local solution composed of a framework that mirrors the human writing process. Our framework uses a Writing As Reasoning Policy (WARP), which enables models to dynamically revise outlines. Experiments on DeepResearch Bench, DeepConsult, and DeepResearch Gym demonstrate that AgentCPM-Report outperforms leading closed-source systems.
arXiv Detail & Related papers (2026-02-06T09:45:04Z)
InfiAgent: An Infinite-Horizon Framework for General-Purpose Autonomous Agents [36.740230738304525]
InfiAgent keeps the agent's reasoning context strictly bounded regardless of task duration. InfiAgent with a 20B open-source model is competitive with larger proprietary systems.
arXiv Detail & Related papers (2026-01-06T17:35:57Z)
LongDA: Benchmarking LLM Agents for Long-Document Data Analysis [55.32211515932351]
LongDA targets real-world settings in which navigating long documentation and complex data is the primary bottleneck. LongTA is a tool-augmented agent framework that enables document access, retrieval, and code execution. Our experiments reveal substantial performance gaps even among state-of-the-art models.
arXiv Detail & Related papers (2026-01-05T23:23:16Z)
BookRAG: A Hierarchical Structure-aware Index-based Approach for Retrieval-Augmented Generation on Complex Documents [11.158307125677375]
Retrieval-Augmented Generation (RAG) queries highly relevant information from external complex documents. We introduce BookRAG, a novel RAG approach targeted for documents with a hierarchical structure. BookRAG achieves state-of-the-art performance, significantly outperforming baselines in both retrieval recall and QA accuracy.
arXiv Detail & Related papers (2025-12-03T03:40:49Z)
Resolving Evidence Sparsity: Agentic Context Engineering for Long-Document Understanding [49.26132236798123]
Vision Language Models (VLMs) have gradually become a primary approach in document understanding. We propose SLEUTH, a multi-agent framework that orchestrates a retriever and four collaborative agents in a coarse-to-fine process. The framework identifies key textual and visual clues within the retrieved pages, filters for salient visual evidence such as tables and charts, and analyzes the query to devise a reasoning strategy.
arXiv Detail & Related papers (2025-11-28T03:09:40Z)
SurveyG: A Multi-Agent LLM Framework with Hierarchical Citation Graph for Automated Survey Generation [4.512335376984058]
Large language models (LLMs) are increasingly adopted for automating survey paper generation. We propose SurveyG, an LLM-based agent framework that integrates a hierarchical citation graph. The graph is organized into three layers: Foundation, Development, and Frontier, to capture the evolution of research from seminal works to incremental advances and emerging directions.
arXiv Detail & Related papers (2025-10-09T03:14:20Z)
WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research [73.58638285105971]
This paper tackles open-ended deep research (OEDR), a complex challenge where AI agents must synthesize vast web-scale information into insightful reports. We introduce WebWeaver, a novel dual-agent framework that emulates the human research process. Our framework establishes a new state-of-the-art across major OEDR benchmarks, including DeepResearch Bench, DeepConsult, and DeepResearchGym.
arXiv Detail & Related papers (2025-09-16T17:57:21Z)
Benchmarking Deep Search over Heterogeneous Enterprise Data [73.55304268238474]
We present a new benchmark for evaluating a form of retrieval-augmented generation (RAG) that requires source-aware, multi-hop reasoning over diverse, sparse, but related sources. We build it using a synthetic data pipeline that simulates business across product planning, development, and support stages.
arXiv Detail & Related papers (2025-06-29T08:34:59Z)
ELITE: Embedding-Less retrieval with Iterative Text Exploration [5.8851517822935335]
Large Language Models (LLMs) have achieved impressive progress in natural language processing. Their limited ability to retain long-term context constrains performance on document-level or multi-turn tasks.
arXiv Detail & Related papers (2025-05-17T08:48:43Z)
KILT: a Benchmark for Knowledge Intensive Language Tasks [102.33046195554886]
We present a benchmark for knowledge-intensive language tasks (KILT). All tasks in KILT are grounded in the same snapshot of Wikipedia. We find that a shared dense vector index coupled with a seq2seq model is a strong baseline.
arXiv Detail & Related papers (2020-09-04T15:32:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.