DeepResearch-9K: A Challenging Benchmark Dataset of Deep-Research Agent
- URL: http://arxiv.org/abs/2603.01152v1
- Date: Sun, 01 Mar 2026 15:36:10 GMT
- Title: DeepResearch-9K: A Challenging Benchmark Dataset of Deep-Research Agent
- Authors: Tongzhou Wu, Yuhao Wang, Xinyu Ma, Xiuqiang He, Shuaiqiang Wang, Dawei Yin, Xiangyu Zhao,
- Abstract summary: DeepResearch-9K is a large-scale, challenging dataset for deep-research scenarios. DeepResearch-R1 is an open-source training framework for deep-research agents.
- Score: 63.52637950356965
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep-research agents are capable of executing multi-step web exploration, targeted retrieval, and sophisticated question answering. Despite their powerful capabilities, deep-research agents face two critical bottlenecks: (1) the lack of large-scale, challenging datasets with real-world difficulty, and (2) the absence of accessible, open-source frameworks for data synthesis and agent training. To bridge these gaps, we first construct DeepResearch-9K, a large-scale, challenging dataset specifically designed for deep-research scenarios, built from open-source multi-hop question-answering (QA) datasets via a low-cost autonomous pipeline. Notably, it consists of (1) 9,000 questions spanning three difficulty levels, from L1 to L3; (2) high-quality search trajectories with reasoning chains from Tongyi-DeepResearch-30B-A3B, a state-of-the-art deep-research agent; and (3) verifiable answers. Furthermore, we develop an open-source training framework, DeepResearch-R1, that supports (1) multi-turn web interactions, (2) different reinforcement learning (RL) approaches, and (3) different reward models, such as rule-based outcome rewards and LLM-as-judge feedback. Finally, empirical results demonstrate that agents trained on DeepResearch-9K with our DeepResearch-R1 framework achieve state-of-the-art results on challenging deep-research benchmarks. We release the DeepResearch-9K dataset at https://huggingface.co/datasets/artillerywu/DeepResearch-9K and the code of DeepResearch-R1 at https://github.com/Applied-Machine-Learning-Lab/DeepResearch-R1.
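The abstract lists "rule-based outcome reward" as one reward option but does not spell out the rule. A common choice for QA-style RL, assumed here and not confirmed by the source, is normalized exact match, optionally relaxed to token-level F1. This minimal sketch also assumes the agent wraps its final answer in `<answer>...</answer>` tags; that tag format and the names `normalize_answer`, `token_f1`, `extract_answer`, and `outcome_reward` are hypothetical illustrations, not DeepResearch-R1's API.

```python
import re
import string
from collections import Counter
from typing import List, Optional

_PUNCT = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def extract_answer(response: str) -> Optional[str]:
    """Return the text inside <answer>...</answer> (hypothetical output format)."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def outcome_reward(response: str, gold_answers: List[str], use_f1: bool = False) -> float:
    """Rule-based outcome reward: exact match (0/1) or best token F1 over gold answers."""
    answer = extract_answer(response)
    if answer is None or not gold_answers:
        return 0.0  # no parsable answer (or no reference) earns nothing
    if use_f1:
        return max(token_f1(answer, gold) for gold in gold_answers)
    normalized = normalize_answer(answer)
    return 1.0 if any(normalized == normalize_answer(g) for g in gold_answers) else 0.0


print(outcome_reward("...search steps... <answer>The Eiffel Tower</answer>", ["Eiffel Tower"]))  # 1.0
print(outcome_reward("<answer>Paris, France</answer>", ["Paris"], use_f1=True))  # about 0.667
```

Exact match yields a sparse 0/1 signal, while the F1 variant grants partial credit (here "Paris, France" earns 2/3 against "Paris"); a malformed response without answer tags earns zero under either rule.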
Related papers
- AgentIR: Reasoning-Aware Retrieval for Deep Research Agents
Deep Research agents generate explicit natural language reasoning before each search call. Reasoning-Aware Retrieval embeds the agent's reasoning trace alongside its query. DR-Synth generates Deep Research retriever training data from standard QA datasets. AgentIR-4B achieves 68% accuracy with the open-weight agent Tongyi-DeepResearch.
(arXiv, 2026-03-04T18:47:26Z)
- MM-DeepResearch: A Simple and Effective Multimodal Agentic Search Baseline
We aim to develop a multimodal research agent capable of explicit reasoning and planning, multi-tool invocation, and cross-modal information synthesis. We observe three main challenges in developing such agents: (1) scarcity of search-intensive multimodal QA data, (2) lack of effective search trajectories, and (3) the prohibitive cost of training with online search APIs. With the three designs, we develop MM-DeepResearch, a powerful multimodal deep research agent, and extensive results show its superiority across benchmarks.
(arXiv, 2026-03-01T11:13:22Z)
- Tongyi DeepResearch Technical Report
To incentivize autonomous deep research agency, Tongyi DeepResearch is developed through an end-to-end training framework. Tongyi DeepResearch has 30.5 billion total parameters, with only 3.3 billion activated per token. We open-source the model, framework, and complete solutions to empower the community.
(arXiv, 2025-10-28T17:53:02Z)
- DeepWideSearch: Benchmarking Depth and Width in Agentic Information Seeking
DeepWideSearch is the first benchmark designed to evaluate agents' ability to integrate depth and width in information seeking. In DeepWideSearch, agents must process a large volume of data items, each requiring deep reasoning over multi-hop retrieval paths. Experiments demonstrate that even state-of-the-art agents achieve only a 2.39% average success rate.
(arXiv, 2025-10-23T03:28:45Z)
- Fathom-DeepResearch: Unlocking Long Horizon Information Retrieval and Synthesis for SLMs
We introduce Fathom-DeepResearch, an agentic system composed of two specialized models. The first is Fathom-Search-4B, a DeepSearch model optimized for evidence-based investigation through live web search and targeted webpage querying. The second is Fathom-Synthesizer-4B, trained from Qwen3-4B, which converts multi-turn DeepSearch traces into structured, citation-dense DeepResearch Reports.
(arXiv, 2025-09-28T22:58:11Z)
- DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL
We present DeepDive to advance deep search agents. We propose a strategy to automatically synthesize complex, difficult, and hard-to-find questions from open knowledge graphs. We apply end-to-end multi-turn reinforcement learning to enhance LLMs' long-horizon reasoning with deep search.
(arXiv, 2025-09-12T17:52:35Z)
- HierSearch: A Hierarchical Enterprise Deep Search Framework Integrating Local and Web Searches
We propose a hierarchical agentic deep search framework, HierSearch, trained with hierarchical RL. At the low level, a local deep search agent and a Web deep search agent are trained to retrieve evidence from their corresponding domains. At the high level, a planner agent coordinates low-level agents and provides the final answer.
(arXiv, 2025-08-11T15:31:47Z)
- BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent
BrowseComp-Plus is a benchmark derived from BrowseComp, employing a fixed, carefully curated corpus. This benchmark allows comprehensive evaluation and disentangled analysis of deep research agents and retrieval methods.
(arXiv, 2025-08-08T17:55:11Z)
- DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments
We introduce DeepResearcher, the first comprehensive framework for end-to-end training of LLM-based deep research agents. Unlike RAG-based approaches that assume all necessary information exists within a fixed corpus, our method trains agents to navigate the noisy, unstructured, and dynamic nature of the open web. Extensive experiments on open-domain research tasks demonstrate that DeepResearcher achieves substantial improvements of up to 28.9 points over prompt engineering-based baselines.
(arXiv, 2025-04-04T04:41:28Z)
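The AgentIR entry describes reasoning-aware retrieval: the agent's reasoning trace is embedded together with its search query, so the retriever sees why the agent is searching and not just the terse query. The following is a minimal sketch of that idea under two stated assumptions: reasoning and query are simply concatenated, and a toy bag-of-words vector stands in for AgentIR's learned encoder (AgentIR-4B). The function names and example documents are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (token counts); a stand-in for a learned encoder."""
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u)  # Counter yields 0 for missing tokens
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query: str, docs: Dict[str, str], reasoning: str = "") -> List[str]:
    """Rank document ids; if reasoning is given, embed it jointly with the query."""
    text = f"{reasoning} {query}".strip() if reasoning else query
    q = embed(text)
    return sorted(docs, key=lambda d: cosine(q, embed(docs[d])), reverse=True)


docs = {
    "animal": "jaguar big cat top running speed rainforest",
    "car": "jaguar car engine top speed",
}
print(retrieve("jaguar speed", docs))                              # ['car', 'animal']
print(retrieve("jaguar speed", docs, reasoning="animal big cat"))  # ['animal', 'car']
```

With the query alone, the shorter car document wins on word overlap; adding the reasoning "animal big cat" moves the animal document to the top. A learned encoder would capture this shift semantically rather than through shared words.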
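The Tongyi DeepResearch entry quotes a total and an activated parameter count for its mixture-of-experts model; their ratio is the fraction of weights a single token passes through. This is plain arithmetic on the two quoted figures. Reading the "-30B-A3B" suffix of Tongyi-DeepResearch-30B-A3B (the model that produced DeepResearch-9K's trajectories) as the same total/active split is an inference, not something the source states.

```latex
\[
\frac{N_{\text{active}}}{N_{\text{total}}}
  = \frac{3.3 \times 10^{9}}{30.5 \times 10^{9}}
  \approx 0.108
\]
```

So roughly 10.8% of the parameters are used per token. To a first approximation, per-token compute scales with the active count, which is why such a model can be much cheaper to run than a dense model of the same total size, although all 30.5 billion parameters must still be held in memory.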