Related papers: DeepResearch-9K: A Challenging Benchmark Dataset of Deep-Research Agent

DeepResearch-9K: A Challenging Benchmark Dataset of Deep-Research Agent

URL: http://arxiv.org/abs/2603.01152v1
Date: Sun, 01 Mar 2026 15:36:10 GMT
Title: DeepResearch-9K: A Challenging Benchmark Dataset of Deep-Research Agent
Authors: Tongzhou Wu, Yuhao Wang, Xinyu Ma, Xiuqiang He, Shuaiqiang Wang, Dawei Yin, Xiangyu Zhao,
Abstract summary: DeepResearch-9K is a large-scale, challenging dataset for deep-research scenarios.<n>DeepResearch-R1 is an open-source training framework for deep-research agents.
Score: 63.52637950356965
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Deep-research agents are capable of executing multi-step web exploration, targeted retrieval, and sophisticated question answering. Despite their powerful capabilities, deep-research agents face two critical bottlenecks: (1) the lack of large-scale, challenging datasets with real-world difficulty, and (2) the absence of accessible, open-source frameworks for data synthesis and agent training. To bridge these gaps, we first construct DeepResearch-9K, a large-scale challenging dataset specifically designed for deep-research scenarios built from open-source multi-hop question-answering (QA) datasets via a low-cost autonomous pipeline. Notably, it consists of (1) 9000 questions spanning three difficulty levels from L1 to L3 (2) high-quality search trajectories with reasoning chains from Tongyi-DeepResearch-30B-A3B, a state-of-the-art deep-research agent, and (3) verifiable answers. Furthermore, we develop an open-source training framework DeepResearch-R1 that supports (1) multi-turn web interactions, (2) different reinforcement learning (RL) approaches, and (3) different reward models such as rule-based outcome reward and LLM-as-judge feedback. Finally, empirical results demonstrate that agents trained on DeepResearch-9K under our DeepResearch-R1 achieve state-of-the-art results on challenging deep-research benchmarks. We release the DeepResearch-9K dataset on https://huggingface.co/datasets/artillerywu/DeepResearch-9K and the code of DeepResearch-R1 on https://github.com/Applied-Machine-Learning-Lab/DeepResearch-R1.

Related papers

AgentIR: Reasoning-Aware Retrieval for Deep Research Agents [76.29382561831105]
Deep Research agents generate explicit natural language reasoning before each search call.<n> Reasoning-Aware Retrieval embeds the agent's reasoning trace alongside its query.<n>DR- Synth generates Deep Research retriever training data from standard QA datasets.<n>AgentIR-4B achieves 68% accuracy with the open-weight agent Tongyi-DeepResearch.
arXiv Detail & Related papers (2026-03-04T18:47:26Z)
MM-DeepResearch: A Simple and Effective Multimodal Agentic Search Baseline [26.19213349415094]
We aim to develop a multimodal research agent capable of explicit reasoning and planning, multi-tool invocation, and cross-modal information synthesis.<n>We observe three main challenges in developing such agents: (1) scarcity of search-intensive multimodal QA data, (2) lack of effective search trajectories, and (3) prohibitive cost of training with online search APIs.<n>With the three designs, we develop MM-DeepResearch, a powerful multimodal deep research agent, and extensive results shows its superiority across benchmarks.
arXiv Detail & Related papers (2026-03-01T11:13:22Z)
Tongyi DeepResearch Technical Report [111.78446943571782]
To incentivize autonomous deep research agency, Tongyi DeepResearch is developed through an end-to-end training framework.<n>Tongyi DeepResearch achieves 30.5 billion total parameters, with only 3.3 billion activated per token.<n>We open-source the model, framework, and complete solutions to empower the community.
arXiv Detail & Related papers (2025-10-28T17:53:02Z)
DeepWideSearch: Benchmarking Depth and Width in Agentic Information Seeking [42.413184411326164]
DeepWideSearch is the first benchmark designed to evaluate agents to integrate depth and width in information seeking.<n>In DeepWideSearch, agents must process a large volume of data, each requiring deep reasoning over multi-hop retrieval paths.<n>Experiments demonstrate that even state-of-the-art agents achieve only 2.39% average success rate.
arXiv Detail & Related papers (2025-10-23T03:28:45Z)
Fathom-DeepResearch: Unlocking Long Horizon Information Retrieval and Synthesis for SLMs [7.3517692707289415]
We introduce Fathom-DeepResearch, an agentic system composed of two specialized models.<n>The first is Fathom-Search-4B, a DeepSearch model optimized for evidence-based investigation through live web search and targeted webpage querying.<n>The second is Fathom- Synthesizer-4B, trained from Qwen3-4B, which converts multi-turn DeepSearch traces into structured, citation-dense DeepResearch Reports.
arXiv Detail & Related papers (2025-09-28T22:58:11Z)
DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL [60.47878242100153]
We present DeepDive to advance deep search agents.<n>We propose a strategy to automatically synthesize complex, difficult, and hard-to-find questions from open knowledge graphs.<n>We apply end-to-end multi-turn reinforcement learning to enhance LLMs' long-horizon reasoning with deep search.
arXiv Detail & Related papers (2025-09-12T17:52:35Z)
HierSearch: A Hierarchical Enterprise Deep Search Framework Integrating Local and Web Searches [54.65565885083031]
We propose a hierarchical agentic deep search framework, HierSearch, trained with hierarchical RL.<n>At the low level, a local deep search agent and a Web deep search agent are trained to retrieve evidence from their corresponding domains.<n>At the high level, a planner agent coordinates low-level agents and provides the final answer.
arXiv Detail & Related papers (2025-08-11T15:31:47Z)
BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent [74.10138164281618]
BrowseComp-Plus is a benchmark derived from BrowseComp, employing a fixed, carefully curated corpus.<n>This benchmark allows comprehensive evaluation and disentangled analysis of deep research agents and retrieval methods.
arXiv Detail & Related papers (2025-08-08T17:55:11Z)
DeepResearcher: Scaling Deep Research via Reinforcement Learning in Real-world Environments [20.498100965239818]
We introduce DeepResearcher, the first comprehensive framework for end-to-end training of LLM-based deep research agents.<n>Unlike RAG-based approaches that assume all necessary information exists within a fixed corpus, our method trains agents to navigate the noisy, unstructured, and dynamic nature of the open web.<n>Extensive experiments on open-domain research tasks demonstrate that DeepResearcher achieves substantial improvements of up to 28.9 points over prompt engineering-based baselines.
arXiv Detail & Related papers (2025-04-04T04:41:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.