Related papers: Hunt Instead of Wait: Evaluating Deep Data Research on Large Language Models

URL: http://arxiv.org/abs/2602.02039v1
Date: Mon, 02 Feb 2026 12:36:57 GMT
Title: Hunt Instead of Wait: Evaluating Deep Data Research on Large Language Models
Authors: Wei Liu, Peijie Yu, Michele Orini, Yali Du, Yulan He
Abstract summary: Agency expected of Agentic Large Language Models goes beyond answering correctly, requiring autonomy to set goals and decide what to explore. We term this investigatory intelligence, distinguishing it from executional intelligence, which merely completes assigned tasks. To address this, we introduce Deep Data Research (DDR), an open-ended task where LLMs autonomously extract key insights from databases, and DDR-Bench, a large-scale, checklist-based benchmark that enables verifiable evaluation.
Score: 19.85460397012729
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The agency expected of Agentic Large Language Models goes beyond answering correctly, requiring autonomy to set goals and decide what to explore. We term this investigatory intelligence, distinguishing it from executional intelligence, which merely completes assigned tasks. Data Science provides a natural testbed, as real-world analysis starts from raw data rather than explicit queries, yet few benchmarks focus on it. To address this, we introduce Deep Data Research (DDR), an open-ended task where LLMs autonomously extract key insights from databases, and DDR-Bench, a large-scale, checklist-based benchmark that enables verifiable evaluation. Results show that while frontier models display emerging agency, long-horizon exploration remains challenging. Our analysis highlights that effective investigatory intelligence depends not only on agent scaffolding or merely scaling, but also on intrinsic strategies of agentic models.

Related papers

SpotAgent: Grounding Visual Geo-localization in Large Vision-Language Models through Agentic Reasoning [31.665287327579026]
SpotAgent is a framework that formalizes geo-localization into an agentic reasoning process. It actively explores and verifies visual cues by leveraging external tools (e.g., web search, maps) through a ReAct diagram. It achieves state-of-the-art performance, effectively mitigating hallucinations while delivering precise and verifiable geo-localization.
arXiv Detail & Related papers (2026-02-10T06:57:12Z)
GISA: A Benchmark for General Information-Seeking Assistant [102.30831921333755]
GISA is a benchmark for General Information-Seeking Assistants comprising 373 human-crafted queries. It integrates both deep reasoning and broad information aggregation within unified tasks, and includes a live subset with periodically updated answers to resist memorization. Experiments on mainstream LLMs and commercial search products reveal that even the best-performing model achieves only a 19.30% exact match score.
arXiv Detail & Related papers (2026-02-09T11:44:15Z)
DeepSearchQA: Bridging the Comprehensiveness Gap for Deep Research Agents [10.197402632091551]
DeepSearchQA is a 900-prompt benchmark for evaluating agents on difficult multi-step information-seeking tasks. This dataset is designed to evaluate an agent's ability to execute complex search plans to generate exhaustive answer lists.
arXiv Detail & Related papers (2026-01-28T19:20:47Z)
Step-DeepResearch Technical Report [90.50586290399683]
We introduce Step-DeepResearch, a cost-effective, end-to-end agent. We propose a Data Synthesis Strategy Based on Atomic Capabilities to reinforce planning and report writing. To bridge the evaluation gap in the Chinese domain, we establish ADR-Bench for realistic deep research scenarios.
arXiv Detail & Related papers (2025-12-23T16:32:27Z)
InsightEval: An Expert-Curated Benchmark for Assessing Insight Discovery in LLM-Driven Data Agents [31.43134407708759]
We develop a data-curation pipeline to construct a new dataset named InsightEval. We highlight prevailing challenges in automated insight discovery and raise some key findings to guide future research.
arXiv Detail & Related papers (2025-11-28T05:19:24Z)
PRInTS: Reward Modeling for Long-Horizon Information Seeking [74.14496236655911]
We introduce PRInTS, a generative PRM trained with dual capabilities. We show that PRInTS enhances information-seeking abilities of open-source models as well as specialized agents.
arXiv Detail & Related papers (2025-11-24T17:09:43Z)
SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents [93.26456498576181]
This paper focuses on the development of native Autonomous Single-Agent models for Deep Research. Our best variant SFR-DR-20B achieves up to 28.7% on Humanity's Last Exam benchmark.
arXiv Detail & Related papers (2025-09-08T02:07:09Z)
DatasetResearch: Benchmarking Agent Systems for Demand-Driven Dataset Discovery [26.388978716803464]
Can AI agents transcend conventional search to systematically discover any dataset that meets specific user requirements? Our benchmark and comprehensive analysis provide the foundation for the next generation of self-improving AI systems.
arXiv Detail & Related papers (2025-08-09T12:15:08Z)
From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents [96.65646344634524]
Large Language Models (LLMs), endowed with reasoning and agentic capabilities, are ushering in a new paradigm termed Agentic Deep Research. We trace the evolution from static web search to interactive, agent-based systems that plan, explore, and learn. We demonstrate that Agentic Deep Research not only significantly outperforms existing approaches, but is also poised to become the dominant paradigm for future information seeking.
arXiv Detail & Related papers (2025-06-23T17:27:19Z)
BLADE: Benchmarking Language Model Agents for Data-Driven Science [21.682416167339635]
LM-based agents equipped with planning, memory, and code execution capabilities have the potential to support data-driven science. We present BLADE, a benchmark to automatically evaluate agents' multifaceted approaches to open-ended research questions.
arXiv Detail & Related papers (2024-08-19T02:59:35Z)
DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery. Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering. Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)

This list is automatically generated from the titles and abstracts of the papers on this site.