Fugu-MT Paper Translation (Summary): Hunt Instead of Wait: Evaluating Deep Data Research on Large Language Models

Paper summary: Hunt Instead of Wait: Evaluating Deep Data Research on Large Language Models

arxiv url: http://arxiv.org/abs/2602.02039v1
Date: Mon, 02 Feb 2026 12:36:57 GMT
Status: Translation complete
Last updated in system: 2026-02-03 19:28:34.146206
Title: Hunt Instead of Wait: Evaluating Deep Data Research on Large Language Models
Authors: Wei Liu, Peijie Yu, Michele Orini, Yali Du, Yulan He
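Abstract summary: The agency expected of Agentic Large Language Models goes beyond answering correctly, requiring autonomy to set goals and decide what to explore. We term this investigatory intelligence, distinguishing it from executional intelligence, which merely completes assigned tasks. To address this, we introduce Deep Data Research (DDR), an open-ended task where LLMs autonomously extract key insights from databases, and DDR-Bench, a large-scale, checklist-based benchmark that enables verifiable evaluation.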
Reference score (independently computed attention score): 19.85460397012729
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The agency expected of Agentic Large Language Models goes beyond answering correctly, requiring autonomy to set goals and decide what to explore. We term this investigatory intelligence, distinguishing it from executional intelligence, which merely completes assigned tasks. Data Science provides a natural testbed, as real-world analysis starts from raw data rather than explicit queries, yet few benchmarks focus on it. To address this, we introduce Deep Data Research (DDR), an open-ended task where LLMs autonomously extract key insights from databases, and DDR-Bench, a large-scale, checklist-based benchmark that enables verifiable evaluation. Results show that while frontier models display emerging agency, long-horizon exploration remains challenging. Our analysis highlights that effective investigatory intelligence depends not only on agent scaffolding or merely scaling, but also on intrinsic strategies of agentic models.


Related papers