Fugu-MT paper translation (abstract): DataClaw: A Process-Oriented Agent Benchmark for Exploratory Real-World Data Analysis

Paper summary: DataClaw: A Process-Oriented Agent Benchmark for Exploratory Real-World Data Analysis

arxiv url: http://arxiv.org/abs/2605.02503v1
Date: Mon, 04 May 2026 11:57:09 GMT
Status: translation complete
Last updated in system: 2026-05-05 20:33:50.271077
Title: DataClaw: A Process-Oriented Agent Benchmark for Exploratory Real-World Data Analysis
Title (reference translation): DataClaw: A Process-Oriented Agent Benchmark for Exploratory Real-World Data Analysis
Authors: Qiaohong Zhang, Weihao Ye, Jialong Chen, Yi Luo, BoYuan Li, Bowen Deng, Zibin Zheng, Jianhao Lin, Wei-Shi Zheng, Chuan Chen
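Abstract summary: DataClaw is a process-oriented benchmark for exploratory real-world data analysis. It contains approximately 2.06 million real-world records across enterprise, industry, and policy domains. DataClaw measures how far an agent progresses and where its reasoning breaks down.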
Reference score (independently computed attention score): 76.98578575566184
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Evaluating autonomous data analysis agents requires testing their ability to perform exploratory analysis in underexplored data environments. However, many existing benchmarks emphasize final answer accuracy in prior-guided data settings and provide limited support for reasoning process evaluation. We introduce DataClaw, a process-oriented benchmark for exploratory real-world data analysis. DataClaw contains approximately 2.06 million real-world records across enterprise, industry and policy domains, with native data noise preserved. It further includes 492 cross-domain tasks derived from think-tank consulting scenarios, each annotated with intermediate milestones for process-level evaluation. These annotations allow DataClaw to measure how far an agent progresses and where its reasoning breaks down. Experiments with eight advanced LLMs show that current agents remain far from reliable in this setting, with seven models achieving below 50% overall accuracy. Process analysis further reveals partial progress hidden behind wrong answers and distinct exploration strategies across models. Overall, DataClaw provides a less data-constrained diagnostic testbed for probing the capability boundaries of autonomous data-analysis agents.
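Abstract (reference translation): Evaluating autonomous data analysis agents requires testing their ability to carry out exploratory analysis in underexplored data environments. Many existing benchmarks, however, emphasize final-answer accuracy in prior-guided data settings and offer only limited support for evaluating the reasoning process. We introduce DataClaw, a process-oriented benchmark for exploratory real-world data analysis. DataClaw contains approximately 2.06 million real-world records spanning enterprise, industry, and policy domains, with native data noise preserved. It also includes 492 cross-domain tasks derived from think-tank consulting scenarios, each annotated with intermediate milestones for process-level evaluation. These annotations let DataClaw measure how far an agent progresses and where its reasoning breaks down. Experiments with eight advanced LLMs show that current agents remain far from reliable in this setting, with seven models achieving below 50% overall accuracy. Process analysis further reveals partial progress hidden behind wrong answers, as well as distinct exploration strategies across models. Overall, DataClaw provides a less data-constrained diagnostic testbed for probing the capability boundaries of autonomous data analysis agents.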

Related papers