Fugu-MT Paper Translation (Abstract): AIDABench: AI Data Analytics Benchmark

Paper overview: AIDABench: AI Data Analytics Benchmark

arxiv url: http://arxiv.org/abs/2603.15636v1
Date: Fri, 27 Feb 2026 08:58:05 GMT
Status: Translation complete
Last updated in system: 2026-03-23 08:17:42.328988
Title: AIDABench: AI Data Analytics Benchmark
Title (reference translation): AIDABench: AI Data Analytics Benchmark
Authors: Yibo Yang, Fei Lei, Yixuan Sun, Yantao Zeng, Chengguang Lv, Jiancao Hong, Jiaojiao Tian, Tianyu Qiu, Xin Wang, Yanbing Chen, Yanjie Li, Zheng Pan, Xiaochen Zhou, Guanzhou Chen, Haoran Lv, Yuning Xu, Yue Ou, Haodong Liu, Shiqi He, Anya Jia, Yulei Xin, Huan Wu, Liang Liu, Jiaye Ge, Jianxin Dong, Dahua Lin, Wenxiu Sun
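Abstract summary: AIDABench is a benchmark for end-to-end evaluation of AI systems on complex data analytics tasks. It contains 600+ diverse document analysis tasks spanning three core capability dimensions: question answering, data visualization, and file generation. The authors evaluate 11 state-of-the-art models on AIDABench, covering both proprietary (e.g., Claude Sonnet 4.5, Gemini 3 Pro Preview) and open-source (e.g., Qwen3-Max-2026-01-23-Thinking) families.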
Reference score (attention score computed independently by this site): 62.45908988324612
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As AI-driven document understanding and processing tools become increasingly prevalent in real-world applications, the need for rigorous evaluation standards has grown increasingly urgent. Existing benchmarks and evaluations often focus on isolated capabilities or simplified scenarios, failing to capture the end-to-end task effectiveness required in practical settings. To address this gap, we introduce AIDABench, a comprehensive benchmark for evaluating AI systems on complex data analytics tasks in an end-to-end manner. AIDABench encompasses 600+ diverse document analysis tasks across three core capability dimensions: question answering, data visualization, and file generation. These tasks are grounded in realistic scenarios involving heterogeneous data types, including spreadsheets, databases, financial reports, and operational records, and reflect analytical demands across diverse industries and job functions. Notably, the tasks in AIDABench are sufficiently challenging that even human experts require 1-2 hours per question when assisted by AI tools, underscoring the benchmark's difficulty and real-world complexity. We evaluate 11 state-of-the-art models on AIDABench, spanning both proprietary (e.g., Claude Sonnet 4.5, Gemini 3 Pro Preview) and open-source (e.g., Qwen3-Max-2026-01-23-Thinking) families. Our results reveal that complex, real-world data analytics tasks remain a significant challenge for current AI systems, with the best-performing model achieving only 59.43% pass-at-1. We provide a detailed analysis of failure modes across each capability dimension and identify key challenges for future research. AIDABench offers a principled reference for enterprise procurement, tool selection, and model optimization, and is publicly available at https://github.com/MichaelYang-lyx/AIDABench.
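Abstract (reference translation): As AI-based document understanding and processing tools become increasingly widespread in real-world applications, the need for rigorous evaluation standards is growing. Existing benchmarks and evaluations often focus on isolated capabilities or simplified scenarios and do not capture the end-to-end task effectiveness required in practical settings. To address this gap, we introduce AIDABench, a comprehensive benchmark for end-to-end evaluation of AI systems on complex data analytics tasks. AIDABench contains 600+ diverse document analysis tasks spanning three core capability dimensions: question answering, data visualization, and file generation. These tasks are based on realistic scenarios involving heterogeneous data types, including spreadsheets, databases, financial reports, and operational records, and reflect analytical demands across a range of industries and job functions. Notably, the tasks in AIDABench are difficult enough that even human experts need 1-2 hours per question when assisted by AI tools. We evaluate 11 state-of-the-art models on AIDABench, covering both proprietary (e.g., Claude Sonnet 4.5, Gemini 3 Pro Preview) and open-source (e.g., Qwen3-Max-2026-01-23-Thinking) families. Our results show that complex, real-world data analytics tasks remain a significant challenge for current AI systems, with the best-performing model reaching only 59.43% pass-at-1. We analyze failure modes in each capability dimension in detail and identify key challenges for future research. AIDABench provides a principled reference for enterprise procurement, tool selection, and model optimization, and is publicly available at https://github.com/MichaelYang-lyx/AIDABench.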

Related papers