Fugu-MT Paper Translation (Abstract): ELT-Bench-Verified: Benchmark Quality Issues Underestimate AI Agent Capabilities

Paper overview: ELT-Bench-Verified: Benchmark Quality Issues Underestimate AI Agent Capabilities

arxiv url: http://arxiv.org/abs/2603.29399v2
Date: Thu, 02 Apr 2026 17:59:44 GMT
Status: Translation complete
Last updated in system: 2026-04-03 14:21:09.263302
Title: ELT-Bench-Verified: Benchmark Quality Issues Underestimate AI Agent Capabilities
Authors: Christopher Zanoli, Andrea Giovannini, Tengjun Jin, Ana Klimovic, Yotam Perlitz,
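Abstract summary: Constructing Extract-Load-Transform (ELT) pipelines is a labor-intensive data engineering task and a high-impact target for AI automation. On ELT-Bench, the first benchmark for end-to-end ELT pipeline construction, AI agents initially showed low success rates. We revisit these results and identify two factors causing a substantial underestimation of agent capabilities.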
Reference score (attention level, computed independently by this site): 4.5258165293324515
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Constructing Extract-Load-Transform (ELT) pipelines is a labor-intensive data engineering task and a high-impact target for AI automation. On ELT-Bench, the first benchmark for end-to-end ELT pipeline construction, AI agents initially showed low success rates, suggesting they lacked practical utility. We revisit these results and identify two factors causing a substantial underestimation of agent capabilities. First, re-evaluating ELT-Bench with upgraded large language models reveals that the extraction and loading stage is largely solved, while transformation performance improves significantly. Second, we develop an Auditor-Corrector methodology that combines scalable LLM-driven root-cause analysis with rigorous human validation (inter-annotator agreement Fleiss' kappa = 0.85) to audit benchmark quality. Applying this to ELT-Bench uncovers that most failed transformation tasks contain benchmark-attributable errors -- including rigid evaluation scripts, ambiguous specifications, and incorrect ground truth -- that penalize correct agent outputs. Based on these findings, we construct ELT-Bench-Verified, a revised benchmark with refined evaluation logic and corrected ground truth. Re-evaluating on this version yields significant improvement attributable entirely to benchmark correction. Our results show that both rapid model improvement and benchmark quality issues contributed to underestimating agent capabilities. More broadly, our findings echo observations of pervasive annotation errors in text-to-SQL benchmarks, suggesting quality issues are systemic in data engineering evaluation. Systematic quality auditing should be standard practice for complex agentic tasks. We release ELT-Bench-Verified to provide a more reliable foundation for progress in AI-driven data engineering automation.
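The abstract reports inter-annotator agreement as Fleiss' kappa = 0.85. As a rough illustration of what that statistic measures, the following is a minimal sketch of the standard Fleiss' kappa formula. It is not the authors' code: the function name `fleiss_kappa`, the label categories, and the ratings table are all hypothetical, and the paper's actual annotation setup may differ.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a ratings table.

    counts[i][j] is the number of annotators who assigned item i to
    category j. Every item must be rated by the same number of annotators.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    if n_items == 0 or n_raters < 2:
        raise ValueError("need at least one item and two annotators")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")

    # Observed agreement: for each item, the fraction of annotator pairs
    # that agree, averaged over all items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    total = n_items * n_raters
    proportions = [
        sum(row[j] for row in counts) / total for j in range(n_categories)
    ]
    p_chance = sum(p * p for p in proportions)

    if p_chance == 1.0:
        # All ratings fall in one category; kappa is undefined here.
        raise ValueError("kappa is undefined when only one category is used")
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical example: 4 failed tasks, 3 annotators, 2 labels
# (column 0 = "benchmark error", column 1 = "agent error").
ratings = [
    [3, 0],
    [0, 3],
    [3, 0],
    [0, 3],
]
print(fleiss_kappa(ratings))  # 1.0 (all annotators agree on every task)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, so 0.85 indicates that the human validators largely agreed with one another.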


Related papers