Fugu-MT Paper Translation (Abstract): Automated Benchmark Auditing for AI Agents and Large Language Models

Paper overview: Automated Benchmark Auditing for AI Agents and Large Language Models

arxiv url: http://arxiv.org/abs/2605.26079v2
Date: Tue, 26 May 2026 06:39:24 GMT
Status: Translation complete
Last updated in system: 2026-05-27 17:51:41.180846
Title: Automated Benchmark Auditing for AI Agents and Large Language Models
Authors: Junlin Wang, Federico Bianchi, Shang Zhu, Fan Nie, Yongchan Kwon, Bhuwan Dhingra, James Zou
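Abstract summary: Auto Benchmark Audit (ABA) is an agentic framework that systematically audits individual benchmark tasks. The authors run ABA on a collection of frontier LLM benchmarks and previous NeurIPS publications, totaling 168 benchmarks across nine domains. ABA identifies critical issues, including ambiguous task design, execution environment conflicts, and incorrect ground truths, in over 25.7% of the evaluated tasks.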
Reference score (independently computed attention score): 46.03841647776303
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Modern AI benchmarks operate at a complexity that outpaces traditional verification methods. Tasks authored by domain experts often contain implicit assumptions, incomplete environment specifications, and brittle evaluation logic that human annotation cannot reliably catch. We introduce Auto Benchmark Audit (ABA), an agentic framework that systematically audits individual benchmark tasks, uncovering issues such as hidden environment dependencies, specification gaps, and limited grading logic. We run ABA on a collection of frontier LLM benchmarks and previous NeurIPS publications, totaling 168 benchmarks across nine domains. Across this corpus, ABA identifies critical issues, including ambiguous task design, execution environment conflicts, and incorrect ground truths, in over 25.7% of the evaluated tasks. The precision of these automated audits is validated by expert review and independent third-party reports such as upstream PRs. Crucially, we demonstrate that these problematic tasks severely distort capability assessments for agents and LLMs: filtering out the flagged tasks shifts model rankings and increases average performance on SWE-bench Verified and Terminal-Bench 2 by 9.9% and 9.6%, respectively. We release the agentic tool and all task annotations to support the future development of frontier benchmarks.
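The effect described in the abstract, where removing flagged tasks changes both average scores and model rankings, can be illustrated with a minimal sketch. Everything below is hypothetical: the model names, task IDs, pass/fail outcomes, and the `flagged` set are made-up toy values, and the helpers `pass_rate` and `ranking` are illustrative names, not part of the ABA tool or its released annotations.

```python
# Toy per-task outcomes (hypothetical): model -> {task_id: passed?}
results = {
    "model_a": {"t1": True, "t2": False, "t3": True, "t4": True},
    "model_b": {"t1": True, "t2": True, "t3": False, "t4": False},
}

# Hypothetical audit output: task IDs judged to be problematic.
flagged = {"t3", "t4"}


def pass_rate(outcomes, exclude=frozenset()):
    """Fraction of tasks passed, ignoring any task ID in `exclude`."""
    kept = [passed for task, passed in outcomes.items() if task not in exclude]
    if not kept:
        raise ValueError("no tasks left after filtering")
    return sum(kept) / len(kept)


def ranking(all_results, exclude=frozenset()):
    """Model names sorted from highest to lowest pass rate."""
    return sorted(
        all_results,
        key=lambda model: pass_rate(all_results[model], exclude),
        reverse=True,
    )


if __name__ == "__main__":
    for label, exclude in (("all tasks", frozenset()), ("flagged removed", flagged)):
        rates = {m: pass_rate(r, exclude) for m, r in results.items()}
        average = sum(rates.values()) / len(rates)
        print(label, rates, "average:", average, "ranking:", ranking(results, exclude))
```

In this toy data the two models swap places once the flagged tasks are excluded, and the average pass rate rises. The sketch only shows the mechanics of such a comparison; it does not reproduce the paper's figures.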
