Fugu-MT Paper Translation (Abstract): BenchGuard: Who Guards the Benchmarks? Automated Auditing of LLM Agent Benchmarks

Paper abstract: BenchGuard: Who Guards the Benchmarks? Automated Auditing of LLM Agent Benchmarks

arxiv url: http://arxiv.org/abs/2604.24955v1
Date: Mon, 27 Apr 2026 19:51:25 GMT
Status: Translation complete
System update date: 2026-04-29 16:49:17.579027
Title: BenchGuard: Who Guards the Benchmarks? Automated Auditing of LLM Agent Benchmarks
Title (reference translation): BenchGuard: Who Guards the Benchmarks? Automated Auditing of LLM Agent Benchmarks
Authors: Xinming Tu, Tianze Wang, Yingzhou Lu, Kexin Huang, Yuanhao Qu, Sara Mostafavi
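Abstract summary: BenchGuard is the first automated auditing framework for task-oriented, execution-based agent benchmarks. It identified 12 author-confirmed issues in ScienceAgentBench and exactly matched 83.3% of the expert-identified issues on the BIXBench Verified-50 subset. A full audit of 50 complex bioinformatics tasks costs under USD 15, making automated benchmark auditing a practical and valuable complement to human review.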
Reference score (independently computed attention score): 26.58983143152204
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As benchmarks grow in complexity, many apparent agent failures are not failures of the agent at all - they are failures of the benchmark itself: broken specifications, implicit assumptions, and rigid evaluation scripts that penalize valid alternative approaches. We propose employing frontier LLMs as systematic auditors of evaluation infrastructure, and realize this vision through BenchGuard, the first automated auditing framework for task-oriented, execution-based agent benchmarks. BenchGuard cross-verifies all benchmark artifacts via structured LLM protocols, optionally incorporating agent solutions or execution traces as additional diagnostic evidence. Deployed on two prominent scientific benchmarks, BenchGuard identified 12 author-confirmed issues in ScienceAgentBench - including fatal errors rendering tasks unsolvable - and exactly matched 83.3% of expert-identified issues on the BIXBench Verified-50 subset, catching defects that prior human review missed entirely. A full audit of 50 complex bioinformatics tasks costs under USD 15, making automated benchmark auditing a practical and valuable complement to human review. These findings point toward AI-assisted benchmark development, where frontier models serve not only as subjects of evaluation but as active participants in validating the evaluation infrastructure itself.
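Abstract (reference translation): As benchmarks grow more complex, many apparent agent failures are not failures of the agent but failures of the benchmark itself: broken specifications, implicit assumptions, and rigid evaluation scripts that penalize valid alternative approaches. We propose using frontier LLMs as systematic auditors of evaluation infrastructure and realize this vision through BenchGuard, the first automated auditing framework for task-oriented, execution-based agent benchmarks. BenchGuard cross-verifies all benchmark artifacts through structured LLM protocols, optionally incorporating agent solutions or execution traces as additional diagnostic evidence. Deployed on two prominent scientific benchmarks, BenchGuard identified 12 author-confirmed issues in ScienceAgentBench, including fatal errors that render tasks unsolvable, and exactly matched 83.3% of the expert-identified issues on the BIXBench Verified-50 subset, catching defects that prior human review had missed entirely. A full audit of 50 complex bioinformatics tasks costs under USD 15, making automated benchmark auditing a practical and valuable complement to human review. These findings point toward AI-assisted benchmark development, in which frontier models serve not only as subjects of evaluation but also as active participants in validating the evaluation infrastructure itself.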

Related papers