Fugu-MT Paper Translation (Abstract): Efficient Benchmarking of AI Agents

Paper overview: Efficient Benchmarking of AI Agents

arxiv url: http://arxiv.org/abs/2603.23749v1
Date: Tue, 24 Mar 2026 22:17:11 GMT
Status: Translation complete
Last updated in system: 2026-03-26 21:06:11.038879
Title: Efficient Benchmarking of AI Agents
Title (reference translation): Efficient Benchmarking of AI Agents
Authors: Franck Ndzomga
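Abstract summary: We study whether small task subsets can preserve agent rankings at substantially lower cost. We find that absolute score prediction degrades under scaffold-driven distribution shift. We propose an optimization-free protocol that evaluates new agents only on tasks with intermediate historical pass rates.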
Reference score (independently computed attention metric): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Evaluating AI agents on comprehensive benchmarks is expensive because each evaluation requires interactive rollouts with tool use and multi-step reasoning. We study whether small task subsets can preserve agent rankings at substantially lower cost. Unlike static language model benchmarks, agent evaluation is subject to scaffold-driven distribution shift, since performance depends on the framework wrapping the underlying model. Across eight benchmarks, 33 agent scaffolds, and 70+ model configurations, we find that absolute score prediction degrades under this shift, while rank-order prediction remains stable. Exploiting this asymmetry, we propose a simple optimization-free protocol: evaluate new agents only on tasks with intermediate historical pass rates (30-70%). This mid-range difficulty filter, motivated by Item Response Theory, reduces the number of evaluation tasks by 44-70% while maintaining high rank fidelity under scaffold and temporal shifts. It provides more reliable rankings than random sampling, which exhibits high variance across seeds, and outperforms greedy task selection under distribution shift. These results suggest that reliable leaderboard ranking does not require full-benchmark evaluation.
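Abstract (reference translation): Evaluating AI agents on comprehensive benchmarks is costly because each evaluation requires interactive rollouts with tool use and multi-step reasoning. We examine whether small task subsets can preserve agent rankings at substantially lower cost. Unlike static language model benchmarks, agent evaluation is subject to scaffold-driven distribution shift, because performance depends on the framework wrapping the underlying model. Across eight benchmarks, 33 agent scaffolds, and more than 70 model configurations, absolute score prediction degrades under this shift, whereas rank-order prediction remains stable. Exploiting this asymmetry, we propose a simple, optimization-free protocol that evaluates new agents only on tasks with intermediate historical pass rates (30-70%). This mid-range difficulty filter, motivated by Item Response Theory, cuts the number of evaluation tasks by 44-70% while maintaining high rank fidelity under scaffold and temporal shifts. It yields more reliable rankings than random sampling, which shows high variance across seeds, and it outperforms greedy task selection under distribution shift. These results suggest that reliable leaderboard ranking does not require full-benchmark evaluation.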
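The mid-range difficulty filter described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy data, and the choice of Kendall's tau (tau-a, without tie correction) as the rank-fidelity measure are all assumptions made for illustration. Only the 30-70% pass-rate band comes from the abstract.

```python
from typing import Dict, List

# Hypothetical history: task id -> pass/fail (1/0) outcomes of previously
# evaluated agents. All values below are toy data, not results from the paper.
history: Dict[str, List[int]] = {
    "t1": [1, 1, 1, 1],  # pass rate 1.00, too easy
    "t2": [1, 1, 0, 0],  # pass rate 0.50
    "t3": [1, 0, 0, 0],  # pass rate 0.25, too hard
    "t4": [1, 1, 1, 0],  # pass rate 0.75, too easy
    "t5": [0, 1, 1, 0],  # pass rate 0.50
    "t6": [0, 0, 0, 0],  # pass rate 0.00, too hard
    "t7": [1, 0, 1, 0],  # pass rate 0.50
}


def mid_difficulty_tasks(
    history: Dict[str, List[int]], low: float = 0.30, high: float = 0.70
) -> List[str]:
    """Keep tasks whose historical pass rate lies in [low, high]."""
    selected = []
    for task_id, outcomes in history.items():
        if not outcomes:
            continue  # no history, so no pass rate to filter on
        pass_rate = sum(outcomes) / len(outcomes)
        if low <= pass_rate <= high:
            selected.append(task_id)
    return sorted(selected)


def mean_score(results: Dict[str, int], tasks: List[str]) -> float:
    """Fraction of the given tasks that an agent passed."""
    return sum(results[t] for t in tasks) / len(tasks)


def kendall_tau(x: List[float], y: List[float]) -> float:
    """Kendall's tau-a between two score lists (assumed fidelity metric)."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical new agents. Full results are included here only to check how
# well the subset ranking agrees with the full-benchmark ranking.
new_agents: Dict[str, Dict[str, int]] = {
    "agent_a": {"t1": 1, "t2": 1, "t3": 1, "t4": 1, "t5": 1, "t6": 0, "t7": 1},
    "agent_b": {"t1": 1, "t2": 1, "t3": 0, "t4": 1, "t5": 0, "t6": 0, "t7": 1},
    "agent_c": {"t1": 1, "t2": 0, "t3": 0, "t4": 1, "t5": 0, "t6": 0, "t7": 0},
}

selected = mid_difficulty_tasks(history)
all_tasks = sorted(history)
full_scores = [mean_score(r, all_tasks) for r in new_agents.values()]
subset_scores = [mean_score(r, selected) for r in new_agents.values()]
tau = kendall_tau(full_scores, subset_scores)

print(f"kept {len(selected)} of {len(all_tasks)} tasks: {selected}")
print(f"rank agreement (Kendall tau-a): {tau:.2f}")
```

In practice a new agent would be run only on the `selected` tasks. The full-benchmark scores appear in this toy example solely to show how rank agreement between the subset and the full benchmark could be checked on held-out agents.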
