Fugu-MT Paper Translation (Abstract): An Empirical Study of Automating Agent Evaluation

Paper overview: An Empirical Study of Automating Agent Evaluation

arxiv url: http://arxiv.org/abs/2605.11378v1
Date: Tue, 12 May 2026 01:06:34 GMT
Status: translation complete
Last updated in system: 2026-05-13 21:48:56.50395
Title: An Empirical Study of Automating Agent Evaluation
Authors: Kang Zhou, Sangmin Woo, Haibo Ding, Kiran Ramnath, Subramanian Chidambaram, Aosong Feng, Vinayak Arannil, Muhyun Kim, Ishan Singh, Darren Wang, Zhichao Xu, Megha Gandhi, Nirmal Prabhu, Soumya Smruti Mishra, Vivek Singh, Gouri Pandeshwar, Lin Lee Cheong
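Abstract summary: We introduce EvalAgent, an AI assistant that automates the end-to-end agent evaluation pipeline. EvalAgent encodes evaluation domain expertise as evaluation skills. EvalAgent produces focused evaluations, improving Eval@1 from 17.5% to 65% and achieving 79.5% human expert preference over baseline approaches.
Reference score (attention score computed independently by this site): 14.239299198848764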
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Agent evaluation requires assessing complex multi-step behaviors involving tool use and intermediate reasoning, making it costly and expertise-intensive. A natural question arises: can frontier coding assistants reliably automate this evaluation process? Our study shows that simply prompting coding assistants is insufficient for this task. Without domain-specific evaluation knowledge, frontier coding assistants achieve only a 30% execution success rate and produce over-engineered evaluations averaging 12+ metrics per agent, indicating that strong coding ability does not automatically translate to reliable agent evaluation. We introduce EvalAgent, an AI assistant that automates the end-to-end agent evaluation pipeline. EvalAgent encodes evaluation domain expertise as evaluation skills (procedural instructions, reusable code and templates, and dynamically retrieved API documentation) that compose into a trace-based pipeline producing complete evaluation artifacts including metrics, executable code, and reports. To systematically assess generated evaluations, we introduce a meta-evaluation framework alongside AgentEvalBench, a benchmark comprising 20 agents, each paired with evaluation requirements and test scenarios. We further propose the Eval@1 metric to measure whether generated evaluation code both executes and yields meaningful results on the first run. Our experiments show that EvalAgent produces focused evaluations, improving Eval@1 from 17.5% to 65%, and achieving 79.5% human expert preference over baseline approaches. Further ablation studies show that evaluation skills are critical for handling complex evaluation: removing them causes Eval@1 to drop significantly from 65% to 30%.
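The abstract describes Eval@1 as measuring whether generated evaluation code both executes and yields meaningful results on the first run. A minimal Python sketch of that idea follows. It assumes each first run can be reduced to two boolean flags, and it treats Eval@1 as the fraction of runs where both hold. The `FirstRun` record, the `eval_at_1` function, and the sample data are hypothetical illustrations; the paper's exact definition of "meaningful" is not given in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FirstRun:
    """Outcome of the first run of one generated evaluation (hypothetical record)."""
    executed: bool    # the generated evaluation code ran without error
    meaningful: bool  # the run produced results judged meaningful (criterion assumed)


def eval_at_1(runs: list[FirstRun]) -> float:
    """Fraction of first runs that both executed and were judged meaningful."""
    if not runs:
        raise ValueError("need at least one run")
    passed = sum(1 for r in runs if r.executed and r.meaningful)
    return passed / len(runs)


# Made-up outcomes for illustration only; these are not results from the paper.
runs = [
    FirstRun(executed=True, meaningful=True),
    FirstRun(executed=True, meaningful=False),   # ran, but results not meaningful
    FirstRun(executed=False, meaningful=False),  # failed to execute
    FirstRun(executed=True, meaningful=True),
]
print(f"Eval@1 = {eval_at_1(runs):.1%}")  # prints "Eval@1 = 50.0%"
```

Under this reading, a run that executes but produces no meaningful result counts as a failure, which is why Eval@1 can sit below the plain execution success rate.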
