Fugu-MT Paper Translation (Abstract): Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation

Paper overview: Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation

arxiv url: http://arxiv.org/abs/2510.11977v1
Date: Mon, 13 Oct 2025 22:22:28 GMT
Status: Translation complete
Last updated in system: 2025-10-15 19:02:32.104587
Title: Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation
Title (reference translation): Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation
Authors: Sayash Kapoor, Benedikt Stroebl, Peter Kirgis, Nitya Nadgir, Zachary S Siegel, Boyi Wei, Tianci Xue, Ziru Chen, Felix Chen, Saiteja Utpala, Franck Ndzomga, Dheeraj Oruganty, Sophie Luskin, Kangheng Liu, Botao Yu, Amit Arora, Dongyoon Hahm, Harsh Trivedi, Huan Sun, Juyong Lee, Tengjun Jin, Yifan Mai, Yifei Zhou, Yuxuan Zhu, Rishi Bommasani, Daniel Kang, Dawn Song, Peter Henderson, Yu Su, Percy Liang, Arvind Narayanan
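Abstract summary: We provide a standardized evaluation harness that orchestrates parallel evaluations across hundreds of VMs. We conduct three-dimensional analysis spanning models, scaffolds, and benchmarks. Our analysis reveals surprising insights, such as higher reasoning effort reducing accuracy in the majority of runs.
Reference score (independently computed attention score): 87.47155146067962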
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: AI agents have been developed for complex real-world tasks from coding to customer service. But AI agent evaluations suffer from many challenges that undermine our understanding of how well agents really work. We introduce the Holistic Agent Leaderboard (HAL) to address these challenges. We make three main contributions. First, we provide a standardized evaluation harness that orchestrates parallel evaluations across hundreds of VMs, reducing evaluation time from weeks to hours while eliminating common implementation bugs. Second, we conduct three-dimensional analysis spanning models, scaffolds, and benchmarks. We validate the harness by conducting 21,730 agent rollouts across 9 models and 9 benchmarks in coding, web navigation, science, and customer service with a total cost of about $40,000. Our analysis reveals surprising insights, such as higher reasoning effort reducing accuracy in the majority of runs. Third, we use LLM-aided log inspection to uncover previously unreported behaviors, such as searching for the benchmark on HuggingFace instead of solving a task, or misusing credit cards in flight booking tasks. We share all agent logs, comprising 2.5B tokens of language model calls, to incentivize further research into agent behavior. By standardizing how the field evaluates agents and addressing common pitfalls in agent evaluation, we hope to shift the focus from agents that ace benchmarks to agents that work reliably in the real world.
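Abstract (reference translation): AI agents have been developed for complex real-world tasks, from coding to customer service. However, AI agent evaluations suffer from many challenges that undermine our understanding of how well agents actually work. To address these challenges, we introduce the Holistic Agent Leaderboard (HAL). We make three main contributions. First, we provide a standardized evaluation harness that orchestrates parallel evaluations across hundreds of VMs, cutting evaluation time from weeks to hours while eliminating common implementation bugs. Second, we conduct three-dimensional analysis spanning models, scaffolds, and benchmarks. We validate the harness by running 21,730 agent rollouts across 9 models and 9 benchmarks in coding, web navigation, science, and customer service, at a total cost of about $40,000. Our analysis reveals surprising insights, such as higher reasoning effort reducing accuracy in the majority of runs. Third, we use LLM-aided log inspection to uncover previously unreported behaviors, such as searching for the benchmark on HuggingFace instead of solving the task, or misusing credit cards in flight booking tasks. We share all agent logs, comprising 2.5B tokens of language model calls, to encourage further research into agent behavior. By standardizing how the field evaluates agents and addressing common pitfalls in agent evaluation, we hope to shift the focus from agents that ace benchmarks to agents that work reliably in the real world.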

Related papers