Fugu-MT Paper Translation (Abstract): EgoBench: An Interactive Egocentric Multimodal Benchmark for Tool-Using Agents

Paper abstract: EgoBench: An Interactive Egocentric Multimodal Benchmark for Tool-Using Agents

arxiv url: http://arxiv.org/abs/2605.27820v1
Date: Wed, 27 May 2026 01:28:15 GMT
Status: Translation complete
System update date: 2026-05-28 17:38:55.66968
Title: EgoBench: An Interactive Egocentric Multimodal Benchmark for Tool-Using Agents
Authors: Yunqi Liu, Tong Niu, Zitong Wang, Zhenlong Dai, Yuqi Qing, Weiqiang Wang, Jian Liu
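Abstract summary: We introduce EgoBench, an interactive multimodal benchmark for tool-using agents. We implement a three-stage synergistic pipeline through which each task is designed to enforce the joint application of visual perception and tool-augmented multi-hop reasoning. We also develop a multi-agent simulated user within EgoBench to evaluate agents' interaction capabilities.
Reference score (attention score computed independently by the site): 17.727481701114556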
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As AI agents increasingly operate in open, real-world environments, they require a deep synergy of multimodal perception, tool invocation with multi-hop reasoning, and dynamic interaction with users. However, existing benchmarks fail to jointly evaluate these capabilities due to challenges in designing strictly coupled multi-capability tasks, simulating natural and task-constrained user feedback, and ensuring objective evaluation of dynamic interaction. To bridge this gap, we introduce EgoBench, the first interactive multimodal benchmark for tool-using agents. EgoBench comprises 1,045 egocentric-video-grounded tasks covering four daily scenarios, along with a user-agent-tool interactive environment for evaluation. We implement a three-stage synergistic pipeline through which each task is designed to enforce the joint application of visual perception and tool-augmented multi-hop reasoning. We additionally develop a multi-agent simulated user within EgoBench to evaluate agents' interaction capabilities, which generates high-fidelity, task-aligned responses to agents. Furthermore, we establish a deterministic joint validation framework that guarantees objective assessment through process-based and result-based equivalence. Benchmarking eight SOTA video-MLLM agents on EgoBench reveals a severe performance ceiling: the best model achieves only 30.62% accuracy in the best-performing scenario, averaging 19.43% across all four scenarios. Finally, we conduct a multi-dimensional error analysis to disentangle failure modes, exposing capability bottlenecks for advancing future AI agents.
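
The abstract names a deterministic joint validation framework based on process-based and result-based equivalence but does not describe its mechanics. The sketch below is one possible reading, not the authors' implementation: a task counts as passed only if the agent's tool-call trace matches a reference trace (process) and the final environment state matches a reference state (result). The `ToolCall` schema, the exact-match rules, and all function names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One tool invocation (hypothetical schema, not from the paper)."""
    name: str
    args: tuple  # sorted (key, value) pairs, so equal calls compare equal


def make_call(name, **kwargs):
    # Sort arguments by key so keyword order does not affect comparison.
    return ToolCall(name, tuple(sorted(kwargs.items())))


def process_equivalent(agent_calls, reference_calls):
    # Assumed rule: same tool calls, same arguments, same order.
    return list(agent_calls) == list(reference_calls)


def result_equivalent(agent_state, reference_state):
    # Assumed rule: the final environment state must match exactly.
    return agent_state == reference_state


def task_passes(agent_calls, agent_state, reference_calls, reference_state):
    # Joint validation: both the process check and the result check must hold.
    return (process_equivalent(agent_calls, reference_calls)
            and result_equivalent(agent_state, reference_state))
```

Because both checks are plain equality comparisons, the verdict is deterministic for a given trace and state; a real framework might relax the process check, for example to allow reordering of independent calls.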
