Fugu-MT Paper Translation (Abstract): EgoEsportsQA: An Egocentric Video Benchmark for Perception and Reasoning in Esports

Paper overview: EgoEsportsQA: An Egocentric Video Benchmark for Perception and Reasoning in Esports

arxiv url: http://arxiv.org/abs/2604.12320v2
Date: Mon, 20 Apr 2026 10:38:20 GMT
Status: Translation complete
Last updated in system: 2026-04-21 19:27:32.399803
Title: EgoEsportsQA: An Egocentric Video Benchmark for Perception and Reasoning in Esports
Title (reference translation): EgoEsportsQA: An Egocentric Video Benchmark for Perception and Reasoning in Esports
Authors: Jianzhe Ma, Zhonghao Cao, Shangkui Chen, Yichen Xu, Wenxuan Wang, Qin Jin
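Abstract summary: EgoEsportsQA is a pioneering video question-answering (QA) benchmark for grounding perception and reasoning in expert esports knowledge. We curate 1,745 high-quality QA pairs from professional matches across 3 first-person shooter games via a scalable six-stage pipeline.
Reference score (attention score computed independently by the site): 45.11533142825268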
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While video large language models (Video-LLMs) excel in understanding slow-paced, real-world egocentric videos, their capabilities in high-velocity, information-dense virtual environments remain under-explored. Existing benchmarks focus on daily activities, yet lack a rigorous testbed for evaluating fast, rule-bound reasoning in virtual scenarios. To fill this gap, we introduce EgoEsportsQA, a pioneering video question-answering (QA) benchmark for grounding perception and reasoning in expert esports knowledge. We curate 1,745 high-quality QA pairs from professional matches across 3 first-person shooter games via a scalable six-stage pipeline. These questions are structured into a two-dimensional decoupled taxonomy: 11 sub-tasks in the cognitive capability dimension (covering perception and reasoning levels) and 6 sub-tasks in the esports knowledge dimension. Comprehensive evaluations of state-of-the-art Video-LLMs reveal that current models still fail to achieve satisfactory performance, with the best model only 71.58%. The results expose notable gaps across both axes: models exhibit stronger capabilities in basic visual perception than in deep tactical reasoning, and they grasp overall macro-progression better than fine-grained micro-operations. Extensive ablation experiments demonstrate the intrinsic weaknesses of current Video-LLM architectures. Further analysis suggests that our dataset not only reveals the connections between real-world and virtual egocentric domains, but also offers guidance for optimizing downstream esports applications, thereby fostering the future advancement of Video-LLMs in various egocentric environments.
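Abstract (reference translation): Video large language models (Video-LLMs) are good at understanding slow-paced, real-world egocentric videos, but their capabilities in high-velocity, information-dense virtual environments remain under-explored. Existing benchmarks focus on daily activities and lack a rigorous testbed for evaluating fast, rule-bound reasoning in virtual scenarios. To fill this gap, we introduce EgoEsportsQA, a pioneering video question-answering (QA) benchmark that grounds perception and reasoning in expert esports knowledge. We curate 1,745 high-quality QA pairs from professional matches across 3 first-person shooter games via a scalable six-stage pipeline. The questions are organized into a two-dimensional decoupled taxonomy: 11 sub-tasks in the cognitive capability dimension (covering the perception and reasoning levels) and 6 sub-tasks in the esports knowledge dimension. Comprehensive evaluation of state-of-the-art Video-LLMs shows that current models do not reach satisfactory performance, with the best model at only 71.58%. Models show stronger capabilities in basic visual perception than in deep tactical reasoning, and they grasp overall macro-progression better than fine-grained micro-operations. Extensive ablation experiments demonstrate the intrinsic weaknesses of current Video-LLM architectures. Further analysis suggests that the dataset not only reveals the connections between real-world and virtual egocentric domains, but also offers guidance for optimizing downstream esports applications, thereby fostering the future development of Video-LLMs in various egocentric environments.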

Related papers