Fugu-MT Paper Translation (Abstract): GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents

Paper overview: GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents

arxiv url: http://arxiv.org/abs/2603.24329v1
Date: Wed, 25 Mar 2026 14:10:45 GMT
Status: Translation complete
Last updated in system: 2026-03-26 21:06:11.323645
Title: GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents
Title (reference translation): GameplayQA: A Benchmarking Framework for Decision-Dense, POV-Synced Multi-Video Understanding of 3D Virtual Agents
Authors: Yunzhe Wang, Runhui Xu, Kexin Zheng, Tianyi Zhang, Jayavibhav Niranjan Kogundi, Soham Hans, Volkan Ustun
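Abstract summary: This paper introduces GameplayQA, a framework for evaluating agentic-centric perception and reasoning through video understanding. Multiplayer 3D gameplay videos are densely annotated at 1.22 labels/second with time-synced, concurrent captions of states, actions, and events structured around a triadic system of Self, Other Agents, and the World. From these annotations, the authors refined 2.4K diagnostic QA pairs organized into three levels of cognitive complexity, accompanied by a structured distractor taxonomy.
Reference score (independently computed attention score): 4.920953895710103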
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal LLMs are increasingly deployed as perceptual backbones for autonomous agents in 3D environments, from robotics to virtual worlds. These applications require agents to perceive rapid state changes, attribute actions to the correct entities, and reason about concurrent multi-agent behaviors from a first-person perspective, capabilities that existing benchmarks do not adequately evaluate. We introduce GameplayQA, a framework for evaluating agentic-centric perception and reasoning through video understanding. Specifically, we densely annotate multiplayer 3D gameplay videos at 1.22 labels/second, with time-synced, concurrent captions of states, actions, and events structured around a triadic system of Self, Other Agents, and the World, a natural decomposition for multi-agent environments. From these annotations, we refined 2.4K diagnostic QA pairs organized into three levels of cognitive complexity, accompanied by a structured distractor taxonomy that enables fine-grained analysis of where models hallucinate. Evaluation of frontier MLLMs reveals a substantial gap from human performance, with common failures in temporal and cross-video grounding, agent-role attribution, and handling the decision density of the game. We hope GameplayQA stimulates future research at the intersection of embodied AI, agentic perception, and world modeling.
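Abstract (reference translation): Multimodal LLMs are increasingly deployed as perceptual backbones for autonomous agents in 3D environments, from robotics to virtual worlds. These applications require agents to perceive rapid state changes, attribute actions to the correct entities, and reason about concurrent multi-agent behaviors from a first-person perspective, capabilities that existing benchmarks do not adequately evaluate. This paper introduces GameplayQA, a framework for evaluating agentic-centric perception and reasoning through video understanding. Specifically, multiplayer 3D gameplay videos are densely annotated at 1.22 labels/second with time-synced, concurrent captions of states, actions, and events, structured around a triadic system of Self, Other Agents, and the World, which is a natural decomposition for multi-agent environments. From these annotations, the authors refined 2.4K diagnostic QA pairs organized into three levels of cognitive complexity, accompanied by a structured distractor taxonomy that enables fine-grained analysis of where models hallucinate. Evaluation of frontier MLLMs reveals a substantial gap from human performance, with common failures in temporal and cross-video grounding, agent-role attribution, and handling the decision density of the game. The authors hope GameplayQA stimulates future research at the intersection of embodied AI, agentic perception, and world modeling.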
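The annotation scheme described in the abstract (time-synced, concurrent labels of states, actions, and events, each attributed to Self, Other Agents, or the World) can be pictured as a simple timeline record. The following is a minimal Python sketch, not the authors' actual data format: the class names, field names, and example captions are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Entity(Enum):
    """Triadic attribution named in the abstract."""
    SELF = "self"
    OTHER_AGENT = "other_agent"
    WORLD = "world"


class LabelType(Enum):
    """Kinds of caption named in the abstract."""
    STATE = "state"
    ACTION = "action"
    EVENT = "event"


@dataclass(frozen=True)
class Label:
    # All field names are assumptions; the paper's schema may differ.
    video_id: str      # which POV video the label belongs to
    start_s: float     # label start time in seconds
    end_s: float       # label end time in seconds (exclusive)
    entity: Entity
    label_type: LabelType
    caption: str


def label_density(labels, duration_s):
    """Return labels per second over a clip of the given duration."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return len(labels) / duration_s


def concurrent_labels(labels, t):
    """Return every label whose interval covers time t (start <= t < end)."""
    return [lab for lab in labels if lab.start_s <= t < lab.end_s]


# Invented example data, not taken from the benchmark.
example_labels = [
    Label("pov_a", 0.0, 2.0, Entity.SELF, LabelType.ACTION, "reloads weapon"),
    Label("pov_a", 1.0, 3.0, Entity.OTHER_AGENT, LabelType.ACTION, "teammate jumps"),
    Label("pov_b", 1.5, 1.6, Entity.WORLD, LabelType.EVENT, "door opens"),
]
```

Under this sketch, the 1.22 labels/second figure quoted in the abstract would correspond to `label_density` computed over the annotated footage, and `concurrent_labels` illustrates why overlapping labels from several entities can be active at the same instant.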


Related papers