Fugu-MT Paper Translation (Summary): GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents

Paper summary: GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents

arxiv url: http://arxiv.org/abs/2604.07429v1
Date: Wed, 08 Apr 2026 17:49:03 GMT
Status: Translation complete
Last updated in system: 2026-04-10 18:34:05.478017
Title: GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents
Title (reference translation): GameWorld: Toward Standardization and Verification of Multimodal Game Agents
Authors: Mingyu Ouyang, Siyuan Hu, Kevin Qinghong Lin, Hwee Tou Ng, Mike Zheng Shou
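Abstract summary: GameWorld is a benchmark for evaluating Multimodal Large Language Model (MLLM) game agents in browser environments. Two game agent interfaces are studied: (i) computer-use agents that directly emit keyboard and mouse controls, and (ii) generalist multimodal agents that act in a semantic action space. Results across 18 model-interface pairs suggest that even the best-performing agent is far from achieving human capabilities on video games.
Reference score (independently computed attention score): 76.60994803070436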
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Towards an embodied generalist for real-world interaction, Multimodal Large Language Model (MLLM) agents still suffer from challenging latency, sparse feedback, and irreversible mistakes. Video games offer an ideal testbed with rich visual observations and closed-loop interaction, demanding fine-grained perception, long-horizon planning, and precise control. However, systematically evaluating these capabilities is currently hindered by heterogeneous action interfaces and heuristic verification. To this end, we introduce GameWorld, a benchmark designed for standardized and verifiable evaluation of MLLMs as generalist game agents in browser environments. Two game agent interfaces are studied: (i) computer-use agents that directly emit keyboard and mouse controls, and (ii) generalist multimodal agents that act in a semantic action space via deterministic Semantic Action Parsing. GameWorld contains 34 diverse games and 170 tasks, each paired with state-verifiable metrics for outcome-based evaluation. The results across 18 model-interface pairs suggest that even the best performing agent is far from achieving human capabilities on video games. Extensive experiments of repeated full-benchmark reruns demonstrate the robustness of the benchmark, while further studies on real-time interaction, context-memory sensitivity, and action validity expose more challenges ahead for game agents. Together, by offering a standardized, verifiable, and reproducible evaluation framework, GameWorld lays a robust foundation for advancing research on multimodal game agents and beyond. The project page is at https://gameworld-bench.github.io.
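Abstract (reference translation): Toward an embodied generalist for real-world interaction, Multimodal Large Language Model (MLLM) agents still suffer from challenging latency, sparse feedback, and irreversible mistakes. Video games offer an ideal testbed with rich visual observations and closed-loop interaction. However, systematic evaluation of these capabilities is currently hindered by heterogeneous action interfaces and heuristic verification. We therefore introduce GameWorld, a benchmark for standardized and verifiable evaluation of MLLMs as generalist game agents in browser environments. Two game agent interfaces are studied: (i) computer-use agents that directly emit keyboard and mouse controls, and (ii) generalist multimodal agents that act in a semantic action space via deterministic Semantic Action Parsing. GameWorld contains 34 diverse games and 170 tasks, each paired with state-verifiable metrics for outcome-based evaluation. Results across 18 model-interface pairs suggest that even the best-performing agent is far from achieving human capabilities on video games. Experiments with repeated full-benchmark reruns demonstrate the robustness of the benchmark, while further studies on real-time interaction, context-memory sensitivity, and action validity highlight the challenges ahead for game agents. By offering a standardized, verifiable, and reproducible evaluation framework, GameWorld lays a robust foundation for advancing research on multimodal game agents and beyond. The project page is at https://gameworld-bench.github.io.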

