Fugu-MT Paper Translation (Abstract): Yanyun-3: Enabling Cross-Platform Strategy Game Operation with Vision-Language Models

Paper overview: Yanyun-3: Enabling Cross-Platform Strategy Game Operation with Vision-Language Models

arxiv url: http://arxiv.org/abs/2511.12937v1
Date: Mon, 17 Nov 2025 03:45:15 GMT
Status: Translation complete
Last updated in system: 2025-11-18 14:36:24.643035
Title: Yanyun-3: Enabling Cross-Platform Strategy Game Operation with Vision-Language Models
Title (reference translation): Yanyun-3: Enabling Cross-Platform Strategy Game Operation with Vision-Language Models
Authors: Guoyan Wang, Yanyan Huang, Chunlin Chen, Lifeng Wang, Yuxiang Sun
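Abstract summary: This paper introduces Yanyun-3, a general-purpose agent framework that enables autonomous cross-platform operation in strategy games. By integrating the vision-language reasoning of Qwen2.5-VL with the precise execution capabilities of UI-TARS, Yanyun-3 successfully performs core tasks. A hybrid strategy that fuses multi-image and video data while mixing in static images (MV+S) was found to substantially outperform full fusion.
Reference score (attention score computed independently by this site): 30.591909012704978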
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Automated operation in cross-platform strategy games demands agents with robust generalization across diverse user interfaces and dynamic battlefield conditions. While vision-language models (VLMs) have shown considerable promise in multimodal reasoning, their application to complex human-computer interaction scenarios--such as strategy gaming--remains largely unexplored. Here, we introduce Yanyun-3, a general-purpose agent framework that, for the first time, enables autonomous cross-platform operation across three heterogeneous strategy game environments. By integrating the vision-language reasoning of Qwen2.5-VL with the precise execution capabilities of UI-TARS, Yanyun-3 successfully performs core tasks including target localization, combat resource allocation, and area control. Through systematic ablation studies, we evaluate the effects of various multimodal data combinations--static images, multi-image sequences, and videos--and propose the concept of combination granularity to differentiate between intra-sample fusion and inter-sample mixing strategies. We find that a hybrid strategy, which fuses multi-image and video data while mixing in static images (MV+S), substantially outperforms full fusion: it reduces inference time by 63% and raises the BLEU-4 score from 4.81% to 62.41% (approximately 12.98x). Operating via a closed-loop pipeline of screen capture, model inference, and action execution, the agent demonstrates strong real-time performance and cross-platform generalization. Beyond providing an efficient solution for strategy game automation, our work establishes a general paradigm for enhancing VLM performance through structured multimodal data organization, offering new insights into the interplay between static perception and dynamic reasoning in embodied intelligence.
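The distinction the abstract draws between intra-sample fusion and inter-sample mixing can be illustrated with a small sketch. This is only one possible reading of "combination granularity", not the authors' data pipeline: the `Sample` class, the `fuse` and `mix` helpers, the index-based pairing of samples, and the file names are all hypothetical and introduced here for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sample:
    """One training sample; each list holds file paths for one modality (hypothetical layout)."""
    static_images: List[str] = field(default_factory=list)
    multi_images: List[str] = field(default_factory=list)
    videos: List[str] = field(default_factory=list)


def fuse(*samples: Sample) -> Sample:
    """Intra-sample fusion: merge the modalities of several samples into ONE sample."""
    fused = Sample()
    for s in samples:
        fused.static_images += s.static_images
        fused.multi_images += s.multi_images
        fused.videos += s.videos
    return fused


def mix(*groups: List[Sample]) -> List[Sample]:
    """Inter-sample mixing: keep samples separate and pool them into one dataset."""
    return [s for group in groups for s in group]


def build_mv_plus_s(multi, video, static) -> List[Sample]:
    """MV+S as read from the abstract: fuse multi-image with video, then mix in static samples."""
    fused_mv = [fuse(m, v) for m, v in zip(multi, video)]  # pairing by index is an assumption
    return mix(fused_mv, static)


def build_full_fusion(multi, video, static) -> List[Sample]:
    """Full fusion: all three modalities end up inside every sample."""
    return [fuse(m, v, s) for m, v, s in zip(multi, video, static)]


# Toy data with made-up file names.
multi = [Sample(multi_images=["a_1.png", "a_2.png"])]
video = [Sample(videos=["a.mp4"])]
static = [Sample(static_images=["s.png"])]

mv_plus_s = build_mv_plus_s(multi, video, static)
full = build_full_fusion(multi, video, static)
```

Under these assumptions, `mv_plus_s` holds two samples (one fused multi-image plus video sample, one static-only sample), while `full` holds a single sample carrying all three modalities.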
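Abstract (reference translation): Automated operation in cross-platform strategy games demands agents that generalize robustly across diverse user interfaces and dynamic battlefield conditions. Vision-language models (VLMs) show considerable promise in multimodal reasoning, but their application to complex human-computer interaction scenarios such as strategy gaming remains largely unexplored. This paper introduces Yanyun-3, a general-purpose agent framework that enables autonomous cross-platform operation across three heterogeneous strategy game environments. By integrating the vision-language reasoning of Qwen2.5-VL with the precise execution capabilities of UI-TARS, Yanyun-3 successfully performs core tasks such as target localization, combat resource allocation, and area control. Through systematic ablation studies, we evaluate the effects of multimodal data combinations of static images, multi-image sequences, and videos, and propose the concept of combination granularity to distinguish intra-sample fusion from inter-sample mixing strategies. A hybrid strategy that fuses multi-image and video data while mixing in static images (MV+S) reduces inference time by 63% and raises the BLEU-4 score from 4.81% to 62.41% (approximately 12.98x). Operating via a closed-loop pipeline of screen capture, model inference, and action execution, the agent demonstrates strong real-time performance and cross-platform generalization. Beyond providing an efficient solution for strategy game automation, our work establishes a general paradigm for enhancing VLM performance through structured multimodal data organization and offers new insights into the interplay between static perception and dynamic reasoning in embodied intelligence.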
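The closed-loop pipeline of screen capture, model inference, and action execution named in the abstract can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the Yanyun-3 implementation: `run_closed_loop`, its three callables, the `"done"` stop signal, and the step limit are all hypothetical, and the stubs below stand in for a real screen grabber, a VLM, and an input controller.

```python
from typing import Callable, List


def run_closed_loop(
    capture: Callable[[], str],
    infer: Callable[[str], str],
    execute: Callable[[str], None],
    max_steps: int,
) -> List[str]:
    """Repeat capture -> infer -> execute until the model signals "done" or max_steps is hit.

    Returns the list of actions that were executed, in order.
    """
    history: List[str] = []
    for _ in range(max_steps):
        frame = capture()        # 1. screen capture
        action = infer(frame)    # 2. model inference on the captured frame
        if action == "done":     # hypothetical stop signal, not taken from the paper
            break
        execute(action)          # 3. action execution changes what the next capture sees
        history.append(action)
    return history


# Stub components so the loop can run without a game, a model, or a GUI.
frames = iter(["screen_0", "screen_1", "screen_2"])
executed: List[str] = []

history = run_closed_loop(
    capture=lambda: next(frames),
    infer=lambda frame: "done" if frame == "screen_2" else f"click_on_{frame}",
    execute=executed.append,
    max_steps=10,
)
```

Passing the three stages in as callables keeps the loop independent of any particular platform, which is one plausible way to approach the cross-platform operation the abstract describes.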


Related papers