Fugu-MT Paper Translation (Summary): Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

Paper summary: Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning

arxiv url: http://arxiv.org/abs/2605.00347v1
Date: Fri, 01 May 2026 02:05:56 GMT
Status: Translation complete
Last updated in system: 2026-05-04 17:43:28.819482
Title: Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning
Authors: Chengshuai Shi, Wenzhe Li, Xinran Liang, Yizhou Lu, Wenjia Yang, Ruirong Feng, Seth Karten, Ziran Yang, Zihan Ding, Gabriel Sarch, Danqi Chen, Karthik Narasimhan, Chi Jin
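Abstract summary: This work studies the training of vision-language models (VLMs) for long-horizon decision-making in Super Mario Land. It proposes an adapted variant of PPO with a lightweight turn-level critic, which substantially improves training stability and sample efficiency. It introduces Odysseus, an open training framework for VLM agents, which achieves substantial gains across multiple levels of the game.
Reference score (attention score computed independently by the site): 50.464623632604976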
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Given the rapidly growing capabilities of vision-language models (VLMs), extending them to interactive decision-making tasks such as video games has emerged as a promising frontier. However, existing approaches either rely on large-scale supervised fine-tuning (SFT) on human trajectories or apply reinforcement learning (RL) only in relatively short-horizon settings (typically around 20-30 turns). In this work, we study RL-based training of VLMs for long-horizon decision-making in Super Mario Land, a visually grounded environment requiring 100+ turns of interaction with coordinated perception, reasoning, and action. We begin with a systematic investigation of key algorithmic components and propose an adapted variant of PPO with a lightweight turn-level critic, which substantially improves training stability and sample efficiency over critic-free methods such as GRPO and Reinforce++. We further show that pretrained VLMs provide strong action priors, significantly improving sample efficiency during RL training and reducing the need for manual design choices such as action engineering, compared to classical deep RL trained from scratch. Building on these insights, we introduce Odysseus, an open training framework for VLM agents, which achieves substantial gains across multiple levels of the game and at least 3 times the average game progress of frontier models. Moreover, the trained models exhibit consistent improvements under both in-game and cross-game generalization settings, while maintaining general-domain capabilities. Overall, our results identify key ingredients for making RL stable and effective in long-horizon, multi-modal settings, and provide practical guidance for developing VLMs as embodied agents.
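The abstract contrasts a turn-level critic with critic-free methods such as GRPO, but it does not say how the critic enters the PPO update. The sketch below is only an illustration of the general idea, assuming standard generalized advantage estimation (GAE) with one value estimate per turn, next to a GRPO-style group-normalized return that assigns a single scalar to a whole episode. The function names and the `gamma` and `lam` defaults are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def turn_level_gae(
    rewards: Sequence[float],
    values: Sequence[float],
    gamma: float = 0.99,  # assumed discount, not from the paper
    lam: float = 0.95,    # assumed GAE lambda, not from the paper
) -> List[float]:
    """Standard GAE computed once per turn (not per token).

    `values` holds one critic estimate per turn plus a final bootstrap
    value, so len(values) == len(rewards) + 1.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each turn accumulates discounted future TD errors.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def group_normalized_returns(episode_returns: Sequence[float]) -> List[float]:
    """Critic-free, GRPO-style baseline: one advantage per episode,
    normalized against the other episodes in the same group."""
    n = len(episode_returns)
    mean = sum(episode_returns) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in episode_returns) / n)
    return [(r - mean) / (std + 1e-8) for r in episode_returns]
```

With a per-turn critic, each of the 100+ turns in an episode gets its own advantage signal, whereas the group-normalized baseline gives every turn of an episode the same scalar. Whether this is the mechanism behind the stability gains reported in the abstract is not something the abstract itself establishes.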


Related papers