Fugu-MT Paper Translation (Abstract): Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

Paper summary: Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

arxiv url: http://arxiv.org/abs/2606.19338v1
Date: Wed, 17 Jun 2026 17:59:34 GMT
Status: translation complete
Last updated in system: 2026-06-18 17:16:51.306328
Title: Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games
Authors: Shengyuan Ding, Xilin Wei, Xinyu Fang, Haodong Duan, Dahua Lin, Jiaqi Wang, Yuhang Zang
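Abstract summary: RNG-Bench is a benchmark suite designed to isolate a base model's ability to reconstruct past observations. RNG-Bench includes two complementary games: Matching Pairs, where card identities briefly revealed at specific locations must later be recalled, and 3D Maze, where egocentric views must be integrated into a spatial map. The hardest configurations require contexts of roughly 128K tokens and 350 image inputs, and remain far from saturated by frontier MLLMs.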
Reference score (independently computed attention score): 69.57330692969543
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Deploying multimodal foundation models as closed-loop policies increasingly requires conditioning actions on observations that are no longer visible. However, existing benchmarks either expose the full state, conflate hidden-state reconstruction with other agent skills, or test recall only after an episode has ended. We introduce RNG-Bench (Reconstructive Non-Markov Games), a benchmark suite designed to isolate a base model's ability to reconstruct past observations and act on them during multi-step interaction. RNG-Bench includes two complementary games: Matching Pairs, where card identities briefly revealed at specific locations must later be recalled, and 3D Maze, where egocentric views must be integrated into a spatial map. Both games are evaluated under a unified harness with three controlled difficulty axes: grid size, visual pattern, and observation modality. The benchmark further introduces a head-to-head duel protocol to control for instance-level variance and a Memory Gap metric that disentangles forgetting from poor action selection. The hardest configurations require contexts of roughly 128K tokens and 350 image inputs per episode, and remain far from saturated by frontier MLLMs. Memory Gap analysis shows that most residual errors stem from forgetting earlier observations rather than from suboptimal decision making. Finally, fine-tuning Qwen3.5-9B on optimal-policy rollouts and filtered model demonstrations improves performance on RNG-Bench and transfers to existing benchmarks without degrading general multimodal capability.
