Fugu-MT Paper Translation (Abstract): Fast-WAM: Do World Action Models Need Test-time Future Imagination?

Paper overview: Fast-WAM: Do World Action Models Need Test-time Future Imagination?

arxiv url: http://arxiv.org/abs/2603.16666v1
Date: Tue, 17 Mar 2026 15:33:43 GMT
Status: translation complete
Last updated in system: 2026-03-18 17:42:07.377303
Title: Fast-WAM: Do World Action Models Need Test-time Future Imagination?
Authors: Tianyuan Yuan, Zibin Dong, Yicheng Liu, Hang Zhao
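Abstract summary: World Action Models (WAMs) have emerged as a promising alternative to Vision-Language-Action (VLA) models for embodied control. We ask whether WAMs need explicit future imagination at test time, or whether their benefit comes primarily from video modeling during training. We propose Fast-WAM, a WAM architecture that retains video co-training during training but skips future prediction at test time.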
Reference score (independently computed attention score): 39.17692664456295
License: http://creativecommons.org/licenses/by/4.0/
Abstract: World Action Models (WAMs) have emerged as a promising alternative to Vision-Language-Action (VLA) models for embodied control because they explicitly model how visual observations may evolve under action. Most existing WAMs follow an imagine-then-execute paradigm, incurring substantial test-time latency from iterative video denoising, yet it remains unclear whether explicit future imagination is actually necessary for strong action performance. In this paper, we ask whether WAMs need explicit future imagination at test time, or whether their benefit comes primarily from video modeling during training. We disentangle the role of video modeling during training from explicit future generation during inference by proposing Fast-WAM, a WAM architecture that retains video co-training during training but skips future prediction at test time. We further instantiate several Fast-WAM variants to enable a controlled comparison of these two factors. Across these variants, we find that Fast-WAM remains competitive with imagine-then-execute variants, while removing video co-training causes a much larger performance drop. Empirically, Fast-WAM achieves competitive results with state-of-the-art methods both on simulation benchmarks (LIBERO and RoboTwin) and real-world tasks, without embodied pretraining. It runs in real time with 190ms latency, over 4× faster than existing imagine-then-execute WAMs. These results suggest that the main value of video prediction in WAMs may lie in improving world representations during training rather than generating future observations at test time. Project page: https://yuantianyuan01.github.io/FastWAM/
