Fugu-MT Paper Translation (Abstract): Do World Action Models Generalize Better than VLAs? A Robustness Study

Paper summary: Do World Action Models Generalize Better than VLAs? A Robustness Study

arxiv url: http://arxiv.org/abs/2603.22078v2
Date: Wed, 01 Apr 2026 01:49:52 GMT
Status: Translation complete
Last updated in system: 2026-04-02 16:44:31.562127
Title: Do World Action Models Generalize Better than VLAs? A Robustness Study
Title (reference translation): Do World Action Models Generalize Better than VLAs? A Robustness Study
Authors: Zhanguang Zhang, Zhiyuan Li, Behnam Rahmati, Rui Heng Yang, Yintao Ma, Amir Rasouli, Sajjad Pakdamansavoji, Yangzheng Wu, Lingfeng Zhang, Tongtong Cao, Feng Wen, Xinyu Wang, Xingyue Quan, Yingxue Zhang
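Abstract summary: Vision-language-action (VLA) models have achieved notable success across a variety of robotic tasks. World action models (WAMs) are built upon world models trained on large corpora of video data to predict future states. The study evaluates both model families on the LIBERO-Plus and RoboTwin 2.0-Plus benchmarks under various visual and language perturbations.
Reference score (attention score computed independently by this site): 25.418384276142223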
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Robot action planning in the real world is challenging as it requires not only understanding the current state of the environment but also predicting how it will evolve in response to actions. Vision-language-action (VLA) models, which repurpose large-scale vision-language models for robot action generation using action experts, have achieved notable success across a variety of robotic tasks. Nevertheless, their performance remains constrained by the scope of their training data, exhibiting limited generalization to unseen scenarios and vulnerability to diverse contextual perturbations. More recently, world models have been revisited as an alternative to VLAs. These models, referred to as world action models (WAMs), are built upon world models that are trained on large corpora of video data to predict future states. With minor adaptations, their latent representation can be decoded into robot actions. It has been suggested that their explicit dynamic prediction capacity, combined with spatiotemporal priors acquired from web-scale video pretraining, enables WAMs to generalize more effectively than VLAs. In this paper, we conduct a comparative study of prominent state-of-the-art VLA policies and recently released WAMs. We evaluate their performance on the LIBERO-Plus and RoboTwin 2.0-Plus benchmarks under various visual and language perturbations. Our results show that WAMs achieve strong robustness, with LingBot-VA reaching 74.2% success rate on RoboTwin 2.0-Plus and Cosmos-Policy achieving 82.2% on LIBERO-Plus. While VLAs such as $\pi_{0.5}$ can achieve comparable robustness on certain tasks, they typically require extensive training with diverse robotic datasets and varied learning objectives. Hybrid approaches that partially incorporate video-based dynamic learning exhibit intermediate robustness, highlighting the importance of how video priors are integrated.
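Abstract (reference translation): Robot action planning in the real world is difficult because it requires not only understanding the current state of the environment but also predicting how that state will evolve in response to actions. Vision-language-action (VLA) models, which repurpose large-scale vision-language models for robot action generation using action experts, have achieved notable success across a variety of robotic tasks. Even so, their performance remains constrained by the scope of their training data: they generalize only to a limited degree to unseen scenarios and are vulnerable to diverse contextual perturbations. Recently, world models have been revisited as an alternative to VLAs. These models, called world action models (WAMs), are built on world models trained on large corpora of video data to predict future states. With minor adaptations, their latent representation can be decoded into robot actions. It has been suggested that their explicit dynamic prediction capacity, combined with spatiotemporal priors acquired from web-scale video pretraining, allows WAMs to generalize more effectively than VLAs. This paper presents a comparative study of prominent state-of-the-art VLA policies and recently released WAMs, evaluating their performance on the LIBERO-Plus and RoboTwin 2.0-Plus benchmarks under various visual and language perturbations. The results show that WAMs achieve strong robustness: LingBot-VA reaches a 74.2% success rate on RoboTwin 2.0-Plus and Cosmos-Policy reaches 82.2% on LIBERO-Plus. VLAs such as $\pi_{0.5}$ can achieve comparable robustness on certain tasks, but they typically require extensive training with diverse robotic datasets and varied learning objectives. Hybrid approaches that partially incorporate video-based dynamic learning show intermediate robustness, which highlights the importance of how video priors are integrated.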

Related papers