Fugu-MT Paper Translation (Abstract): Mask World Model: Predicting What Matters for Robust Robot Policy Learning

Paper overview: Mask World Model: Predicting What Matters for Robust Robot Policy Learning

arxiv url: http://arxiv.org/abs/2604.19683v1
Date: Tue, 21 Apr 2026 17:05:37 GMT
Status: Translation complete
Last updated in system: 2026-04-22 22:41:49.891782
Title: Mask World Model: Predicting What Matters for Robust Robot Policy Learning
Title (reference translation): Mask World Model: Predicting What Is Needed for Robust Robot Policy Learning
Authors: Yunfan Lou, Xiaowei Chi, Xiaojie Zhang, Zezhong Qian, Chengxuan Li, Rongyu Zhang, Yaoxu Lyu, Guoyu Song, Chuyao Fu, Haoxuan Xu, Pengwei Wang, Shanghang Zhang
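Abstract summary: The Mask World Model (MWM) is a world model that uses video diffusion architectures to predict the evolution of semantic masks instead of pixels. MWM shows superior generalization and robust resilience to texture information loss.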
Reference score (independently computed attention score): 31.96162737409163
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: World models derived from large-scale video generative pre-training have emerged as a promising paradigm for generalist robot policy learning. However, standard approaches often focus on high-fidelity RGB video prediction, which can result in overfitting to irrelevant factors, such as dynamic backgrounds and illumination changes. These distractions reduce the model's ability to generalize, ultimately leading to unreliable and fragile control policies. To address this, we introduce the Mask World Model (MWM), which leverages video diffusion architectures to predict the evolution of semantic masks instead of pixels. This shift imposes a geometric information bottleneck, forcing the model to capture essential physical dynamics and contact relations while filtering out visual noise. We seamlessly integrate this mask dynamics backbone with a diffusion-based policy head to enable robust end-to-end control. Extensive evaluations demonstrate the superiority of MWM on the LIBERO and RLBench simulation benchmarks, significantly outperforming the state-of-the-art RGB-based world models. Furthermore, real-world experiments and robustness evaluation (via random token pruning) reveal that MWM exhibits superior generalization capabilities and robust resilience to texture information loss.
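Abstract (reference translation): World models derived from large-scale video generative pre-training have emerged as a promising paradigm for generalist robot policy learning. However, standard approaches often focus on high-fidelity RGB video prediction, which can lead to overfitting to irrelevant factors such as dynamic backgrounds and illumination changes. These distractions reduce the model's ability to generalize, ultimately leading to unreliable and fragile control policies. To address this, we introduce the Mask World Model (MWM), which leverages video diffusion architectures to predict the evolution of semantic masks instead of pixels. This shift imposes a geometric information bottleneck, forcing the model to capture essential physical dynamics and contact relations while filtering out visual noise. We seamlessly integrate these mask dynamics with a diffusion-based policy head, enabling robust end-to-end control. On the LIBERO and RLBench simulation benchmarks, MWM outperforms state-of-the-art RGB-based world models. Furthermore, real-world experiments and robustness evaluation (via random token pruning) show that MWM exhibits better generalization and robust resilience to texture information loss.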
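
The abstract mentions a robustness evaluation via random token pruning but does not describe the procedure. As a rough illustration only, the sketch below shows one way such pruning could look: a random subset of input tokens is kept and the rest are dropped before the model sees them. The function name `prune_tokens`, the `keep_ratio` and `seed` parameters, and the string tokens are all hypothetical and are not taken from the paper.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def prune_tokens(tokens: Sequence[T], keep_ratio: float, seed: int = 0) -> List[T]:
    """Keep a random subset of tokens, preserving their original order.

    Hypothetical helper: the paper's actual pruning procedure is not
    specified in the abstract.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if len(tokens) == 0:
        return []
    rng = random.Random(seed)  # local RNG so results are reproducible
    n_keep = max(1, round(len(tokens) * keep_ratio))
    # Sample indices without replacement, then sort to keep the token order.
    kept_idx = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in kept_idx]


# Hypothetical usage: 16 placeholder visual tokens, half dropped at random.
tokens = [f"tok{i}" for i in range(16)]
pruned = prune_tokens(tokens, keep_ratio=0.5)
```

In an evaluation of this kind, one would presumably sweep `keep_ratio` downward and track task success, so that a slower decline indicates less reliance on fine texture detail.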

Related papers