Fugu-MT paper translation (abstract): MV-WAM: Manifold-Aware World Action Model with Value Augmentation

Paper overview: MV-WAM: Manifold-Aware World Action Model with Value Augmentation

arxiv url: http://arxiv.org/abs/2606.21088v1
Date: Fri, 19 Jun 2026 04:35:18 GMT
Status: translation complete
System update date: 2026-06-26 08:28:39.135252
Title: MV-WAM: Manifold-Aware World Action Model with Value Augmentation
Authors: Jintao Chen, Peidong Jia, Qingpo Wuwu, Jiaming Liu, Mengfei Du, Chun-Kai Fan, Xiaowei Chi, Hao Chen, Chengyu Bai, Zezhong Qian, Hao Wang, Jiajun Cao, Weishi Mi, Xiaozhu Ju, Jian Tang, Shanghang Zhang
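Abstract summary: This paper proposes MV-WAM, a novel end-to-end framework that jointly models visual prediction, action generation, and value estimation. MV-WAM achieves a 77.5% mean success rate across four real-world tasks of varying difficulty on a dual-arm robot.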
Reference score (site-computed attention score): 45.76770487270348
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Achieving robust and generalizable manipulation across diverse environments remains a fundamental challenge in embodied robotics. Recent world action models achieve strong in-domain performance, yet their gains do not extend proportionally to out-of-distribution scenarios. We attribute this to a structural mismatch between visual and action modalities, whose intrinsically heterogeneous manifolds cause joint optimization to disproportionately degrade action robustness under distribution shift. To address this, we propose MV-WAM, a novel end-to-end framework that jointly models visual prediction, action generation, and value estimation designed to effectively leverage video priors during both training and inference for enhanced action generalization. Key to this unification is a cross-modality causal mask that hierarchically grounds actions in predicted video frames and value function tokens in both modalities. To further narrow the generalization gap, MV-WAM adopts a manifold-aware optimization scheme that explicitly accounts for the structural heterogeneity across modalities. Finally, MV-WAM introduces a progress-value regulation mechanism that estimates task completion and detects misalignment between predicted frames and generated actions, enabling the policy to autonomously identify execution deviations and recover through value-guided rollback. On the RoboTwin simulation, MV-WAM achieves a 55.7% mean success rate on random scenarios without any randomized action supervision, outperforming the strongest baseline by 29.3%. MV-WAM achieves a 77.5% mean success rate across four real-world tasks of varying difficulty on a dual-arm robot. Our results demonstrate that manifold-aware cross-modal alignment is essential for robust policy generalization, offering a path toward deployable robotic manipulation.
