Fugu-MT Paper Translation (Abstract): MotuBrain: An Advanced World Action Model for Robot Control

Paper overview: MotuBrain: An Advanced World Action Model for Robot Control

arxiv url: http://arxiv.org/abs/2604.27792v2
Date: Fri, 01 May 2026 08:30:06 GMT
Status: Translation complete
Last updated in system: 2026-05-04 13:37:10.935567
Title: MotuBrain: An Advanced World Action Model for Robot Control
Title (reference translation): MotuBrain: An Advanced World Action Model for Robot Control
Authors: MotuBrain Team, Chendong Xiang, Fan Bao, Haitian Liu, Hengkai Tan, Hongzhe Bi, James Li, Jiabao Liu, Jingrui Pang, Kiro Jing, Louis Liu, Mengchen Cai, Rongxu Cui, Ruowen Zhao, Runqing Wang, Shuhe Huang, Yao Feng, Yinze Rong, Zeyuan Wang, Jun Zhu
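Abstract summary: We present MotuBrain, a unified World Action Model that jointly models video and action under a UniDiffuser formulation. A single model supports policy learning, world modeling, video generation, inverse dynamics, and joint video-action prediction. Building on Motus, MotuBrain also introduces unified multiview modeling and an independent text stream for stronger language-action coupling. Our inference stack combines step reduction, compilation, FP8 quantization, DiT caching, V2A-style action-only inference, and real-time chunked closed-loop execution.
Reference score (independently computed attention score): 23.733029557644354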
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language-Action (VLA) models generalize semantically well but often lack fine-grained modeling of world dynamics. We present MotuBrain, a unified World Action Model that jointly models video and action under a UniDiffuser formulation with a three-stream Mixture-of-Transformers architecture. A single model supports policy learning, world modeling, video generation, inverse dynamics, and joint video-action prediction, while scaling to heterogeneous multimodal data such as video-only, task-agnostic, and cross-embodiment robot data. Building on Motus, MotuBrain further introduces unified multiview modeling, an independent text stream for stronger language-action coupling, a shared cross-embodiment action representation, and an efficient post-training and deployment recipe for long-horizon real-world control. Our inference stack combines step reduction, compilation, FP8 quantization, DiT caching, V2A-style action-only inference, and real-time chunked closed-loop execution, achieving over 50x speedup over a naive baseline and up to 11 Hz inference. Experimentally, MotuBrain achieves 95.8% and 96.1% average success on RoboTwin 2.0 under clean and randomized settings, respectively, attains the strongest reported EWMScore in our WorldArena comparison, and adapts to new humanoid embodiments with only 50-100 trajectories. These results show that unified world action models can scale in generality, predictive accuracy, and real-world deployability.
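Abstract (reference translation): Vision-Language-Action (VLA) models generalize well semantically but often lack fine-grained modeling of world dynamics. We present MotuBrain, a unified World Action Model that jointly models video and action under a UniDiffuser formulation with a three-stream Mixture-of-Transformers architecture. A single model supports policy learning, world modeling, video generation, inverse dynamics, and joint video-action prediction, while scaling to heterogeneous multimodal data such as video-only, task-agnostic, and cross-embodiment robot data. Built on Motus, MotuBrain further introduces unified multiview modeling, an independent text stream for stronger language-action coupling, a shared cross-embodiment action representation, and an efficient post-training and deployment recipe for long-horizon real-world control. Our inference stack combines step reduction, compilation, FP8 quantization, DiT caching, V2A-style action-only inference, and real-time chunked closed-loop execution, achieving a speedup of more than 50x over a naive baseline and inference at up to 11 Hz. Experimentally, MotuBrain achieves average success rates of 95.8% and 96.1% on RoboTwin 2.0 under clean and randomized settings, respectively, attains the strongest reported EWMScore in the WorldArena comparison, and adapts to new humanoid embodiments with only 50-100 trajectories. These results show that unified world action models can scale in generality, predictive accuracy, and real-world deployability.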
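
Illustrative note: the abstract says one model covers policy learning, world modeling, video generation, inverse dynamics, and joint video-action prediction under a UniDiffuser formulation. In the general UniDiffuser recipe, each modality gets its own diffusion timestep: a modality held at the clean timestep acts as a condition, a modality held at the fully noised timestep is effectively marginalized out, and a modality at the running timestep is being denoised. The Python sketch below shows how the five tasks could be expressed as timestep pairs under that recipe. The mode names, the `TimestepPlan` type, the `plan_for_mode` helper, and the [0, 1] timestep convention are all hypothetical and are not taken from the MotuBrain paper or code.

```python
from dataclasses import dataclass

# Assumed convention (not from the paper): timesteps live in [0, 1].
T_CLEAN = 0.0  # modality is supplied clean, so it acts as a condition
T_NOISE = 1.0  # modality is pure noise, so it is effectively marginalized out


@dataclass(frozen=True)
class TimestepPlan:
    """Per-modality diffusion timesteps for one denoising step (hypothetical type)."""
    video_t: float
    action_t: float


def plan_for_mode(mode: str, t: float) -> TimestepPlan:
    """Map a task name to (video, action) timesteps at running timestep t.

    The mapping follows the generic UniDiffuser idea of independent
    per-modality timesteps; it is an assumption, not MotuBrain's actual code.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    plans = {
        "joint": TimestepPlan(t, t),                   # denoise video and action together
        "world_model": TimestepPlan(t, T_CLEAN),       # video given clean actions
        "inverse_dynamics": TimestepPlan(T_CLEAN, t),  # actions given clean video
        "video_generation": TimestepPlan(t, T_NOISE),  # video only, actions ignored
        "policy": TimestepPlan(T_NOISE, t),            # actions only, future video ignored
    }
    if mode not in plans:
        raise KeyError(f"unknown mode: {mode}")
    return plans[mode]


if __name__ == "__main__":
    for mode in ("joint", "world_model", "inverse_dynamics", "video_generation", "policy"):
        print(mode, plan_for_mode(mode, 0.3))
```

Under this reading, switching tasks changes only the timestep inputs, not the network, which is one way a single set of weights could serve all five uses.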
