Fugu-MT Paper Translation (Summary): DiT4DiT: Jointly Modeling Video Dynamics and Actions for Generalizable Robot Control

Paper summary: DiT4DiT: Jointly Modeling Video Dynamics and Actions for Generalizable Robot Control

arxiv url: http://arxiv.org/abs/2603.10448v1
Date: Wed, 11 Mar 2026 06:03:53 GMT
Status: Translation complete
Last updated in system: 2026-03-21 18:33:56.666288
Title: DiT4DiT: Jointly Modeling Video Dynamics and Actions for Generalizable Robot Control
Title (reference translation): DiT4DiT: Joint Modeling of Video Dynamics and Actions for General-Purpose Robot Control
Authors: Teli Ma, Jia Zheng, Zifan Wang, Chuili Jiang, Andy Cui, Junwei Liang, Shuo Yang
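Abstract summary: This paper introduces DiT4DiT, an end-to-end Video-Action Model that couples a video Diffusion Transformer with an action Diffusion Transformer. Instead of relying on reconstructed frames, DiT4DiT extracts intermediate denoising features from the video generation process. It achieves state-of-the-art results, reaching average success rates of 98.6% on LIBERO and 50.8% on RoboCasa GR1.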
Reference score (independently computed attention score): 16.562259973551786
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-Language-Action (VLA) models have emerged as a promising paradigm for robot learning, but their representations are still largely inherited from static image-text pretraining, leaving physical dynamics to be learned from comparatively limited action data. Generative video models, by contrast, encode rich spatiotemporal structure and implicit physics, making them a compelling foundation for robotic manipulation. However, their potential has not been fully explored in the literature. To bridge the gap, we introduce DiT4DiT, an end-to-end Video-Action Model that couples a video Diffusion Transformer with an action Diffusion Transformer in a unified cascaded framework. Instead of relying on reconstructed future frames, DiT4DiT extracts intermediate denoising features from the video generation process and uses them as temporally grounded conditions for action prediction. We further propose a dual flow-matching objective with decoupled timesteps and noise scales for video prediction, hidden-state extraction, and action inference, enabling coherent joint training of both modules. Across simulation and real-world benchmarks, DiT4DiT achieves state-of-the-art results, reaching average success rates of 98.6% on LIBERO and 50.8% on RoboCasa GR1 while using substantially less training data. On the Unitree G1 robot, it also delivers superior real-world performance and strong zero-shot generalization. Importantly, DiT4DiT improves sample efficiency by over 10x and speeds up convergence by up to 7x, demonstrating that video generation can serve as an effective scaling proxy for robot policy learning. We release code and models at https://dit4dit.github.io/.
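Abstract (reference translation): Vision-Language-Action (VLA) models have emerged as a promising paradigm for robot learning, but their representations are largely inherited from static image-text pretraining, so physical dynamics must be learned from comparatively limited action data. Generative video models, by contrast, encode rich spatiotemporal structure and implicit physics, which makes them an attractive foundation for robotic manipulation. Their potential, however, has not been fully explored in the literature. To bridge this gap, the paper introduces DiT4DiT, an end-to-end Video-Action Model that couples a video Diffusion Transformer with an action Diffusion Transformer in a unified cascaded framework. Instead of relying on reconstructed future frames, DiT4DiT extracts intermediate denoising features from the video generation process and uses them as temporally grounded conditions for action prediction. The authors further propose a dual flow-matching objective with decoupled timesteps and noise scales for video prediction, hidden-state extraction, and action inference, which enables coherent joint training of both modules. Across simulation and real-world benchmarks, DiT4DiT achieves state-of-the-art results, reaching average success rates of 98.6% on LIBERO and 50.8% on RoboCasa GR1 while using substantially less training data. On the Unitree G1 robot, it also delivers superior real-world performance and strong zero-shot generalization. Importantly, DiT4DiT improves sample efficiency by over 10x and speeds up convergence by up to 7x, demonstrating that video generation can serve as an effective scaling proxy for robot policy learning. Code and models are released at https://dit4dit.github.io/.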
