Fugu-MT Paper Translation (Abstract): Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model

Paper abstract: Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model

arxiv url: http://arxiv.org/abs/2510.27607v1
Date: Fri, 31 Oct 2025 16:32:12 GMT
Status: Translation complete
System update date: 2025-11-03 17:52:16.166751
Title: Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model
Title (reference translation): Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model
Authors: John Won, Kyungmin Lee, Huiwon Jang, Dongyoung Kim, Jinwoo Shin
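Abstract summary: This paper proposes DUal-STream diffusion (DUST) to address modality conflict and improve the performance of Vision-Language-Action models. DUST achieves gains of up to 6% over baseline methods, and its test-time scaling approach provides an additional 2-5% boost. On real-world tasks with the Franka Research 3, DUST improves success rates by 13%, confirming its effectiveness beyond simulation.
Reference score (independently computed attention score): 62.889356203346985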
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recently, augmenting Vision-Language-Action models (VLAs) with world modeling has shown promise in improving robotic policy learning. However, it remains challenging to jointly predict next-state observations and action sequences because of the inherent difference between the two modalities. To address this, we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict and enhances the performance of VLAs across diverse tasks. Specifically, we propose a multimodal diffusion transformer architecture that explicitly maintains separate modality streams while still enabling cross-modal knowledge sharing. In addition, we introduce independent noise perturbations for each modality and a decoupled flow-matching loss. This design enables the model to learn the joint distribution in a bidirectional manner while avoiding the need for a unified latent space. Based on the decoupling of modalities during training, we also introduce a joint sampling method that supports test-time scaling, where action and vision tokens evolve asynchronously at different rates. Through experiments on simulated benchmarks such as RoboCasa and GR-1, DUST achieves up to 6% gains over baseline methods, while our test-time scaling approach provides an additional 2-5% boost. On real-world tasks with the Franka Research 3, DUST improves success rates by 13%, confirming its effectiveness beyond simulation. Furthermore, pre-training on action-free videos from BridgeV2 yields significant transfer gains on RoboCasa, underscoring DUST's potential for large-scale VLA pretraining.
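Abstract (reference translation): In recent years, augmenting Vision-Language-Action models (VLAs) with world modeling has shown promise for improving robotic policy learning. However, jointly predicting next-state observations and action sequences remains difficult because of the inherent difference between the two modalities. To address this, we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict and improves VLA performance across diverse tasks. Specifically, we propose a multimodal diffusion transformer architecture that explicitly maintains separate modality streams while still enabling cross-modal knowledge sharing. In addition, we introduce independent noise perturbations for each modality and a decoupled flow-matching loss. This design lets the model learn the joint distribution bidirectionally while avoiding the need for a unified latent space. Building on the decoupling of modalities during training, we also introduce a joint sampling method that supports test-time scaling, in which action and vision tokens evolve asynchronously at different rates. In experiments on simulated benchmarks such as RoboCasa and GR-1, DUST achieves gains of up to 6% over baseline methods, and the test-time scaling approach provides an additional 2-5% boost. On real-world tasks with the Franka Research 3, DUST improves success rates by 13%, confirming its effectiveness beyond simulation. Furthermore, pre-training on action-free videos from BridgeV2 yields significant transfer gains on RoboCasa, underscoring DUST's potential for large-scale VLA pretraining.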
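The abstract names two mechanisms without giving formulas: independent noise perturbations per modality with a decoupled flow-matching loss, and joint sampling in which the two streams advance at different rates. The sketch below is one possible reading of those ideas, not the authors' implementation. Everything in it is an assumption made for illustration: the linear noise-to-data path, the uniform timestep sampling, the `obs_weight` factor, the `model(a, o, t_a, t_o)` signature returning one velocity per stream, and the choice to give the vision stream the finer steps.

```python
import random

def interpolate(data, noise, t):
    """Point on the straight path from noise (t=0) to data (t=1). Assumed path."""
    return [(1.0 - t) * n + t * d for d, n in zip(data, noise)]

def mse(pred, target):
    """Mean squared error between two equal-length lists."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)

def decoupled_fm_loss(model, action, obs, rng, obs_weight=1.0):
    """Flow-matching loss with a separate timestep and noise draw per modality.

    `model` is a hypothetical callable returning (v_action, v_obs).
    """
    t_a, t_o = rng.random(), rng.random()      # independent noise levels
    eps_a = [rng.gauss(0.0, 1.0) for _ in action]
    eps_o = [rng.gauss(0.0, 1.0) for _ in obs]
    a_t = interpolate(action, eps_a, t_a)
    o_t = interpolate(obs, eps_o, t_o)
    v_a, v_o = model(a_t, o_t, t_a, t_o)
    # On a linear path the target velocity is simply data minus noise.
    target_a = [d - n for d, n in zip(action, eps_a)]
    target_o = [d - n for d, n in zip(obs, eps_o)]
    # Decoupled: each stream gets its own regression term.
    return mse(v_a, target_a) + obs_weight * mse(v_o, target_o)

def async_sample(model, a0, o0, coarse_steps, ratio):
    """Euler sampling in which the vision stream takes `ratio` small steps
    for every single large step of the action stream (assumed schedule)."""
    a, o = list(a0), list(o0)
    fine_steps = coarse_steps * ratio
    dt_a = 1.0 / coarse_steps
    dt_o = 1.0 / fine_steps
    for i in range(fine_steps):
        t_o = i * dt_o
        t_a = (i // ratio) * dt_a              # action time only moves every `ratio` ticks
        v_a, v_o = model(a, o, t_a, t_o)
        o = [x + dt_o * v for x, v in zip(o, v_o)]
        if i % ratio == ratio - 1:
            a = [x + dt_a * v for x, v in zip(a, v_a)]
    return a, o
```

Under these assumptions both streams still reach t = 1 at the end of sampling; raising `ratio` spends more model evaluations on the vision stream, which is one way to read the "test-time scaling" described in the abstract.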

Related papers