Fugu-MT Paper Translation (Abstract): Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model

Paper overview: Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model

arxiv url: http://arxiv.org/abs/2510.27607v2
Date: Tue, 04 Nov 2025 14:46:28 GMT
Status: Translation complete
Last updated in system: 2025-11-05 14:27:17.391104
Title: Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model
Title (reference translation): Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model
Authors: John Won, Kyungmin Lee, Huiwon Jang, Dongyoung Kim, Jinwoo Shin
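Abstract summary: This paper proposes DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict. DUST achieves gains of up to 6% over a standard VLA baseline and implicit world-modeling methods. On real-world tasks with the Franka Research 3, DUST outperforms baselines in success rate by 13%.
Reference score (independently computed attention score): 62.889356203346985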
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recently, augmenting vision-language-action models (VLAs) with world-models has shown promise in robotic policy learning. However, it remains challenging to jointly predict next-state observations and action sequences because of the inherent difference between the two modalities. To address this, we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict and enhances the performance of VLAs across diverse tasks. Specifically, we propose a multimodal diffusion transformer architecture that explicitly maintains separate modality streams while enabling cross-modal knowledge sharing. In addition, we propose training techniques such as independent noise perturbations for each modality and a decoupled flow matching loss, which enables the model to learn the joint distribution in a bidirectional manner while avoiding the need for a unified latent space. Furthermore, based on the decoupled training framework, we introduce a sampling method where we sample action and vision tokens asynchronously at different rates, which shows improvement through inference-time scaling. Through experiments on simulated benchmarks such as RoboCasa and GR-1, DUST achieves up to 6% gains over a standard VLA baseline and implicit world-modeling methods, with our inference-time scaling approach providing an additional 2-5% gain on success rate. On real-world tasks with the Franka Research 3, DUST outperforms baselines in success rate by 13%, confirming its effectiveness beyond simulation. Lastly, we demonstrate the effectiveness of DUST in large-scale pretraining with action-free videos from BridgeV2, where DUST leads to significant gain when transferred to the RoboCasa benchmark.
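Abstract (reference translation): Recently, augmenting vision-language-action models (VLAs) with world models has shown promise in robotic policy learning. However, because of the inherent difference between the two modalities, jointly predicting next-state observations and action sequences remains difficult. To address this, we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict and improves the performance of VLAs across diverse tasks. Specifically, we propose a multimodal diffusion transformer architecture. In addition, we propose training techniques such as independent noise perturbations for each modality and a decoupled flow matching loss, which allow the model to learn the joint distribution bidirectionally while avoiding the need for a unified latent space. Furthermore, building on the decoupled training framework, we introduce a sampling method that samples action tokens and vision tokens asynchronously at different rates, and show improvement through inference-time scaling. Through experiments on simulated benchmarks such as RoboCasa and GR-1, DUST achieves gains of up to 6% over a standard VLA baseline and implicit world-modeling methods. On real-world tasks with the Franka Research 3, DUST outperforms baselines in success rate by 13%, confirming its effectiveness beyond simulation. Finally, we demonstrate the effectiveness of DUST in large-scale pretraining with action-free videos from BridgeV2.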
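
The training idea named in the abstract (independent noise perturbations for each modality combined with a decoupled flow matching loss) can be illustrated with a minimal sketch. This is not the authors' implementation. The linear noise-to-data path, the uniform timestep sampling, the `vision_weight` factor, and every function name below are assumptions made for illustration, since the abstract does not specify these details.

```python
import random

def interpolate(data, noise, t):
    """Point on a linear path from noise (t=0) to data (t=1). Assumed convention."""
    return [(1.0 - t) * n + t * d for d, n in zip(data, noise)]

def mse(pred, target):
    """Mean squared error between two equal-length lists of floats."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

def decoupled_fm_loss(model, action, vision, rng, vision_weight=1.0):
    """Hypothetical decoupled flow matching loss over two modality streams."""
    # Independent noise levels: each modality draws its own timestep.
    t_a, t_v = rng.random(), rng.random()
    eps_a = [rng.gauss(0.0, 1.0) for _ in action]
    eps_v = [rng.gauss(0.0, 1.0) for _ in vision]

    # Perturb each stream separately with its own timestep and noise.
    x_a = interpolate(action, eps_a, t_a)
    x_v = interpolate(vision, eps_v, t_v)

    # For a linear path, the target velocity is (data - noise).
    u_a = [d - n for d, n in zip(action, eps_a)]
    u_v = [d - n for d, n in zip(vision, eps_v)]

    # The model sees both noisy streams and both timesteps, and returns
    # one velocity prediction per stream.
    v_a, v_v = model(x_a, x_v, t_a, t_v)

    # Decoupled loss: one term per modality, summed with a weight.
    return mse(v_a, u_a) + vision_weight * mse(v_v, u_v)

def zero_model(x_a, x_v, t_a, t_v):
    """Placeholder model that predicts zero velocity for both streams."""
    return [0.0] * len(x_a), [0.0] * len(x_v)

if __name__ == "__main__":
    rng = random.Random(0)
    action_chunk = [0.1, -0.2, 0.4]          # toy action vector
    vision_latent = [0.3, 0.0, 0.5, -0.1]    # toy next-observation embedding
    print(decoupled_fm_loss(zero_model, action_chunk, vision_latent, rng))
```

Because the two timesteps are drawn independently, a single model trained this way is exposed to inputs where one modality is nearly clean and the other is nearly pure noise. That is one plausible reading of how the asynchronous sampling of action and vision tokens described in the abstract becomes possible at inference time.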

Related papers