Fugu-MT Paper Translation (Abstract): Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation

Paper overview: Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation

arxiv url: http://arxiv.org/abs/2512.08186v1
Date: Tue, 09 Dec 2025 02:29:36 GMT
Status: Translation complete
Last updated in system: 2026-03-23 08:17:40.218642
Title: Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation
Title (reference translation): Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation
Authors: Meng Wei, Chenyang Wan, Jiaqi Peng, Xiqian Yu, Yuqiang Yang, Delin Feng, Wenzhe Cai, Chenming Zhu, Tai Wang, Jiangmiao Pang, Xihui Liu
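Abstract summary: This paper proposes DualVLN, a vision-and-language navigation system that integrates high-level reasoning with low-level action execution. System 1 "moves fast" by leveraging both explicit pixel goals and latent features from System 2 to generate smooth and accurate trajectories. The system outperforms prior methods across all VLN benchmarks, and real-world experiments demonstrate robust long-horizon planning and real-time adaptability.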
Reference score (independently computed attention score): 45.54638103934175
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While recent large vision-language models (VLMs) have improved generalization in vision-language navigation (VLN), existing methods typically rely on end-to-end pipelines that map vision-language inputs directly to short-horizon discrete actions. Such designs often produce fragmented motions, incur high latency, and struggle with real-world challenges like dynamic obstacle avoidance. We propose DualVLN, the first dual-system VLN foundation model that synergistically integrates high-level reasoning with low-level action execution. System 2, a VLM-based global planner, "grounds slowly" by predicting mid-term waypoint goals via image-grounded reasoning. System 1, a lightweight, multi-modal conditioning Diffusion Transformer policy, "moves fast" by leveraging both explicit pixel goals and latent features from System 2 to generate smooth and accurate trajectories. The dual-system design enables robust real-time control and adaptive local decision-making in complex, dynamic environments. By decoupling training, the VLM retains its generalization, while System 1 achieves interpretable and effective local navigation. DualVLN outperforms prior methods across all VLN benchmarks, and real-world experiments demonstrate robust long-horizon planning and real-time adaptability in dynamic environments.
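Abstract (reference translation): Recent large vision-language models (VLMs) have improved generalization in vision-language navigation (VLN), but existing methods typically rely on end-to-end pipelines that map vision-language inputs directly to short-horizon discrete actions. Such designs often produce fragmented motions, incur high latency, and struggle with real-world challenges such as dynamic obstacle avoidance. We propose DualVLN, the first dual-system VLN foundation model that synergistically integrates high-level reasoning with low-level action execution. System 2 is a VLM-based global planner that predicts mid-term waypoint goals via image-grounded reasoning. System 1 is a lightweight, multi-modal Diffusion Transformer policy that leverages both explicit pixel goals and latent features from System 2 to generate smooth and accurate trajectories. The dual-system design enables robust real-time control and adaptive local decision-making in complex, dynamic environments. By decoupling training, the VLM retains its generalization, while System 1 achieves interpretable and effective local navigation. DualVLN outperforms prior methods across all VLN benchmarks, and real-world experiments demonstrate robust long-horizon planning and real-time adaptability in dynamic environments.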
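The slow-planner and fast-controller split described in the abstract can be illustrated with a toy two-rate control loop. This is only a minimal sketch under stated assumptions, not the DualVLN implementation: the VLM planner is replaced by a fixed waypoint list, the Diffusion Transformer policy is replaced by a straight-line step, and every name here (`SlowPlanner`, `fast_step`, `run`, `replan_every`, `max_step`, `reach_tol`) is hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class SlowPlanner:
    """Hypothetical stand-in for System 2: hands out mid-term goals."""
    waypoints: List[Point]
    reach_tol: float = 0.1  # assumed distance at which a waypoint counts as reached
    index: int = 0
    calls: int = 0  # how often the slow planner was queried

    def plan(self, position: Point) -> Point:
        self.calls += 1
        # Advance past waypoints that have already been reached.
        while (self.index < len(self.waypoints) - 1
               and math.dist(position, self.waypoints[self.index]) <= self.reach_tol):
            self.index += 1
        return self.waypoints[self.index]


def fast_step(position: Point, goal: Point, max_step: float = 0.25) -> Point:
    """Hypothetical stand-in for System 1: one bounded step toward the goal."""
    dx, dy = goal[0] - position[0], goal[1] - position[1]
    dist = math.hypot(dx, dy)
    if dist <= max_step:
        return goal  # close enough to land on the goal (also covers dist == 0)
    scale = max_step / dist
    return (position[0] + dx * scale, position[1] + dy * scale)


def run(planner: SlowPlanner, start: Point, steps: int, replan_every: int) -> List[Point]:
    """Fast controller acts every tick; slow planner is queried every `replan_every` ticks."""
    position = start
    goal = planner.plan(position)
    trajectory = [position]
    for t in range(1, steps + 1):
        if t % replan_every == 0:
            goal = planner.plan(position)
        position = fast_step(position, goal)
        trajectory.append(position)
    return trajectory


if __name__ == "__main__":
    planner = SlowPlanner(waypoints=[(1.0, 0.0), (1.0, 1.0)])
    path = run(planner, start=(0.0, 0.0), steps=20, replan_every=4)
    print(path[-1], planner.calls)
```

In this sketch the controller keeps acting on the most recent goal between planner calls, which is the property the dual-system design relies on: the expensive component runs at a low rate while the cheap component runs on every tick.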


Related papers