Fugu-MT Paper Translation (Abstract): DriveDreamer-Policy: A Geometry-Grounded World-Action Model for Unified Generation and Planning

Paper overview: DriveDreamer-Policy: A Geometry-Grounded World-Action Model for Unified Generation and Planning

arxiv url: http://arxiv.org/abs/2604.01765v1
Date: Thu, 02 Apr 2026 08:33:18 GMT
Status: Translation complete
Last updated in system: 2026-04-03 14:21:10.616916
Title: DriveDreamer-Policy: A Geometry-Grounded World-Action Model for Unified Generation and Planning
Title (reference translation): DriveDreamer-Policy: A Geometric World-Action Model for Unification and Planning
Authors: Yang Zhou, Xiaofeng Wang, Hao Shao, Letian Wang, Guosheng Zhao, Jiangnan Shao, Jiagang Zhu, Tingdong Yu, Zheng Zhu, Guan Huang, Steven L. Waslander
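Abstract summary: DriveDreamer-Policy is a unified driving world-action model that integrates depth generation, future video generation, and motion planning. The proposed model produces more coherent imagined futures and more informed driving actions while maintaining modularity and controllable latency.
Reference score (independently computed attention score): 44.543763428623976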
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Recently, world-action models (WAMs) have emerged to bridge vision-language-action (VLA) models and world models, unifying the reasoning and instruction-following capabilities of the former with the spatio-temporal world modeling of the latter. However, existing WAM approaches often focus on modeling 2D appearance or latent representations, with limited geometric grounding, an essential element for embodied systems operating in the physical world. We present DriveDreamer-Policy, a unified driving world-action model that integrates depth generation, future video generation, and motion planning within a single modular architecture. The model employs a large language model to process language instructions, multi-view images, and actions, followed by three lightweight generators that produce depth, future video, and actions. By learning a geometry-aware world representation and using it to guide both future prediction and planning within a unified framework, the proposed model produces more coherent imagined futures and more informed driving actions, while maintaining modularity and controllable latency. Experiments on the Navsim v1 and v2 benchmarks demonstrate that DriveDreamer-Policy achieves strong performance on both closed-loop planning and world generation tasks. In particular, our model reaches 89.2 PDMS on Navsim v1 and 88.7 EPDMS on Navsim v2, outperforming existing world-model-based approaches while producing higher-quality future video and depth predictions. Ablation studies further show that explicit depth learning provides complementary benefits to video imagination and improves planning robustness.
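Abstract (reference translation): In recent years, world-action models (WAMs) have bridged vision-language-action (VLA) models and world models, unifying reasoning and instruction-following capabilities with spatio-temporal world modeling. However, existing WAM approaches often concentrate on modeling 2D appearance or latent representations and offer little geometric grounding, which is essential for embodied systems operating in the physical world. DriveDreamer-Policy is a unified driving world-action model that integrates depth generation, future video generation, and motion planning in a single modular architecture. The model uses a large language model to process language instructions, multi-view images, and actions, followed by three lightweight generators that produce depth, future video, and actions. By learning a geometry-aware world representation and using it to guide both future prediction and planning within a unified framework, the proposed model produces more coherent imagined futures and more informed driving actions while maintaining modularity and controllable latency. In experiments on the Navsim v1 and v2 benchmarks, DriveDreamer-Policy achieves strong performance on both closed-loop planning and world generation tasks. In particular, it reaches 89.2 PDMS on Navsim v1 and 88.7 EPDMS on Navsim v2, outperforming existing world-model-based approaches while producing higher-quality future video and depth predictions. Ablation studies show that explicit depth learning brings complementary benefits to video imagination and improves planning robustness.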
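
The data flow described in the abstract (one shared backbone, then three lightweight generators for depth, future video, and actions) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the toy arithmetic, and the `want_video` switch are hypothetical and are not taken from the paper, which the abstract does not describe at this level of detail. The sketch only shows how depth features might feed both video generation and planning, and how a modular design could skip the video generator to reduce latency.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class WorldActionOutput:
    """Container for the three outputs named in the abstract (hypothetical type)."""
    depth: Optional[List[float]] = None
    video: Optional[List[float]] = None
    actions: Optional[List[float]] = None


def encode_context(instruction: str,
                   images: Sequence[Sequence[float]],
                   past_actions: Sequence[float]) -> List[float]:
    """Toy stand-in for the large language model backbone.

    Assumes non-empty inputs and returns a 3-element context vector.
    """
    image_mean = sum(sum(img) / len(img) for img in images) / len(images)
    action_mean = sum(past_actions) / len(past_actions)
    return [float(len(instruction.split())), image_mean, action_mean]


def depth_generator(context: List[float]) -> List[float]:
    """Hypothetical depth head: produces geometry features from the context."""
    return [0.5 * c for c in context]


def video_generator(context: List[float], depth: List[float]) -> List[float]:
    """Hypothetical video head, conditioned on context and depth features."""
    return [c + d for c, d in zip(context, depth)]


def action_generator(context: List[float], depth: List[float]) -> List[float]:
    """Hypothetical planning head, also conditioned on depth features."""
    return [context[2] + depth[1]]


def run_model(instruction: str,
              images: Sequence[Sequence[float]],
              past_actions: Sequence[float],
              want_video: bool = True) -> WorldActionOutput:
    """Run the backbone once, then only the generators that are needed."""
    context = encode_context(instruction, images, past_actions)
    depth = depth_generator(context)
    # Skipping the video head is one way a modular design could trade
    # imagined futures for lower latency (an assumption, not the paper's API).
    video = video_generator(context, depth) if want_video else None
    actions = action_generator(context, depth)
    return WorldActionOutput(depth=depth, video=video, actions=actions)


if __name__ == "__main__":
    result = run_model("turn left at the junction",
                       [[0.2, 0.4], [0.6, 0.8]],
                       [1.0, 3.0],
                       want_video=False)
    print(result)
```

In this sketch the planning output does not depend on whether the video head runs, which is the property that makes a planning-only call possible.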

Related papers