Fugu-MT Paper Translation (Summary): Bridging Scene Generation and Planning: Driving with World Model via Unifying Vision and Motion Representation

Paper overview: Bridging Scene Generation and Planning: Driving with World Model via Unifying Vision and Motion Representation

arxiv url: http://arxiv.org/abs/2603.14948v1
Date: Mon, 16 Mar 2026 07:59:39 GMT
Status: Translation complete
Last updated in system: 2026-03-17 16:19:36.145648
Title: Bridging Scene Generation and Planning: Driving with World Model via Unifying Vision and Motion Representation
Title (reference translation): Bridging scene generation and planning: driving with a world model by unifying vision and motion representation
Authors: Xingtai Gui, Meijie Zhang, Tianyi Yan, Wencheng Han, Jiahao Gong, Feiyang Tan, Cheng-zhong Xu, Jianbing Shen
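Abstract summary: We present WorldDrive, a holistic framework that couples scene generation and real-time planning through a unified vision and motion representation. A simple interaction between motion representation, visual representation, and ego status can generate high-quality, multi-modal trajectories. Experiments on the NAVSIM, NAVSIM-v2, and nuScenes benchmarks show that WorldDrive achieves leading planning performance among vision-only methods.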
Reference score (independently computed attention score): 66.7879424097418
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: End-to-end autonomous driving aims to generate safe and plausible planning policies from raw sensor input. Driving world models have shown great potential in learning rich representations by predicting the future evolution of a driving scene. However, existing driving world models primarily focus on visual scene representation, and motion representation is not explicitly designed to be planner-shared and inheritable, leaving a schism between the optimization of visual scene generation and the requirements of precise motion planning. We present WorldDrive, a holistic framework that couples scene generation and real-time planning via unifying vision and motion representation. We first introduce a Trajectory-aware Driving World Model, which conditions on a trajectory vocabulary to enforce consistency between visual dynamics and motion intentions, enabling the generation of diverse and plausible future scenes conditioned on a specific trajectory. We transfer the vision and motion encoders to a downstream Multi-modal Planner, ensuring the driving policy operates on mature representations pre-optimized by scene generation. A simple interaction between motion representation, visual representation, and ego status can generate high-quality, multi-modal trajectories. Furthermore, to exploit the world model's foresight, we propose a Future-aware Rewarder, which distills future latent representation from the frozen world model to evaluate and select optimal trajectories in real-time. Extensive experiments on the NAVSIM, NAVSIM-v2, and nuScenes benchmarks demonstrate that WorldDrive achieves leading planning performance among vision-only methods while maintaining high-fidelity action-controlled video generation capabilities, providing strong evidence for the effectiveness of unifying vision and motion representation for robust autonomous driving.
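Abstract (reference translation): End-to-end autonomous driving aims to generate safe and plausible planning policies from raw sensor input. Driving world models have shown great potential for learning rich representations by predicting how a driving scene will evolve. However, existing driving world models focus mainly on visual scene representation, and their motion representation is not explicitly designed to be shared with and inherited by a planner, leaving a gap between the optimization of visual scene generation and the requirements of precise motion planning. We present WorldDrive, a holistic framework that couples scene generation and real-time planning through a unified vision and motion representation. We first introduce a Trajectory-aware Driving World Model, which is conditioned on a trajectory vocabulary to enforce consistency between visual dynamics and motion intentions. We transfer the vision and motion encoders to a downstream Multi-modal Planner, ensuring that the driving policy operates on mature representations pre-optimized by scene generation. A simple interaction between motion representation, visual representation, and ego status can generate high-quality, multi-modal trajectories. Furthermore, we propose a Future-aware Rewarder, which distills future latent representations from the frozen world model to evaluate and select optimal trajectories in real time. Extensive experiments on the NAVSIM, NAVSIM-v2, and nuScenes benchmarks show that WorldDrive achieves leading planning performance among vision-only methods while maintaining high-fidelity action-controlled video generation, providing strong evidence for the effectiveness of unifying vision and motion representation for robust autonomous driving.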

Related papers