Fugu-MT Paper Translation (Abstract): Uni-World VLA: Interleaved World Modeling and Planning for Autonomous Driving

Paper overview: Uni-World VLA: Interleaved World Modeling and Planning for Autonomous Driving

arxiv url: http://arxiv.org/abs/2603.27287v1
Date: Sat, 28 Mar 2026 14:39:51 GMT
Status: Translation complete
Last updated in system: 2026-03-31 23:18:44.880695
Title: Uni-World VLA: Interleaved World Modeling and Planning for Autonomous Driving
Title (reference translation): Uni-World VLA: Interleaved World Modeling and Planning for Autonomous Driving
Authors: Qiqi Liu, Huan Xu, Jingyu Li, Bin Sun, Zhihui Hao, Dangen She, Xiatian Zhu, Li Zhang
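Abstract summary: We propose Uni-World VLA, a unified vision-language-action model that tightly interleaves future frame prediction and trajectory planning. The proposed method achieves competitive closed-loop planning performance while producing high-fidelity future frame predictions.
Reference score (independently computed attention score): 52.04950569530877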
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Autonomous driving requires reasoning about how the environment evolves and planning actions accordingly. Existing world-model-based approaches typically predict future scenes first and plan afterwards, resulting in open-loop imagination that may drift from the actual decision process. In this paper, we present Uni-World VLA, a unified vision-language-action (VLA) model that tightly interleaves future frame prediction and trajectory planning. Instead of generating a full world rollout before planning, our model alternates between predicting future frames and ego actions step by step, allowing planning decisions to be continuously conditioned on the imagined future observations. This interleaved generation forms a closed-loop interaction between world modeling and control, enabling more adaptive decision-making in dynamic traffic scenarios. In addition, we incorporate monocular depth information into frames to provide stronger geometric cues for world modeling, improving long-horizon scene prediction. Experiments on the NAVSIM benchmark show that our approach achieves competitive closed-loop planning performance while producing high-fidelity future frame predictions. These results demonstrate that tightly coupling world prediction and planning is a promising direction for scalable VLA driving systems.
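Abstract (reference translation): Autonomous driving requires reasoning about how the environment evolves and planning actions accordingly. Existing world-model-based approaches first predict future scenes and then plan. In this paper, we propose Uni-World VLA, a unified vision-language-action (VLA) model that tightly interleaves future frame prediction and trajectory planning. Instead of generating a full world rollout before planning, our model alternates step by step between predicting future frames and ego actions, so that planning decisions can be continuously conditioned on the imagined future observations. This interleaved generation forms a closed-loop interaction between world modeling and control, enabling more adaptive decision-making in dynamic traffic scenarios. In addition, by incorporating monocular depth information into the frames, we provide stronger geometric cues for world modeling and improve long-horizon scene prediction. Experiments on the NAVSIM benchmark show that the proposed method achieves competitive closed-loop planning performance while producing high-fidelity future frame predictions. These results demonstrate that tightly coupling world prediction and planning is a promising direction for scalable VLA driving systems.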
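
The interleaved generation described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`interleaved_rollout`, `toy_predict_frame`, `toy_plan_action`), the list-of-floats frame representation, and the frame-then-action ordering within each step are all assumptions made for illustration. It only shows the control flow in which each planned action is conditioned on the frames imagined so far, and each imagined frame is conditioned on the actions planned so far.

```python
from typing import Callable, List, Tuple

Frame = List[float]            # toy stand-in for an image (plus depth) frame
Action = Tuple[float, float]   # toy ego action, e.g. a (dx, dy) waypoint offset


def interleaved_rollout(
    history: List[Frame],
    predict_frame: Callable[[List[Frame], List[Action]], Frame],
    plan_action: Callable[[List[Frame], List[Action]], Action],
    horizon: int,
) -> Tuple[List[Frame], List[Action]]:
    """Alternate frame prediction and action planning for `horizon` steps.

    Assumption: each step predicts one frame first, then plans one action.
    """
    frames = list(history)          # observed frames plus imagined ones
    imagined: List[Frame] = []
    actions: List[Action] = []
    for _ in range(horizon):
        # World-modeling step: imagine the next frame given everything so far.
        next_frame = predict_frame(frames, actions)
        frames.append(next_frame)
        imagined.append(next_frame)
        # Planning step: the action sees the frame that was just imagined.
        actions.append(plan_action(frames, actions))
    return imagined, actions


def toy_predict_frame(frames: List[Frame], actions: List[Action]) -> Frame:
    # Hypothetical dynamics: shift every value by the last action's dx.
    dx = actions[-1][0] if actions else 0.0
    return [v + dx for v in frames[-1]]


def toy_plan_action(frames: List[Frame], actions: List[Action]) -> Action:
    # Hypothetical policy: push the mean value of the latest frame toward zero.
    mean = sum(frames[-1]) / len(frames[-1])
    return (-0.5 * mean, 0.0)


if __name__ == "__main__":
    imagined, actions = interleaved_rollout(
        [[2.0, 2.0]], toy_predict_frame, toy_plan_action, horizon=3
    )
    print(imagined)
    print(actions)
```

In a predict-then-plan pipeline, by contrast, all `horizon` frames would be generated before any action is planned, so the imagined frames could not react to the planned actions.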

Related papers