Fugu-MT Paper Translation (Abstract): DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale

Paper summary: DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale

arxiv url: http://arxiv.org/abs/2604.00813v1
Date: Wed, 01 Apr 2026 12:21:26 GMT
Status: Translation complete
Last updated in system: 2026-04-02 16:44:31.979753
Title: DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale
Title (reference translation): DVGT-2: Vision-Geometry-Action Model for Large-Scale Autonomous Driving
Authors: Sicheng Zuo, Zixun Xie, Wenzhao Zheng, Shaoqing Xu, Fang Li, Hanbing Li, Long Chen, Zhi-Xin Yang, Jiwen Lu
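Abstract summary: This paper proposes a Vision-Geometry-Action paradigm that advocates dense 3D geometry as the critical cue for autonomous driving. It introduces a streaming Driving Visual Geometry Transformer (DVGT-2), which processes inputs online and jointly outputs dense geometry and trajectory planning for the current frame. Despite its faster speed, DVGT-2 achieves superior geometry reconstruction performance on various datasets.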
Reference score (independently computed interest level): 63.05446464787182
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: End-to-end autonomous driving has evolved from the conventional paradigm based on sparse perception into vision-language-action (VLA) models, which focus on learning language descriptions as an auxiliary task to facilitate planning. In this paper, we propose an alternative Vision-Geometry-Action (VGA) paradigm that advocates dense 3D geometry as the critical cue for autonomous driving. As vehicles operate in a 3D world, we think dense 3D geometry provides the most comprehensive information for decision-making. However, most existing geometry reconstruction methods (e.g., DVGT) rely on computationally expensive batch processing of multi-frame inputs and cannot be applied to online planning. To address this, we introduce a streaming Driving Visual Geometry Transformer (DVGT-2), which processes inputs in an online manner and jointly outputs dense geometry and trajectory planning for the current frame. We employ temporal causal attention and cache historical features to support on-the-fly inference. To further enhance efficiency, we propose a sliding-window streaming strategy and use historical caches within a certain interval to avoid repetitive computations. Despite the faster speed, DVGT-2 achieves superior geometry reconstruction performance on various datasets. The same trained DVGT-2 can be directly applied to planning across diverse camera configurations without fine-tuning, including closed-loop NAVSIM and open-loop nuScenes benchmarks.
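Abstract (reference translation): End-to-end autonomous driving has evolved from the conventional paradigm based on sparse perception into vision-language-action models, which focus on learning language descriptions as an auxiliary task that facilitates planning. This paper proposes a Vision-Geometry-Action (VGA) paradigm that advocates dense 3D geometry as the critical cue for autonomous driving. Because vehicles operate in a 3D world, the authors argue that dense 3D geometry provides the most comprehensive information for decision-making. However, most existing geometry reconstruction methods (e.g., DVGT) rely on computationally expensive batch processing of multi-frame inputs and cannot be applied to online planning. To address this, the authors introduce the streaming Driving Visual Geometry Transformer (DVGT-2), which processes inputs online and jointly outputs dense geometry and trajectory planning for the current frame. Temporal causal attention and a cache of historical features support on-the-fly inference. To improve efficiency further, a sliding-window streaming strategy uses historical caches within a certain interval to avoid repeated computation. Despite its faster speed, DVGT-2 achieves superior geometry reconstruction performance on various datasets. The same trained DVGT-2 can be applied directly to planning across diverse camera configurations without fine-tuning, including the closed-loop NAVSIM and open-loop nuScenes benchmarks.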
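
The abstract mentions temporal causal attention over cached historical features, combined with a sliding window that bounds how much history is kept. The following is a minimal toy sketch of that general idea, not the DVGT-2 implementation. The class name `StreamingCausalAttention`, the one-vector-per-frame features, the identity query/key/value projections, and the window size are all assumptions made for illustration.

```python
import math
from collections import deque


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class StreamingCausalAttention:
    """Toy single-head attention over a sliding window of cached frames.

    Hypothetical illustration: one feature vector per frame, identity
    projections for query/key/value, and a fixed window size.
    """

    def __init__(self, window):
        # deque(maxlen=...) evicts the oldest frame once the window is full.
        self.cache = deque(maxlen=window)

    def step(self, feature):
        """Process one new frame and return its attended feature."""
        # Cache the current frame first so it can attend to itself.
        self.cache.append(list(feature))
        scale = math.sqrt(len(feature))
        # Only cached (past and current) frames are visible, so the
        # attention is causal by construction: no future frame exists yet.
        scores = [
            sum(q * k for q, k in zip(feature, key)) / scale
            for key in self.cache
        ]
        weights = softmax(scores)
        # Weighted sum of cached values (values equal keys in this toy).
        return [
            sum(w * value[d] for w, value in zip(weights, self.cache))
            for d in range(len(feature))
        ]


attn = StreamingCausalAttention(window=3)
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [2.0, 0.0]]
outputs = [attn.step(f) for f in frames]
```

Each call to `step` reuses the features already in the cache instead of recomputing them for earlier frames, and the bounded `deque` keeps the per-frame cost constant no matter how long the stream runs. A real model would cache per-layer key/value tensors for many tokens per frame rather than a single vector.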


Related papers