Fugu-MT Paper Translation (Abstract): Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving

Paper summary: Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving

arxiv url: http://arxiv.org/abs/2605.23163v2
Date: Mon, 25 May 2026 07:32:46 GMT
Status: Translation complete
Last updated in system: 2026-05-26 16:32:38.050065
Title: Fast-dDrive: Efficient Block-Diffusion VLM for Autonomous Driving
Title (reference translation): Fast-dDrive: An Efficient Block-Diffusion VLM for Autonomous Driving
Authors: Kewei Zhang, Jin Wang, Sensen Gao, Chengyue Wu, Yulong Cao, Songyang Han, Boris Ivanovic, Langechuan Liu, Marco Pavone, Song Han, Daquan Zhou, Enze Xie
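Abstract summary: This paper presents Fast-dDrive, a block-diffusion VLA. The authors show that Fast-dDrive redefines the speed-accuracy frontier for driving agents.
Reference score (independently computed attention score): 54.31800246594724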
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: End-to-end autonomous driving via Vision-Language-Action (VLA) models demands a precarious balance between high-fidelity trajectory planning and efficient inference. Existing paradigms typically fall short: autoregressive (AR) VLAs are memory-bandwidth-bound on edge hardware and prone to exposure-bias drift, while full-sequence diffusion models preclude KV-cache reuse and suffer from "logical leakage" that violates the fundamental perceive-then-plan causality. We present Fast-dDrive, a block-diffusion VLA that performs bidirectional refinement within semantic units while enforcing strict causal ordering across them. Leveraging the observation that driving VLAs often emit structured JSON-like outputs, Fast-dDrive freezes structural tokens into a section scaffold and employs a section-aware training recipe that prioritizes safety-critical planning. We further introduce Scaffold Speculative Decoding to achieve AR-equivalent quality at significantly higher throughput. Finally, we propose a low-overhead test-time scaling scheme: by forking $N$ stochastic trajectory rollouts from a single shared-prefix KV cache and averaging them, we effectively suppress prediction variance at a fractional computational cost. Empirical results demonstrate that Fast-dDrive redefines the speed-accuracy frontier for driving agents. On the WOD-E2E test set, Fast-dDrive achieves SOTA ADE@3s and ADE@5s, alongside the highest RFS among diffusion-based VLAs; on nuScenes, it reduces average L2 error to $0.32$m (a $22\%$ improvement). When integrated with SGLang, our framework delivers $12\times$ throughput speedup over the AR baseline, narrowing the gap between high-capacity VLAs and the efficiency demands of real-time on-vehicle deployment.
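Abstract (reference translation): End-to-end autonomous driving with Vision-Language-Action (VLA) models requires a delicate balance between high-fidelity trajectory planning and efficient inference. Autoregressive (AR) VLAs are memory-bandwidth-bound on edge hardware and prone to exposure-bias drift, whereas full-sequence diffusion models prevent KV-cache reuse and suffer from "logical leakage" that violates the basic perceive-then-plan causality. This paper presents Fast-dDrive, a block-diffusion VLA. Because driving VLAs often emit structured JSON-like outputs, Fast-dDrive freezes structural tokens into a section scaffold and uses a section-aware training recipe that prioritizes safety-critical planning. The authors further introduce Scaffold Speculative Decoding to achieve AR-equivalent quality at higher throughput. Finally, forking $N$ stochastic trajectory rollouts from a single shared-prefix KV cache and averaging them suppresses prediction variance at a fraction of the computational cost. Experiments show that Fast-dDrive redefines the speed-accuracy frontier for driving agents. On the WOD-E2E test set, Fast-dDrive achieves SOTA ADE@3s and ADE@5s. Integrated with SGLang, the framework delivers a $12\times$ throughput speedup over the AR baseline, narrowing the gap between high-capacity VLAs and the efficiency demands of real-time on-vehicle deployment.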
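
Illustrative note: the abstract describes averaging $N$ stochastic trajectory rollouts to suppress prediction variance, and reports accuracy as ADE (average displacement error). The sketch below is not the authors' implementation. It only shows the averaging step and the metric on toy 2D waypoints. The helper `sample_rollout` is a hypothetical stand-in for one stochastic decode and assumes zero-mean Gaussian noise around the ground truth; the shared-prefix KV cache is not modeled at all.

```python
import math
import random


def average_rollouts(rollouts):
    """Average N trajectories waypoint by waypoint (each waypoint is an (x, y) pair)."""
    n = len(rollouts)
    horizon = len(rollouts[0])
    return [
        (
            sum(r[t][0] for r in rollouts) / n,
            sum(r[t][1] for r in rollouts) / n,
        )
        for t in range(horizon)
    ]


def ade(pred, truth):
    """Average displacement error: mean Euclidean distance over all waypoints."""
    return sum(math.dist(p, q) for p, q in zip(pred, truth)) / len(truth)


def sample_rollout(truth, sigma, rng):
    """Hypothetical stochastic decode: ground truth plus zero-mean Gaussian noise.

    This is an assumption made for illustration; a real model's rollouts
    would also carry systematic bias that averaging cannot remove.
    """
    return [(x + rng.gauss(0.0, sigma), y + rng.gauss(0.0, sigma)) for x, y in truth]


rng = random.Random(0)
truth = [(float(t), 0.5 * t) for t in range(1, 7)]  # toy ground-truth waypoints
rollouts = [sample_rollout(truth, 0.5, rng) for _ in range(8)]  # N = 8 rollouts

mean_single_ade = sum(ade(r, truth) for r in rollouts) / len(rollouts)
averaged_ade = ade(average_rollouts(rollouts), truth)

print(f"mean ADE of single rollouts: {mean_single_ade:.3f}")
print(f"ADE of averaged trajectory:  {averaged_ade:.3f}")
```

Because the Euclidean norm is convex, the ADE of the averaged trajectory can never exceed the mean ADE of the individual rollouts. How large the gain is in practice depends on how much of the error is variance rather than bias, which this toy setup does not measure.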

Related papers