Fugu-MT Paper Translation (Abstract): VERDI: VLM-Embedded Reasoning for Autonomous Driving

Paper abstract: VERDI: VLM-Embedded Reasoning for Autonomous Driving

arxiv url: http://arxiv.org/abs/2505.15925v1
Date: Wed, 21 May 2025 18:24:36 GMT
Status: Translation complete
Last updated in system: 2025-05-23 17:12:47.861669
Title: VERDI: VLM-Embedded Reasoning for Autonomous Driving
Title (reference translation): VERDI: VLM-Embedded Reasoning for Autonomous Driving
Authors: Bowen Feng, Zhiting Mei, Baiang Li, Julian Ost, Roger Girgis, Anirudha Majumdar, Felix Heide
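Abstract summary: We propose VLM-Embedded Reasoning for autonomous Driving (VERDI). VERDI is a training-time framework that distills the reasoning process and commonsense knowledge of VLMs into the AD stack. We demonstrate the method's effectiveness on the NuScenes dataset and show that VERDI outperforms existing e2e methods.
Reference score (independently computed attention score): 33.66777025242027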
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While autonomous driving (AD) stacks struggle with decision making under partial observability and real-world complexity, human drivers are capable of commonsense reasoning to make near-optimal decisions with limited information. Recent work has attempted to leverage finetuned Vision-Language Models (VLMs) for trajectory planning at inference time to emulate human behavior. Despite their success in benchmark evaluations, these methods are often impractical to deploy (a 70B parameter VLM inference at merely 8 tokens per second requires more than 160G of memory), and their monolithic network structure prohibits safety decomposition. To bridge this gap, we propose VLM-Embedded Reasoning for autonomous Driving (VERDI), a training-time framework that distills the reasoning process and commonsense knowledge of VLMs into the AD stack. VERDI augments modular differentiable end-to-end (e2e) AD models by aligning intermediate module outputs at the perception, prediction, and planning stages with text features explaining the driving reasoning process produced by VLMs. By encouraging alignment in latent space, VERDI enables the modular AD stack to internalize structured reasoning, without incurring the inference-time costs of large VLMs. We demonstrate the effectiveness of our method on the NuScenes dataset and find that VERDI outperforms existing e2e methods that do not embed reasoning by 10% in $\ell_{2}$ distance, while maintaining high inference speed.
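Abstract (reference translation): Autonomous driving (AD) stacks struggle with decision making under partial observability and real-world complexity, whereas human drivers can use commonsense reasoning to make near-optimal decisions from limited information. Recent work has used finetuned Vision-Language Models (VLMs) for trajectory planning at inference time to emulate human behavior. Despite succeeding in benchmark evaluations, these methods are often impractical to deploy (inference with a 70B-parameter VLM at only 8 tokens per second requires more than 160G of memory), and their monolithic network structure prevents safety decomposition. To bridge this gap, we propose VLM-Embedded Reasoning for autonomous Driving (VERDI), a training-time framework that distills the reasoning process and commonsense knowledge of VLMs into the AD stack. VERDI extends modular, differentiable end-to-end (e2e) AD models by aligning the intermediate module outputs of the perception, prediction, and planning stages with text features, produced by VLMs, that explain the driving reasoning process. Encouraging alignment in latent space lets the modular AD stack internalize structured reasoning without incurring the inference-time cost of large VLMs. We demonstrate the method's effectiveness on the NuScenes dataset and show that VERDI outperforms existing e2e methods that do not embed reasoning by 10% in $\ell_{2}$ distance while maintaining high inference speed.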
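
Illustrative sketch: The abstract says that intermediate outputs of the perception, prediction, and planning modules are aligned with VLM text features in latent space, but it does not state the loss that is used. The sketch below assumes a mean cosine-distance objective purely for illustration. The names `alignment_loss`, `module_features`, and `text_features`, and the toy three-dimensional vectors, are hypothetical and do not come from the paper.

```python
import math

# Stage names taken from the abstract; everything else here is an assumption.
STAGES = ("perception", "prediction", "planning")


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_loss(module_features, text_features):
    """Mean of (1 - cosine similarity) over the three stages.

    Assumed objective: 0 when every module feature points in the same
    direction as its text feature, larger as the directions diverge.
    """
    per_stage = [
        1.0 - cosine_similarity(module_features[s], text_features[s])
        for s in STAGES
    ]
    return sum(per_stage) / len(per_stage)


# Toy latent vectors (hypothetical values, already projected to a shared space).
module_features = {
    "perception": [1.0, 0.0, 0.0],
    "prediction": [0.0, 1.0, 0.0],
    "planning": [1.0, 1.0, 0.0],
}
text_features = {
    "perception": [1.0, 0.0, 0.0],  # same direction: stage loss near 0
    "prediction": [1.0, 0.0, 0.0],  # orthogonal: stage loss near 1
    "planning": [2.0, 2.0, 0.0],    # same direction, different scale: near 0
}

loss = alignment_loss(module_features, text_features)
print(f"alignment loss: {loss:.3f}")
```

Under this assumption the text features are needed only while training, which is consistent with the abstract's claim that no VLM runs at inference time. In practice such a term would be added to the driving model's usual training losses.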

Related papers