Fugu-MT Paper Translation (Abstract): Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving

Paper overview: Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving

arxiv url: http://arxiv.org/abs/2508.13305v1
Date: Mon, 18 Aug 2025 18:47:26 GMT
Status: Translation complete
Last updated in system: 2025-08-20 15:36:31.704298
Title: Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving
Title (reference translation): Prune2Drive: A Plug-and-Play Framework for Accelerating Vision-Language Models in Autonomous Driving
Authors: Minhao Xiong, Zichen Wen, Zhuangcheng Gu, Xuyang Liu, Rui Zhang, Hengrui Kang, Jiabing Yang, Junyuan Zhang, Weijia Li, Conghui He, Yafei Wang, Linfeng Zhang
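Abstract summary: Vision-Language Models (VLMs) have emerged as a promising paradigm in autonomous driving. By jointly modeling visual inputs and natural language instructions, VLMs provide a unified framework for perception, reasoning, and decision-making. The paper proposes Prune2Drive, a plug-and-play visual token pruning framework for multi-view VLMs in autonomous driving.
Reference score (site-computed attention score): 24.2108745917843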
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language Models (VLMs) have emerged as a promising paradigm in autonomous driving (AD), offering a unified framework for perception, reasoning, and decision-making by jointly modeling visual inputs and natural language instructions. However, their deployment is hindered by the significant computational overhead incurred when processing high-resolution, multi-view images, a standard setup in AD systems with six or more synchronized cameras. This overhead stems from the large number of visual tokens generated during encoding, increasing inference latency and memory consumption due to the quadratic complexity of self-attention. To address these challenges, we propose Prune2Drive, a plug-and-play visual token pruning framework for multi-view VLMs in autonomous driving. Prune2Drive introduces two core innovations: (i) a diversity-aware token selection mechanism inspired by farthest point sampling, which prioritizes semantic and spatial coverage across views rather than relying solely on attention scores, and (ii) a view-adaptive pruning controller that learns optimal pruning ratios for each camera view based on their importance to downstream driving tasks. Unlike prior methods, Prune2Drive does not require model retraining or access to attention maps, making it compatible with modern efficient attention implementations. Extensive experiments on two large-scale multi-view driving benchmarks, DriveLM and DriveLMM-o1, show that Prune2Drive achieves significant speedups and memory savings while maintaining or improving task performance. When retaining only 10% of the visual tokens, our method achieves a 6.40$\times$ speedup in the prefilling phase and consumes 13.4% of the original FLOPs, with only a 3% performance drop on the DriveLM benchmark.
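Abstract (reference translation): Vision-Language Models (VLMs) have emerged as a promising paradigm in autonomous driving (AD), providing a unified framework for perception, reasoning, and decision-making by jointly modeling visual inputs and natural language instructions. However, their deployment is hindered by the computational overhead of processing high-resolution multi-view images, the standard setup in AD systems with six or more synchronized cameras. This overhead stems from the large number of visual tokens produced during encoding, which increases inference latency and memory consumption because of the quadratic complexity of self-attention. To address these challenges, we propose Prune2Drive, a plug-and-play visual token pruning framework for multi-view VLMs in autonomous driving. Prune2Drive introduces two core innovations: (i) a diversity-aware token selection mechanism, inspired by farthest point sampling, that prioritizes semantic and spatial coverage across views instead of relying solely on attention scores, and (ii) a view-adaptive pruning controller that learns the optimal pruning ratio for each camera view according to its importance to downstream driving tasks. Unlike prior methods, Prune2Drive requires neither model retraining nor access to attention maps, which makes it compatible with modern efficient attention implementations. Extensive experiments on two large-scale multi-view driving benchmarks, DriveLM and DriveLMM-o1, show that Prune2Drive delivers significant speedups and memory savings while maintaining or improving task performance. When only 10% of the visual tokens are retained, the method achieves a 6.40$\times$ speedup in the prefilling phase and consumes 13.4% of the original FLOPs, with only a 3% performance drop on the DriveLM benchmark.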
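
The two ideas named in the abstract, diversity-aware selection inspired by farthest point sampling and a per-view pruning ratio, can be illustrated with a small sketch. This is not the authors' implementation. It assumes Euclidean distance on raw token features, a fixed starting token, and hand-set keep ratios (the paper describes a controller that learns them). The names `farthest_point_sampling`, `prune_views`, `views`, and `keep_ratios` are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def farthest_point_sampling(
    tokens: List[Sequence[float]], keep: int, start: int = 0
) -> List[int]:
    """Greedy farthest point sampling over token features.

    Assumptions (not taken from the paper): Euclidean distance and a
    fixed starting index. Returns the sorted indices of kept tokens.
    """
    n = len(tokens)
    if keep <= 0 or n == 0:
        return []
    keep = min(keep, n)
    selected = [start]
    # min_dist[i]: distance from token i to its nearest selected token.
    min_dist = [euclidean(t, tokens[start]) for t in tokens]
    min_dist[start] = -1.0  # mark as taken so it is never picked again
    while len(selected) < keep:
        # Pick the token farthest from everything selected so far.
        nxt = max(range(n), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i in range(n):
            d = euclidean(tokens[i], tokens[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
        min_dist[nxt] = -1.0
    return sorted(selected)


def prune_views(
    views: Dict[str, List[Sequence[float]]], keep_ratios: Dict[str, float]
) -> Dict[str, List[int]]:
    """Apply a per-view keep ratio, then run FPS inside each view."""
    kept = {}
    for name, tokens in views.items():
        budget = max(1, round(len(tokens) * keep_ratios[name]))
        kept[name] = farthest_point_sampling(tokens, budget)
    return kept


# Toy 2-D "token features" for two camera views (illustrative values only).
views = {
    "front": [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)],
    "back": [(0.0, 0.0), (0.0, 0.1), (2.0, 0.0), (0.0, 2.0)],
}
keep_ratios = {"front": 0.5, "back": 0.25}  # hypothetical, hand-set ratios
kept = prune_views(views, keep_ratios)
print(kept)
```

Because the selection depends only on token features, a sketch like this needs no attention scores, which matches the abstract's point about compatibility with efficient attention kernels that do not expose attention maps.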

Related papers