Fugu-MT Paper Translation (Abstract): ST-Prune: Training-Free Spatio-Temporal Token Pruning for Vision-Language Models in Autonomous Driving

Paper overview: ST-Prune: Training-Free Spatio-Temporal Token Pruning for Vision-Language Models in Autonomous Driving

arxiv url: http://arxiv.org/abs/2604.19145v1
Date: Tue, 21 Apr 2026 06:51:08 GMT
Status: Translation complete
Last updated in system: 2026-04-22 22:41:49.658144
Title: ST-Prune: Training-Free Spatio-Temporal Token Pruning for Vision-Language Models in Autonomous Driving
Title (reference translation): ST-Prune: Training-Free Spatio-Temporal Token Pruning for Vision-Language Models in Autonomous Driving
Authors: Lin Sha, Haiyun Guo, Tao Wang, Cong Zhang, Min Huang, Jinqiao Wang, Qinghai Miao
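Abstract summary: We propose ST-Prune, a training-free, plug-and-play framework comprising two complementary modules: Motion-aware Temporal Pruning (MTP) and Ring-view Spatial Pruning (RSP). Together, these two modules constitute a complete spatio-temporal pruning process that preserves key scene information under strict compression. ST-Prune achieves near-lossless performance, with certain metrics surpassing the full-model baseline, while maintaining inference speeds comparable to existing pruning approaches.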
Reference score (independently computed attention score): 31.688411695647357
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language Models (VLMs) have become central to autonomous driving systems, yet their deployment is severely bottlenecked by the massive computational overhead of multi-view camera and multi-frame video inputs. Existing token pruning methods, primarily designed for single-image inputs, treat each frame or view in isolation and thus fail to exploit the inherent spatio-temporal redundancies in driving scenarios. To bridge this gap, we propose ST-Prune, a training-free, plug-and-play framework comprising two complementary modules: Motion-aware Temporal Pruning (MTP) and Ring-view Spatial Pruning (RSP). MTP addresses temporal redundancy by encoding motion volatility and temporal recency as soft constraints within the diversity selection objective, prioritizing dynamic trajectories and current-frame content over static historical background. RSP further resolves spatial redundancy by exploiting the ring-view camera geometry to penalize bilateral cross-view similarity, eliminating duplicate projections and residual background that temporal pruning alone cannot suppress. These two modules together constitute a complete spatio-temporal pruning process, preserving key scene information under strict compression. Validated across four benchmarks spanning perception, prediction, and planning, ST-Prune establishes a new state of the art for training-free token pruning. Notably, even at 90% token reduction, ST-Prune achieves near-lossless performance, with certain metrics surpassing the full-model baseline, while maintaining inference speeds comparable to existing pruning approaches.
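Abstract (reference translation): Vision-Language Models (VLMs) have become central to autonomous driving systems, but their deployment is severely bottlenecked by the computational overhead of multi-view camera and multi-frame video inputs. Existing token pruning methods are designed mainly for single-image inputs and treat each frame or view independently, so they cannot exploit the spatio-temporal redundancy inherent in driving scenarios. To bridge this gap, we propose ST-Prune, which consists of two complementary modules: Motion-aware Temporal Pruning (MTP) and Ring-view Spatial Pruning (RSP). MTP addresses temporal redundancy by encoding motion volatility and temporal recency as soft constraints within the diversity selection objective, prioritizing dynamic trajectories and current-frame content over static historical background. RSP further resolves spatial redundancy by exploiting the ring-view camera geometry to penalize bilateral cross-view similarity, removing duplicate projections and residual background that temporal pruning alone cannot suppress. Together, these two modules constitute a complete spatio-temporal pruning process that preserves key scene information under strict compression. Validated on four benchmarks spanning perception, prediction, and planning, ST-Prune establishes a new state of the art for training-free token pruning. Notably, even at 90% token reduction, ST-Prune achieves near-lossless performance, with certain metrics surpassing the full-model baseline, while maintaining inference speed comparable to existing pruning approaches.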
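
The abstract names the ingredients of the two modules but gives no formulas, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes a greedy max-min diversity selection over cosine similarity, a multiplicative prior built from motion volatility and recency for the temporal stage, and a penalty on similarity to the two neighbouring cameras for the ring-view stage. All function names, the tensor layout `(views, frames, patches, dim)`, and the parameters `alpha`, `decay`, and `beta` are hypothetical.

```python
import numpy as np

def normalize(x):
    """L2-normalize the last axis so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

def greedy_select(feats, prior, keep):
    """Greedily pick `keep` rows, each time maximizing
    prior * (1 - max cosine similarity to the rows already picked)."""
    n = feats.shape[0]
    keep = min(keep, n)
    if keep <= 0:
        return np.array([], dtype=int)
    f = normalize(feats)
    selected = [int(np.argmax(prior))]
    max_sim = f @ f[selected[0]]
    for _ in range(keep - 1):
        score = prior * (1.0 - max_sim)
        score[selected] = -np.inf          # never pick the same token twice
        nxt = int(np.argmax(score))
        selected.append(nxt)
        max_sim = np.maximum(max_sim, f @ f[nxt])
    return np.array(sorted(selected))

def temporal_prior(tokens, alpha=1.0, decay=0.5):
    """Assumed soft prior for one camera; tokens has shape (T, N, D).
    Volatility: mean frame-to-frame feature change per patch position.
    Recency: exponential decay with frame age (1.0 for the current frame)."""
    T, N, _ = tokens.shape
    if T > 1:
        vol = np.linalg.norm(np.diff(tokens, axis=0), axis=-1).mean(axis=0)
        vol = vol / (vol.max() + 1e-8)
    else:
        vol = np.zeros(N)
    recency = np.exp(-decay * (T - 1 - np.arange(T)))
    return (1.0 + alpha * vol)[None, :] * recency[:, None]   # shape (T, N)

def ring_penalty(view_feats, beta=0.5):
    """Assumed cross-view prior: down-weight tokens that closely match
    a token in the left or right neighbour of a ring of cameras."""
    V = len(view_feats)
    normed = [normalize(f) for f in view_feats]
    priors = []
    for v in range(V):
        left, right = normed[(v - 1) % V], normed[(v + 1) % V]
        sim = np.maximum((normed[v] @ left.T).max(axis=1),
                         (normed[v] @ right.T).max(axis=1))
        priors.append(1.0 - beta * np.clip(sim, 0.0, 1.0))
    return priors

def st_prune_sketch(tokens, keep_temporal, keep_final):
    """Two-stage pruning of tokens with shape (V, T, N, D):
    temporal selection per camera, then ring-view selection across cameras."""
    V, T, N, D = tokens.shape
    stage1 = []
    for v in range(V):
        flat = tokens[v].reshape(T * N, D)
        prior = temporal_prior(tokens[v]).reshape(-1)
        stage1.append(flat[greedy_select(flat, prior, keep_temporal)])
    priors = ring_penalty(stage1)
    return [f[greedy_select(f, p, keep_final)] for f, p in zip(stage1, priors)]
```

As a usage example under the same assumptions, calling `st_prune_sketch(tokens, keep_temporal=12, keep_final=5)` on a random array of shape `(6, 3, 16, 8)` returns a list of six arrays of shape `(5, 8)`, that is, 5 of the 48 tokens per camera, roughly the 90% reduction regime the abstract mentions. Running the temporal stage first mirrors the abstract's description of RSP removing redundancy that temporal pruning alone leaves behind.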
