Fugu-MT Paper Translation (Abstract): Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors

Paper overview: Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors

arXiv URL: http://arxiv.org/abs/2604.14563v1
Date: Thu, 16 Apr 2026 02:46:53 GMT
Last updated in system: 2026-04-17 21:29:31.69288
Title: Revisiting Token Compression for Accelerating ViT-based Sparse Multi-View 3D Object Detectors
Authors: Mingqian Ji, Shanshan Zhang, Jian Yang
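Abstract summary: SEPatch3D is a novel framework that dynamically adjusts patch sizes while preserving semantic information within coarse patches. In experiments on the nuScenes and Argoverse 2 validation sets, SEPatch3D achieves up to 57% faster inference than the StreamPETR baseline and 20% higher efficiency than the state-of-the-art ToC3D-faster.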
Reference score (attention metric computed by the site): 18.684602624559734
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision Transformer (ViT)-based sparse multi-view 3D object detectors have achieved remarkable accuracy but still suffer from high inference latency due to heavy token processing. To accelerate these models, token compression has been widely explored. However, our revisit of existing strategies, such as token pruning, merging, and patch size enlargement, reveals that they often discard informative background cues, disrupt contextual consistency, and lose fine-grained semantics, negatively affecting 3D detection. To overcome these limitations, we propose SEPatch3D, a novel framework that dynamically adjusts patch sizes while preserving critical semantic information within coarse patches. Specifically, we design Spatiotemporal-aware Patch Size Selection (SPSS) that assigns small patches to scenes containing nearby objects to preserve fine details and large patches to background-dominated scenes to reduce computation cost. To further mitigate potential detail loss, Informative Patch Selection (IPS) selects the informative patches for feature refinement, and Cross-Granularity Feature Enhancement (CGFE) injects fine-grained details into selected coarse patches, enriching semantic features. Experiments on the nuScenes and Argoverse 2 validation sets show that SEPatch3D achieves up to 57% faster inference than the StreamPETR baseline and 20% higher efficiency than the state-of-the-art ToC3D-faster, while preserving comparable detection accuracy. Code is available at https://github.com/Mingqj/SEPatch3D.
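To make the cost argument behind patch size selection concrete, the following minimal sketch counts ViT patch tokens at two patch sizes. It is not the SEPatch3D implementation. The image resolution, the six-camera rig, the patch sizes 16 and 32, and the `select_patch_size` threshold rule with its `nearby_object_score` input are all hypothetical, chosen only to illustrate the idea. The one fact it relies on is that a ViT with non-overlapping patches of side p produces (H / p) * (W / p) tokens per image.

```python
def vit_token_count(height: int, width: int, patch_size: int) -> int:
    """Patch tokens a ViT produces for one image (class token excluded)."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch_size) * (width // patch_size)


def select_patch_size(nearby_object_score: float, threshold: float = 0.5,
                      small: int = 16, large: int = 32) -> int:
    """Hypothetical stand-in for a per-scene patch size rule.

    Small patches when nearby objects are likely (keep fine detail),
    large patches for background-dominated scenes (fewer tokens).
    """
    return small if nearby_object_score >= threshold else large


# Assumed setup: 6 camera views, each 320 x 800 pixels.
views, height, width = 6, 320, 800

fine_tokens = views * vit_token_count(height, width, select_patch_size(0.9))
coarse_tokens = views * vit_token_count(height, width, select_patch_size(0.1))

print(f"fine patches:   {fine_tokens} tokens")
print(f"coarse patches: {coarse_tokens} tokens")
print(f"token reduction: {fine_tokens / coarse_tokens:.1f}x")
```

Doubling the patch side cuts the token count by a factor of four, which is why coarse patches are attractive for background-dominated scenes and why the abstract pairs them with mechanisms that restore fine-grained detail.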
