Fugu-MT Paper Translation (Abstract): Accelerating Vision Foundation Models with Drop-in Depthwise Convolution

Paper abstract: Accelerating Vision Foundation Models with Drop-in Depthwise Convolution

arxiv url: http://arxiv.org/abs/2605.22132v1
Date: Thu, 21 May 2026 08:07:23 GMT
Status: Translation complete
System update date: 2026-05-22 16:35:42.15226
Title: Accelerating Vision Foundation Models with Drop-in Depthwise Convolution
Authors: Carmelo Scribano, Mohammad Mahdi, Nedyalko Prisadnikov, Yuqian Fu, Giorgia Franchini, Danda Pani Paudel, Marko Bertogna, Luc Van Gool
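Abstract summary: We introduce an efficient depthwise convolution-based layer that serves as a drop-in replacement for attention heads with convolution-like behavior. Across both image classification and segmentation tasks, the proposed method achieves a 17-20% inference speedup with minimal performance degradation.
Reference score (attention score computed independently by this site): 51.50107862675191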
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Pretrained vision foundation models deliver strong performance across tasks with limited fine-tuning. However, their Vision Transformer (ViT) backbones impose high inference costs, limiting deployment on resource-constrained devices. In this work, we accelerate large-scale pretrained ViTs while preserving their feature extraction capabilities by exploiting the intrinsic convolution-like behavior of some attention heads. Specifically, we introduce an efficient depthwise convolution-based layer that serves as a drop-in replacement for these heads. Additionally, we propose simple strategies to identify which heads can be replaced and introduce a fine-tuning procedure that recovers downstream task performance. Across both image classification and segmentation tasks, our method achieves a 17-20% inference speedup with minimal performance degradation. We validate the approach through detailed derivations, extensive experiments, and efficiency benchmarks. The reference implementation is publicly available.
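
For readers unfamiliar with the operation named in the abstract, the following is a minimal pure-Python sketch of a depthwise convolution applied to a grid of patch tokens. It is an illustration under stated assumptions, not the paper's layer: the function name `depthwise_conv2d`, the H x W x C token layout, the zero padding, and the example filters are all hypothetical choices made here.

```python
# Illustrative sketch only; NOT the paper's implementation.
# The function name, tensor layout, and zero padding are assumptions.

def depthwise_conv2d(tokens, kernels):
    """Apply one k x k spatial filter per channel to a grid of token vectors.

    tokens:  H x W x C nested lists (patch tokens, C = assumed head dimension)
    kernels: C x k x k nested lists (k odd), one filter per channel
    Zero padding keeps the output the same H x W x C shape as the input.
    """
    height, width, channels = len(tokens), len(tokens[0]), len(tokens[0][0])
    k = len(kernels[0])
    r = k // 2  # kernel radius
    out = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    for i in range(height):
        for j in range(width):
            for c in range(channels):
                acc = 0.0
                for di in range(-r, r + 1):
                    for dj in range(-r, r + 1):
                        y, x = i + di, j + dj
                        # Positions outside the grid contribute zero.
                        if 0 <= y < height and 0 <= x < width:
                            acc += kernels[c][di + r][dj + r] * tokens[y][x][c]
                out[i][j][c] = acc
    return out


# Hypothetical 3 x 3 grid of tokens with 2 channels.
tokens = [[[float(3 * i + j), float(10 * (3 * i + j))] for j in range(3)]
          for i in range(3)]

# Channel 0 uses an identity filter; channel 1 uses a 3 x 3 mean filter.
identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
mean = [[1.0 / 9.0] * 3 for _ in range(3)]
mixed = depthwise_conv2d(tokens, [identity, mean])
```

Each channel is filtered independently, so the work per token depends on the kernel size rather than on the total number of tokens, whereas a self-attention head compares every token with every other token. This is general background on the two operations, not a figure or derivation taken from the paper.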
