Fugu-MT Paper Translation (Abstract): Data-independent Module-aware Pruning for Hierarchical Vision Transformers

Paper abstract: Data-independent Module-aware Pruning for Hierarchical Vision Transformers

arxiv url: http://arxiv.org/abs/2404.13648v1
Date: Sun, 21 Apr 2024 12:50:38 GMT
Status: Translation complete
Last updated in system: 2024-04-23 18:01:50.411184
Title: Data-independent Module-aware Pruning for Hierarchical Vision Transformers
Title (reference translation): Data-independent Module-aware Pruning for Hierarchical Vision Transformers
Authors: Yang He, Joey Tianyi Zhou
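Abstract summary: Hierarchical vision transformers (ViTs) have two advantages over conventional ViTs. First, hierarchical ViTs achieve linear computational complexity with respect to image size through local self-attention. Second, hierarchical ViTs create hierarchical feature maps by merging image patches in deeper layers for dense prediction.
Reference score (independently computed attention metric): 41.92794134275854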
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Hierarchical vision transformers (ViTs) have two advantages over conventional ViTs. First, hierarchical ViTs achieve linear computational complexity with respect to image size by local self-attention. Second, hierarchical ViTs create hierarchical feature maps by merging image patches in deeper layers for dense prediction. However, existing pruning methods ignore the unique properties of hierarchical ViTs and use the magnitude value as the weight importance. This approach leads to two main drawbacks. First, the "local" attention weights are compared at a "global" level, which may cause some "locally" important weights to be pruned due to their relatively small magnitude "globally". The second issue with magnitude pruning is that it fails to consider the distinct weight distributions of the network, which are essential for extracting coarse to fine-grained features at various hierarchical levels. To solve the aforementioned issues, we have developed a Data-independent Module-Aware Pruning method (DIMAP) to compress hierarchical ViTs. To ensure that "local" attention weights at different hierarchical levels are compared fairly in terms of their contribution, we treat them as a module and examine their contribution by analyzing their information distortion. Furthermore, we introduce a novel weight metric that is solely based on weights and does not require input images, thereby eliminating the dependence on the patch merging process. Our method validates its usefulness and strengths on Swin Transformers of different sizes on ImageNet-1k classification. Notably, the top-5 accuracy drop is only 0.07% when we remove 52.5% FLOPs and 52.7% parameters of Swin-B. When we reduce 33.2% FLOPs and 33.2% parameters of Swin-S, we can even achieve a 0.8% higher relative top-5 accuracy than the original model. Code is available at: https://github.com/he-y/Data-independent-Module-Aware-Pruning
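Abstract (reference translation): Hierarchical vision transformers (ViTs) have two advantages over conventional ViTs. First, hierarchical ViTs achieve linear computational complexity with respect to image size through local self-attention. Second, hierarchical ViTs create hierarchical feature maps by merging image patches in deeper layers for dense prediction. However, existing pruning methods ignore the unique properties of hierarchical ViTs and use the magnitude value as the weight importance. This approach has two main drawbacks. First, "local" attention weights are compared at a "global" level, so weights that are important "locally" may be pruned because their magnitude is relatively small "globally". Second, magnitude pruning fails to consider the distinct weight distributions of the network, which are essential for extracting coarse to fine-grained features at the various hierarchical levels. To solve these issues, the authors developed a Data-independent Module-Aware Pruning method (DIMAP) to compress hierarchical ViTs. So that "local" attention weights at different hierarchical levels are compared fairly in terms of their contribution, the weights are treated as a module and their contribution is examined by analyzing their information distortion. In addition, a new weight metric based solely on weights, requiring no input images, removes the dependence on the patch merging process. The method's usefulness and strengths are validated on Swin Transformers of different sizes on ImageNet-1k classification. Notably, the top-5 accuracy drop is only 0.07% when 52.5% of FLOPs and 52.7% of parameters are removed from Swin-B. When 33.2% of FLOPs and 33.2% of parameters are removed from Swin-S, the relative top-5 accuracy is even 0.8% higher than that of the original model. Code: https://github.com/he-y/Data-independent-Module-Aware-Pruning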
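
The first drawback described in the abstract (comparing "local" weights at a "global" level) can be illustrated with a small NumPy sketch. This is not DIMAP and does not use the paper's information-distortion metric, which the abstract does not specify; it only contrasts plain magnitude pruning with one shared threshold against magnitude pruning with a separate threshold per module. The module names `stage1_attn` and `stage4_attn`, the weight scales, and both function names are hypothetical and chosen for illustration only.

```python
import numpy as np


def _kth_smallest_threshold(magnitudes, sparsity):
    """Return the magnitude at or below which weights are pruned."""
    k = int(sparsity * magnitudes.size)  # number of weights to prune
    if k == 0:
        return -np.inf  # prune nothing
    return np.sort(magnitudes)[k - 1]


def global_magnitude_masks(modules, sparsity):
    """One shared threshold computed over the weights of ALL modules."""
    all_mags = np.concatenate([np.abs(w).ravel() for w in modules.values()])
    threshold = _kth_smallest_threshold(all_mags, sparsity)
    return {name: np.abs(w) > threshold for name, w in modules.items()}


def per_module_magnitude_masks(modules, sparsity):
    """A separate threshold for each module, so scales are not mixed."""
    masks = {}
    for name, w in modules.items():
        threshold = _kth_smallest_threshold(np.abs(w).ravel(), sparsity)
        masks[name] = np.abs(w) > threshold
    return masks


# Hypothetical weights: two attention modules whose scales differ by 100x.
rng = np.random.default_rng(0)
modules = {
    "stage1_attn": rng.normal(0.0, 0.01, size=(64, 64)),
    "stage4_attn": rng.normal(0.0, 1.0, size=(64, 64)),
}

global_masks = global_magnitude_masks(modules, sparsity=0.5)
local_masks = per_module_magnitude_masks(modules, sparsity=0.5)

for name in modules:
    print(
        f"{name}: kept {global_masks[name].mean():.1%} with a global threshold, "
        f"{local_masks[name].mean():.1%} with a per-module threshold"
    )
```

With a shared threshold, almost all of the small-scale module is removed even though its weights may matter within that module; a per-module threshold keeps the same fraction in each module. According to the abstract, DIMAP addresses this by treating the weights as a module and ranking them by information distortion rather than by raw magnitude.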

Related papers