Fugu-MT Paper Translation (Abstract): Making Vision Transformers Efficient from A Token Sparsification View

Paper abstract: Making Vision Transformers Efficient from A Token Sparsification View

arxiv url: http://arxiv.org/abs/2303.08685v1
Date: Wed, 15 Mar 2023 15:12:36 GMT
Status: Translation complete
Last updated in system: 2023-03-16 13:23:54.191919
Title: Making Vision Transformers Efficient from A Token Sparsification View
Title (reference translation): Making vision transformers efficient through token sparsification
Authors: Shuning Chang, Pichao Wang, Ming Lin, Fan Wang, David Junhao Zhang, Rong Jin, Mike Zheng Shou
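Abstract summary: This paper proposes a novel Semantic Token ViT (STViT) for global and local vision transformers. Owing to the cluster properties, a few semantic tokens can attain the same effect as a vast number of image tokens in both global and local vision transformers. The proposed method reduces FLOPs by over 30% compared with the original networks in object detection and instance segmentation.
Reference score (independently computed attention score): 26.42498120556985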
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The quadratic computational complexity to the number of tokens limits the practical applications of Vision Transformers (ViTs). Several works propose to prune redundant tokens to achieve efficient ViTs. However, these methods generally suffer from (i) dramatic accuracy drops, (ii) application difficulty in the local vision transformer, and (iii) non-general-purpose networks for downstream tasks. In this work, we propose a novel Semantic Token ViT (STViT), for efficient global and local vision transformers, which can also be revised to serve as backbone for downstream tasks. The semantic tokens represent cluster centers, and they are initialized by pooling image tokens in space and recovered by attention, which can adaptively represent global or local semantic information. Due to the cluster properties, a few semantic tokens can attain the same effect as vast image tokens, for both global and local vision transformers. For instance, only 16 semantic tokens on DeiT-(Tiny,Small,Base) can achieve the same accuracy with more than 100% inference speed improvement and nearly 60% FLOPs reduction; on Swin-(Tiny,Small,Base), we can employ 16 semantic tokens in each window to further speed it up by around 20% with slight accuracy increase. Besides great success in image classification, we also extend our method to video recognition. In addition, we design a STViT-R(ecover) network to restore the detailed spatial information based on the STViT, making it work for downstream tasks, which is powerless for previous token sparsification methods. Experiments demonstrate that our method can achieve competitive results compared to the original networks in object detection and instance segmentation, with over 30% FLOPs reduction for backbone.
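Abstract (reference translation): The quadratic computational complexity with respect to the number of tokens limits the practical applications of Vision Transformers (ViTs). Several works propose pruning redundant tokens to achieve efficient ViTs. However, these methods generally suffer from (i) dramatic accuracy drops, (ii) difficulty of application to local vision transformers, and (iii) non-general-purpose networks for downstream tasks. In this work, we propose a novel Semantic Token ViT (STViT) for efficient global and local vision transformers. The semantic tokens represent cluster centers; they are initialized by pooling image tokens in space and can adaptively represent global or local semantic information. Owing to the cluster properties, a few semantic tokens can achieve the same effect as a vast number of image tokens in both global and local vision transformers. For instance, 16 semantic tokens on DeiT-(Tiny,Small,Base) achieve the same accuracy with more than 100% inference speed improvement and nearly 60% FLOPs reduction. Besides the strong results in image classification, we also extend the method to video recognition. In addition, we design an STViT-R(ecover) network that restores detailed spatial information based on STViT, making it work for downstream tasks where previous token sparsification methods are powerless. Experiments show that the method reduces FLOPs by over 30% compared with the original networks in object detection and instance segmentation.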
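
The abstract says the semantic tokens are initialized by pooling image tokens in space and then recovered by attention. The following is a minimal sketch of that idea only, not the authors' implementation: the function names `pool_tokens` and `cross_attention`, the 12 x 12 grid, and the embedding size of 8 are all hypothetical, and the attention step omits the learned projections a real transformer layer would have.

```python
import math
import random


def pool_tokens(tokens, grid, out):
    """Average-pool a grid x grid map of token vectors down to out x out.

    Assumes `grid` is divisible by `out` (an illustrative simplification).
    """
    step = grid // out
    dim = len(tokens[0])
    pooled = []
    for by in range(out):
        for bx in range(out):
            # Collect the step x step block of tokens that falls into this cell.
            cell = [tokens[(by * step + dy) * grid + (bx * step + dx)]
                    for dy in range(step) for dx in range(step)]
            pooled.append([sum(t[d] for t in cell) / len(cell) for d in range(dim)])
    return pooled


def cross_attention(queries, keys_values):
    """Single-head softmax attention without learned projections (assumption)."""
    dim = len(queries[0])
    updated = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        top = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        updated.append([sum(w * k[d] for w, k in zip(weights, keys_values)) / total
                        for d in range(dim)])
    return updated


random.seed(0)
GRID, DIM, OUT = 12, 8, 4  # hypothetical sizes; 4 x 4 gives 16 semantic tokens
image_tokens = [[random.gauss(0, 1) for _ in range(DIM)]
                for _ in range(GRID * GRID)]

# Step 1: initialize the semantic tokens (cluster centers) by spatial pooling.
semantic = pool_tokens(image_tokens, GRID, OUT)
# Step 2: refine them by attending over all image tokens.
semantic = cross_attention(semantic, image_tokens)

# Self-attention compares every token with every other token, so the number
# of pairwise scores grows quadratically with the token count.
full_pairs = (GRID * GRID) ** 2
sparse_pairs = len(semantic) ** 2
```

With these hypothetical sizes, later layers that operate on the 16 semantic tokens compute 256 pairwise attention scores instead of 20736 for the 144 image tokens. This illustrates the quadratic cost mentioned in the abstract and is not a reproduction of the reported FLOPs figures.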

Related papers