Not All Patches are What You Need: Expediting Vision Transformers via
Token Reorganizations
- URL: http://arxiv.org/abs/2202.07800v1
- Date: Wed, 16 Feb 2022 00:19:42 GMT
- Title: Not All Patches are What You Need: Expediting Vision Transformers via
Token Reorganizations
- Authors: Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, Pengtao
Xie
- Abstract summary: Vision Transformers (ViTs) take all the image patches as tokens and construct multi-head self-attention (MHSA) among them.
Not all tokens are attentive in MHSA; for example, tokens containing semantically meaningless or distractive image backgrounds do not help the predictions.
We propose to reorganize image tokens during the feed-forward process of ViT models, a procedure that is integrated into the ViT during training.
- Score: 37.11387992603467
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers (ViTs) take all the image patches as tokens and construct
multi-head self-attention (MHSA) among them. Fully leveraging all of these image
tokens brings redundant computation, since not all the tokens are attentive in
MHSA. For example, tokens containing semantically meaningless or distractive
image backgrounds do not contribute positively to the ViT predictions. In this
work, we propose to reorganize image tokens during the feed-forward process of
ViT models, a procedure that is integrated into the ViT during training. For
each forward pass, we identify the attentive image tokens between the MHSA and
FFN (i.e., feed-forward network) modules, guided by the corresponding class
token attention. We then reorganize the image tokens by preserving the
attentive ones and fusing the inattentive ones, which expedites subsequent MHSA
and FFN computations. As a result, our method, EViT, improves ViTs from two
perspectives. First, under the same number of input image tokens, EViT reduces
MHSA and FFN computation for efficient inference. For instance, the inference
speed of DeiT-S is increased by 50% while its recognition accuracy drops by
only 0.3% on ImageNet classification. Second, at the same computational cost,
EViT allows ViTs to take more image tokens as input, drawn from
higher-resolution images, to improve recognition accuracy. For example, we
improve the recognition accuracy of DeiT-S by 1% on ImageNet classification at
the same computational cost as a vanilla DeiT-S. Meanwhile, our method does not
introduce additional parameters to ViTs. Experiments on standard benchmarks
show the effectiveness of our method. The code is available at
https://github.com/youweiliang/evit
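
To make the keep-and-fuse step concrete, below is a minimal PyTorch sketch of one way the token reorganization described in the abstract could look. The helper name `reorganize_tokens`, the `keep_rate` argument, and the tensor shapes are illustrative assumptions rather than the authors' code; the reference implementation is in the linked repository.

```python
import torch

def reorganize_tokens(x, cls_attn, keep_rate=0.7):
    """EViT-style token reorganization: keep attentive image tokens, fuse the rest.

    Hypothetical helper (not the authors' code). Assumed shapes:
      x        : (B, 1 + N, D) tokens, with the class token at index 0
      cls_attn : (B, N) class-token attention over the N image tokens,
                 averaged over heads, taken between the MHSA and FFN modules
      keep_rate: fraction of image tokens to preserve
    """
    B, total, D = x.shape
    N = total - 1
    k = max(1, int(N * keep_rate))

    cls_tok, img_tok = x[:, :1], x[:, 1:]

    # indices of the k most attentive image tokens
    _, topk_idx = cls_attn.topk(k, dim=1)                                  # (B, k)
    keep = torch.gather(img_tok, 1, topk_idx.unsqueeze(-1).expand(-1, -1, D))

    # fuse the remaining (inattentive) tokens into a single token,
    # weighted by their class-token attention
    mask = torch.ones(B, N, dtype=torch.bool, device=x.device)
    mask.scatter_(1, topk_idx, False)
    w = (cls_attn * mask).unsqueeze(-1)                                    # (B, N, 1)
    fused = (img_tok * w).sum(1, keepdim=True) / w.sum(1, keepdim=True).clamp_min(1e-6)

    # class token + attentive tokens + one fused token feed the subsequent FFN/MHSA
    return torch.cat([cls_tok, keep, fused], dim=1)

# Example with DeiT-S-like shapes (196 image tokens, embedding dim 384)
x = torch.randn(2, 197, 384)
cls_attn = torch.softmax(torch.randn(2, 196), dim=1)
print(reorganize_tokens(x, cls_attn).shape)  # torch.Size([2, 139, 384])
```

Because the reorganization is only an indexing and weighted-averaging step, it adds no learnable parameters, consistent with the abstract's statement that EViT does not enlarge the model.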
Related papers
- ToSA: Token Selective Attention for Efficient Vision Transformers [50.13756218204456]
ToSA is a token selective attention approach that can identify tokens that need to be attended as well as those that can skip a transformer layer.
We show that ToSA can significantly reduce computation costs while maintaining accuracy on the ImageNet classification benchmark.
arXiv Detail & Related papers (2024-06-13T05:17:21Z)
- LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation [37.72775203647514]
This paper proposes to use learnable meta tokens to formulate sparse tokens, which effectively learn key information and improve inference speed.
By employing Dual Cross-Attention (DCA) in the early stages with dense visual tokens, we obtain the hierarchical architecture LeMeViT in various sizes.
Experimental results on classification and dense prediction tasks show that LeMeViT achieves a significant $1.7\times$ speedup, fewer parameters, and competitive performance compared to the baseline models.
arXiv Detail & Related papers (2024-05-16T03:26:06Z)
- SkipViT: Speeding Up Vision Transformers with a Token-Level Skip Connection [3.960622297616708]
We propose a method to reduce unnecessary interactions between unimportant tokens by separating them and sending them through a different low-cost computational path.
Our experimental results on training ViT-small from scratch show that SkipViT can effectively drop 55% of the tokens while gaining more than 13% training throughput.
arXiv Detail & Related papers (2024-01-27T04:24:49Z)
- No Token Left Behind: Efficient Vision Transformer via Dynamic Token Idling [55.203866875294516]
Vision Transformers (ViTs) have demonstrated outstanding performance in computer vision tasks.
Various token pruning techniques have been introduced to alleviate the high computational burden of ViTs.
We propose IdleViT, a dynamic token-idle-based method that achieves an excellent trade-off between performance and efficiency.
arXiv Detail & Related papers (2023-10-09T12:10:41Z)
- Make A Long Image Short: Adaptive Token Length for Vision Transformers [5.723085628967456]
We propose an innovative approach to accelerate the ViT model by shortening long images.
Specifically, we introduce a method for adaptively assigning a token length to each image at test time to accelerate inference.
arXiv Detail & Related papers (2023-07-05T08:10:17Z)
- Multi-Scale And Token Mergence: Make Your ViT More Efficient [3.087140219508349]
Vision Transformer (ViT) has emerged as a prevalent model in the computer vision domain.
We propose a novel token pruning method that retains information from non-crucial tokens by merging them with more crucial tokens.
Our method achieves a remarkable 33% reduction in computational cost while incurring only a 0.1% decrease in accuracy on DeiT-S.
arXiv Detail & Related papers (2023-06-08T02:58:15Z)
- Making Vision Transformers Efficient from A Token Sparsification View [26.42498120556985]
We propose a novel Semantic Token ViT (STViT) for efficient global and local vision transformers.
Our method achieves competitive results compared to the original networks in object detection and instance segmentation, with over 30% FLOPs reduction for the backbone.
In addition, we design an STViT-R(ecover) network that restores detailed spatial information on top of STViT, making it applicable to downstream tasks.
arXiv Detail & Related papers (2023-03-15T15:12:36Z)
- Token-Label Alignment for Vision Transformers [93.58540411138164]
Data mixing strategies (e.g., CutMix) have shown the ability to greatly improve the performance of convolutional neural networks (CNNs).
We identify a token fluctuation phenomenon that has suppressed the potential of data mixing strategies.
We propose a token-label alignment (TL-Align) method to trace the correspondence between transformed tokens and the original tokens, maintaining a label for each token.
arXiv Detail & Related papers (2022-10-12T17:54:32Z)
- DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification [134.9393799043401]
We propose a dynamic token sparsification framework to prune redundant tokens based on the input.
By hierarchically pruning 66% of the input tokens, our method reduces FLOPs by 31%-37% and improves throughput by over 40%.
DynamicViT models achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet.
arXiv Detail & Related papers (2021-06-03T17:57:41Z)
- Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet [128.96032932640364]
We propose a new Tokens-To-Token Vision Transformer (T2T-ViT) to solve vision tasks.
T2T-ViT reduces the parameter count and MACs of vanilla ViT by about half, while achieving more than 2.5% improvement when trained from scratch on ImageNet.
For example, a T2T-ViT comparable in size to ResNet50 achieves 80.7% top-1 accuracy on ImageNet.
arXiv Detail & Related papers (2021-01-28T13:25:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.