Patch Slimming for Efficient Vision Transformers
- URL: http://arxiv.org/abs/2106.02852v1
- Date: Sat, 5 Jun 2021 09:46:00 GMT
- Title: Patch Slimming for Efficient Vision Transformers
- Authors: Yehui Tang, Kai Han, Yunhe Wang, Chang Xu, Jianyuan Guo, Chao Xu,
Dacheng Tao
- Abstract summary: We study the efficiency problem for visual transformers by excavating redundant computation in given networks.
We present a novel patch slimming approach that discards useless patches in a top-down paradigm.
Experimental results on benchmark datasets demonstrate that the proposed method can significantly reduce the computational costs of vision transformers.
- Score: 107.21146699082819
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies the efficiency problem for visual transformers by
excavating redundant computation in given networks. The recent transformer
architecture has demonstrated its effectiveness in achieving excellent
performance on a series of computer vision tasks. However, as with
convolutional neural networks, the huge computational cost of vision
transformers remains a severe issue. Considering that the attention mechanism
aggregates different patches layer by layer, we present a novel patch slimming
approach that discards useless patches in a top-down paradigm. We first
identify the effective patches in the last layer and then use them to guide the
patch selection process of previous layers. For each layer, the impact of each
patch on the final output feature is approximated, and patches with less impact
are removed. Experimental results on benchmark datasets demonstrate that
the proposed method can significantly reduce the computational costs of vision
transformers without affecting their performance. For example, over 45% of the
FLOPs of the ViT-Ti model can be reduced with only a 0.2% top-1 accuracy drop
on the ImageNet dataset.
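The abstract describes a backward, layer-by-layer selection: the effective patches of the last layer guide which patches survive in earlier layers, and each patch's approximate impact on the final output decides whether it is kept. The sketch below illustrates that top-down control flow in plain PyTorch; it is a minimal sketch under stated assumptions, not the authors' implementation. The impact proxy (attention paid by the retained next-layer patches), the fixed keep ratio, and all function names are illustrative assumptions.

```python
# Minimal sketch of top-down patch slimming. All names (patch_impact,
# top_down_patch_selection) and the attention-based impact proxy are
# illustrative assumptions, not the paper's exact criterion.
import torch

def patch_impact(attn: torch.Tensor, kept_next: torch.Tensor) -> torch.Tensor:
    """Approximate each patch's impact on the patches kept in the next layer.

    attn: (N, N) attention map of the current layer, rows = queries.
    kept_next: boolean mask (N,) of patches retained in layer l+1.
    Returns a per-patch score: how strongly the retained queries attend to it.
    """
    # Sum, over the retained next-layer patches, of the attention they pay
    # to each current-layer patch (columns = keys/values).
    return attn[kept_next].sum(dim=0)

def top_down_patch_selection(attn_maps, keep_ratio=0.55):
    """Select patches layer by layer, starting from the last layer.

    attn_maps: list of (N, N) attention maps, one per layer (CLS token
    excluded for simplicity). Returns a list of boolean keep-masks.
    """
    num_layers = len(attn_maps)
    n = attn_maps[-1].shape[0]

    # Last layer: start from the patches the classifier actually needs;
    # as a crude stand-in, keep all of them here.
    masks = [None] * num_layers
    masks[-1] = torch.ones(n, dtype=torch.bool)

    # Walk backwards: a patch is kept in layer l only if it matters to the
    # patches already kept in layer l+1.
    for l in range(num_layers - 2, -1, -1):
        scores = patch_impact(attn_maps[l], masks[l + 1])
        k = max(1, int(keep_ratio * n))
        mask = torch.zeros(n, dtype=torch.bool)
        mask[scores.topk(k).indices] = True
        masks[l] = mask
    return masks
```

In the paper, the impact estimate is chosen so that pruning barely affects accuracy (e.g. over 45% FLOPs reduction on ViT-Ti at a 0.2% top-1 drop); the fixed keep_ratio above only stands in for whatever per-layer pruning schedule achieves that trade-off.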
Related papers
- CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, which is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z) - Applying Plain Transformers to Real-World Point Clouds [0.0]
This work revisits the plain transformers in real-world point cloud understanding.
To close the performance gap due to the lack of inductive bias, we investigate self-supervised pre-training with masked autoencoders (MAE).
Our models achieve SOTA results in semantic segmentation on the S3DIS dataset and object detection on the ScanNet dataset with lower computational costs.
arXiv Detail & Related papers (2023-02-28T21:06:36Z) - Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer [56.87383229709899]
We develop an information rectification module (IRM) and a distribution-guided distillation scheme for fully quantized vision transformers (Q-ViT).
Our method achieves much better performance than the prior arts.
arXiv Detail & Related papers (2022-10-13T04:00:29Z) - Vicinity Vision Transformer [53.43198716947792]
We present a Vicinity Attention that introduces a locality bias to vision transformers with linear complexity.
Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
arXiv Detail & Related papers (2022-06-21T17:33:53Z) - Three things everyone should know about Vision Transformers [67.30250766591405]
Transformer architectures have rapidly gained traction in computer vision.
We offer three insights based on simple and easy-to-implement variants of vision transformers.
We evaluate the impact of these design choices using the ImageNet-1k dataset, and confirm our findings on the ImageNet-v2 test set.
arXiv Detail & Related papers (2022-03-18T08:23:03Z) - IA-RED$^2$: Interpretability-Aware Redundancy Reduction for Vision
Transformers [81.31885548824926]
The self-attention-based transformer model has recently become the leading backbone in the field of computer vision.
We present an Interpretability-Aware REDundancy REDuction framework (IA-RED$^2$).
We include extensive experiments on both image and video tasks, where our method could deliver up to 1.4X speed-up.
arXiv Detail & Related papers (2021-06-23T18:29:23Z) - Improve Vision Transformers Training by Suppressing Over-smoothing [28.171262066145612]
Introducing the transformer structure into computer vision tasks holds the promise of yielding a better speed-accuracy trade-off than traditional convolutional networks.
However, directly training vanilla transformers on vision tasks has been shown to yield unstable and sub-optimal results.
Recent works propose to modify transformer structures by incorporating convolutional layers to improve the performance on vision tasks.
arXiv Detail & Related papers (2021-04-26T17:43:04Z)