ATS: Adaptive Token Sampling For Efficient Vision Transformers
- URL: http://arxiv.org/abs/2111.15667v1
- Date: Tue, 30 Nov 2021 18:56:57 GMT
- Title: ATS: Adaptive Token Sampling For Efficient Vision Transformers
- Authors: Mohsen Fayyaz, Soroush Abbasi Koohpayegani, Farnoush Rezaei Jafari,
Eric Sommerlade, Hamid Reza Vaezi Joze, Hamed Pirsiavash, Juergen Gall
- Abstract summary: We introduce a differentiable parameter-free Adaptive Token Sampling (ATS) module, which can be plugged into any existing vision transformer architecture.
ATS empowers vision transformers by scoring and adaptively sampling significant tokens.
Our evaluations show that the proposed module improves the state-of-the-art by reducing the computational cost (GFLOPs) by 37% while preserving the accuracy.
- Score: 33.297806854292155
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While state-of-the-art vision transformer models achieve promising results
for image classification, they are computationally very expensive and require
many GFLOPs. Although the GFLOPs of a vision transformer can be decreased by
reducing the number of tokens in the network, there is no setting that is
optimal for all input images. In this work, we, therefore, introduce a
differentiable parameter-free Adaptive Token Sampling (ATS) module, which can
be plugged into any existing vision transformer architecture. ATS empowers
vision transformers by scoring and adaptively sampling significant tokens. As a
result, the number of tokens is no longer static but varies for each input
image. By integrating ATS as an additional layer within current transformer
blocks, we can convert them into much more efficient vision transformers with
an adaptive number of tokens. Since ATS is a parameter-free module, it can be
added to off-the-shelf pretrained vision transformers as a plug-and-play
module, thus reducing their GFLOPs without any additional training. However,
due to its differentiable design, one can also train a vision transformer
equipped with ATS. We evaluate our module on the ImageNet dataset by adding it
to multiple state-of-the-art vision transformers. Our evaluations show that the
proposed module improves the state-of-the-art by reducing the computational
cost (GFLOPs) by 37% while preserving the accuracy.
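The score-and-sample mechanism described above can be made concrete with a short sketch. The following is a minimal PyTorch illustration, not the authors' reference implementation: the function name adaptive_token_sampling, the CLS-attention-times-value-norm scoring, and the fixed-quantile inverse-transform sampling are assumptions chosen to match the abstract's description of a parameter-free stage that scores tokens and samples an adaptive number of them.
```python
# Minimal sketch (not the authors' code) of a parameter-free score-and-sample
# stage for a ViT attention block. Assumptions: significance scores come from
# the CLS token's attention weighted by value norms, and tokens are picked by
# inverse-transform sampling over the cumulative distribution of those scores.
import torch


def adaptive_token_sampling(attn, v, n_samples):
    """attn: (B, H, N, N) softmaxed attention; v: (B, H, N, D); index 0 is CLS."""
    # Score token j by how strongly CLS attends to it, scaled by ||v_j||.
    scores = attn[:, :, 0, 1:] * v[:, :, 1:, :].norm(dim=-1)   # (B, H, N-1)
    scores = scores.mean(dim=1)                                # average over heads
    scores = scores / scores.sum(dim=-1, keepdim=True)         # normalize to a distribution

    # Inverse-transform sampling: push fixed quantiles through the score CDF.
    cdf = scores.cumsum(dim=-1)                                # (B, N-1)
    q = (torch.arange(n_samples, device=cdf.device, dtype=cdf.dtype) + 0.5) / n_samples
    idx = torch.searchsorted(cdf, q.expand(cdf.size(0), -1).contiguous())
    idx = idx.clamp(max=cdf.size(-1) - 1) + 1                  # shift past the CLS position

    # Keep CLS plus the unique sampled tokens; duplicates collapse, so the
    # number of retained tokens varies from image to image.
    cls_idx = torch.zeros(1, dtype=idx.dtype, device=idx.device)
    return [torch.unique(torch.cat([cls_idx, row])) for row in idx]
```
Since duplicate samples collapse under torch.unique, an image whose attention mass concentrates on a few tokens keeps fewer of them than a cluttered image, which is the sense in which the token count adapts per input; the returned indices only gate which rows of the token matrix reach the next block, so no parameters are introduced and a pretrained model can be wrapped without retraining.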
Related papers
- SparseSwin: Swin Transformer with Sparse Transformer Block [1.7243216387069678]
This paper aims to reduce the number of parameters and, in turn, make the transformer more efficient.
We present the Sparse Transformer (SparTa) Block, a modified transformer block with the addition of a sparse token converter.
The proposed SparseSwin model outperforms other state-of-the-art models in image classification, with accuracies of 86.96%, 97.43%, and 85.35% on the ImageNet100, CIFAR10, and CIFAR100 datasets, respectively.
arXiv Detail & Related papers (2023-09-11T04:03:43Z) - Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles [65.54857068975068]
In this paper, we argue that this additional bulk is unnecessary.
By pretraining with a strong visual pretext task (MAE), we can strip out all the bells-and-whistles from a state-of-the-art multi-stage vision transformer.
We create Hiera, an extremely simple hierarchical vision transformer that is more accurate than previous models.
arXiv Detail & Related papers (2023-06-01T17:59:58Z) - CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z) - Reversible Vision Transformers [74.3500977090597]
Reversible Vision Transformers are a memory efficient architecture for visual recognition.
We adapt two popular models, namely Vision Transformer and Multiscale Vision Transformers, to reversible variants.
We find that the additional computational burden of recomputing activations is more than overcome for deeper models.
arXiv Detail & Related papers (2023-02-09T18:59:54Z) - HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [126.89573619301953]
We propose a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT).
HiViT enjoys both high efficiency and good performance in MIM.
In running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9× speed-up over Swin-B.
arXiv Detail & Related papers (2022-05-30T09:34:44Z) - Multi-Tailed Vision Transformer for Efficient Inference [44.43126137573205]
Vision Transformer (ViT) has achieved promising performance in image recognition.
We propose a Multi-Tailed Vision Transformer (MT-ViT) in the paper.
MT-ViT adopts multiple tails to produce visual sequences of different lengths for the following Transformer encoder.
arXiv Detail & Related papers (2022-03-03T09:30:55Z) - Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT performs 3.8% higher than the vanilla ViT in terms of top-1 accuracy.
arXiv Detail & Related papers (2021-08-03T18:04:31Z) - DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification [134.9393799043401]
We propose a dynamic token sparsification framework to prune redundant tokens based on the input (a code sketch of this general pruning pattern appears after the list below).
By hierarchically pruning 66% of the input tokens, our method reduces FLOPs by 31%-37% and improves throughput by over 40%.
DynamicViT models can achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet.
arXiv Detail & Related papers (2021-06-03T17:57:41Z) - Incorporating Convolution Designs into Visual Transformers [24.562955955312187]
We propose a new Convolution-enhanced image Transformer (CeiT), which combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies.
Experimental results on ImageNet and seven downstream tasks show the effectiveness and generalization ability of CeiT compared with previous Transformers and state-of-the-art CNNs, without requiring a large amount of training data and extra CNN teachers.
arXiv Detail & Related papers (2021-03-22T13:16:12Z)
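Several of the related methods above, DynamicViT in particular, reduce the token count with a small learned scoring head rather than the parameter-free attention-based scores of ATS. The snippet below is a hedged sketch of that general pattern only: the class name TokenPruner, the keep ratio, and the hard top-k selection are illustrative assumptions, and the differentiable (Gumbel-softmax style) masking DynamicViT uses during training is omitted.
```python
# Hedged sketch (not the authors' code) of input-dependent token pruning with a
# learned scoring head, in the spirit of DynamicViT: score each patch token,
# keep only the top fraction, and always carry the CLS token forward.
import torch
import torch.nn as nn


class TokenPruner(nn.Module):
    def __init__(self, dim, keep_ratio=0.7):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.score_head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))

    def forward(self, x):
        """x: (B, N, D) token embeddings with token 0 = CLS; returns (B, K+1, D)."""
        cls_tok, patches = x[:, :1], x[:, 1:]
        scores = self.score_head(patches).squeeze(-1)              # (B, N-1)
        k = max(1, int(patches.size(1) * self.keep_ratio))
        idx = scores.topk(k, dim=-1).indices                       # (B, K)
        idx = idx.unsqueeze(-1).expand(-1, -1, patches.size(-1))   # (B, K, D)
        kept = patches.gather(dim=1, index=idx)
        return torch.cat([cls_tok, kept], dim=1)
```
Stacking such a module after every few transformer blocks yields the kind of hierarchical pruning schedule the DynamicViT summary refers to; the trade-off against ATS is that the scoring head must be trained, whereas ATS reads its scores off the existing attention maps.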