AdaViT: Adaptive Vision Transformers for Efficient Image Recognition
- URL: http://arxiv.org/abs/2111.15668v1
- Date: Tue, 30 Nov 2021 18:57:02 GMT
- Title: AdaViT: Adaptive Vision Transformers for Efficient Image Recognition
- Authors: Lingchen Meng, Hengduo Li, Bor-Chun Chen, Shiyi Lan, Zuxuan Wu,
Yu-Gang Jiang, Ser-Nam Lim
- Abstract summary: We introduce AdaViT, an adaptive framework that learns to derive usage policies on which patches, self-attention heads and transformer blocks to use.
Our method obtains more than a 2x improvement in efficiency over state-of-the-art vision transformers with only a 0.8% drop in accuracy.
- Score: 78.07924262215181
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Built on top of self-attention mechanisms, vision transformers have
demonstrated remarkable performance on a variety of vision tasks recently.
While achieving excellent performance, they still incur a relatively high
computational cost that scales up drastically as the number of patches,
self-attention heads and transformer blocks increases. In this paper, we argue
that due to the large variations among images, their need for modeling
long-range dependencies between patches differs. To this end, we introduce
AdaViT, an adaptive computation framework that learns to derive usage policies
on which patches, self-attention heads and transformer blocks to use throughout
the backbone on a per-input basis, aiming to improve the inference efficiency of
vision transformers with a minimal drop in accuracy for image recognition.
Optimized jointly with the transformer backbone in an end-to-end manner, a
light-weight decision network is attached to the backbone to produce decisions
on the fly. Extensive experiments on ImageNet demonstrate that our method
obtains more than a 2x improvement in efficiency over state-of-the-art
vision transformers with only a 0.8% drop in accuracy, achieving good
efficiency/accuracy trade-offs conditioned on different computational budgets.
We further conduct quantitative and qualitative analysis of the learned usage
policies and provide more insight into the redundancy in vision transformers.
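As a rough illustration of the idea, the sketch below shows how a light-weight decision head attached to a transformer block could emit per-input keep/drop decisions for patches, attention heads, and the block itself. It is a minimal sketch, not the paper's implementation: the module and variable names are hypothetical, and a Gumbel-Softmax relaxation is assumed for keeping the discrete decisions differentiable during end-to-end training.

```python
# Minimal sketch (hypothetical names) of a per-input usage-policy head.
# Assumes a straight-through Gumbel-Softmax relaxation for the discrete
# keep/drop decisions; the paper's exact parameterization may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecisionHead(nn.Module):
    """Predicts keep decisions for patch tokens, attention heads, and the
    whole block from the current token features (assumed layout)."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.patch_logits = nn.Linear(dim, 2)              # keep / drop per patch token
        self.head_logits = nn.Linear(dim, num_heads * 2)   # keep / drop per attention head
        self.block_logits = nn.Linear(dim, 2)              # keep / skip the whole block

    def forward(self, tokens: torch.Tensor, tau: float = 1.0):
        # tokens: (B, N, dim); tokens[:, 0] is assumed to be the class token.
        cls = tokens[:, 0]
        B = tokens.shape[0]

        # Hard (straight-through) Gumbel-Softmax samples; column 0 = "keep".
        patch_keep = F.gumbel_softmax(self.patch_logits(tokens[:, 1:]),
                                      tau=tau, hard=True)[..., 0]          # (B, N-1)
        head_keep = F.gumbel_softmax(self.head_logits(cls).view(B, -1, 2),
                                     tau=tau, hard=True)[..., 0]           # (B, num_heads)
        block_keep = F.gumbel_softmax(self.block_logits(cls),
                                      tau=tau, hard=True)[..., 0]          # (B,)
        return patch_keep, head_keep, block_keep


# Usage: the masks would gate the block's patches, heads, and residual path.
x = torch.randn(2, 197, 384)                       # 196 patches + CLS, ViT-S width
p, h, b = DecisionHead(dim=384, num_heads=6)(x)
print(p.shape, h.shape, b.shape)                   # (2, 196) (2, 6) (2,)
```

In a full model, the expected keep rates would also enter a usage penalty so the learned policy can be steered toward a given computational budget.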
Related papers
- Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers [7.89533262149443]
Self-attention in Transformers comes with a high computational cost because of its quadratic complexity.
Our benchmark shows that using a larger model is, in general, more efficient than using higher-resolution images.
arXiv Detail & Related papers (2023-08-18T08:06:49Z)
- Image Deblurring by Exploring In-depth Properties of Transformer [86.7039249037193]
We leverage deep features extracted from a pretrained vision transformer (ViT) to encourage recovered images to be sharp without sacrificing performance on quantitative metrics.
By comparing the transformer features of the recovered image and the target one, the pretrained transformer provides high-resolution, blur-sensitive semantic information.
One variant regards the features as vectors and computes the discrepancy between the representations of the recovered and target images in Euclidean space.
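Under the reading that the transformer features act as a perceptual-style supervision signal, a minimal sketch of such a Euclidean feature discrepancy might look as follows; `vit_features` is a hypothetical stand-in for a frozen, pretrained ViT feature extractor, not the paper's actual pipeline.

```python
# Sketch of a ViT-feature (perceptual-style) loss: treat the frozen
# transformer's token features as vectors and penalize their Euclidean
# distance between the restored image and the sharp target.
import torch
import torch.nn.functional as F


def vit_perceptual_loss(restored: torch.Tensor,
                        target: torch.Tensor,
                        vit_features) -> torch.Tensor:
    with torch.no_grad():
        target_feat = vit_features(target)     # (B, N, C), frozen reference
    restored_feat = vit_features(restored)     # gradients flow to the deblurring net
    # Euclidean (MSE) discrepancy between the two sets of token features.
    return F.mse_loss(restored_feat, target_feat)


# Usage sketch with a stand-in extractor; a real setup would plug in a frozen
# pretrained ViT and add this term to a pixel-wise reconstruction loss.
fake_vit = lambda img: img.flatten(2).transpose(1, 2)  # (B, C, H, W) -> (B, HW, C)
restored = torch.rand(1, 3, 64, 64, requires_grad=True)
target = torch.rand(1, 3, 64, 64)
vit_perceptual_loss(restored, target, fake_vit).backward()
```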
arXiv Detail & Related papers (2023-03-24T14:14:25Z)
- Transformers For Recognition In Overhead Imagery: A Reality Check [0.0]
We compare the impact of adding transformer structures into state-of-the-art segmentation models for overhead imagery.
Our results suggest that transformers provide consistent, but modest, performance improvements.
arXiv Detail & Related papers (2022-10-23T02:17:31Z)
- Vicinity Vision Transformer [53.43198716947792]
We present Vicinity Attention, which introduces a locality bias into vision transformers with linear complexity.
Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
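The linear-complexity formulation of Vicinity Attention is not reproduced here; purely to illustrate the general notion of a locality bias, the sketch below penalizes standard (quadratic) attention logits by the 2D distance between patch positions. The bias form and all names are assumptions, not the paper's method.

```python
# Sketch of a generic locality bias: subtract a distance-based penalty from
# the attention logits so each patch favors spatially nearby patches.
import torch


def locality_biased_attention(q, k, v, grid_size: int, strength: float = 0.1):
    """q, k, v: (B, heads, N, d) for N = grid_size**2 patch tokens."""
    ys, xs = torch.meshgrid(torch.arange(grid_size),
                            torch.arange(grid_size), indexing="ij")
    pos = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()   # (N, 2)
    dist = torch.cdist(pos, pos)                                      # (N, N)
    logits = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5 - strength * dist
    return torch.softmax(logits, dim=-1) @ v


# Usage on a 14x14 patch grid (196 tokens), 6 heads of width 64.
q = k = v = torch.randn(2, 6, 196, 64)
print(locality_biased_attention(q, k, v, grid_size=14).shape)   # (2, 6, 196, 64)
```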
arXiv Detail & Related papers (2022-06-21T17:33:53Z)
- Three things everyone should know about Vision Transformers [67.30250766591405]
Transformer architectures have rapidly gained traction in computer vision.
We offer three insights based on simple and easy-to-implement variants of vision transformers.
We evaluate the impact of these design choices using the ImageNet-1k dataset, and confirm our findings on the ImageNet-v2 test set.
arXiv Detail & Related papers (2022-03-18T08:23:03Z)
- AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of a vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens processed in the network as inference proceeds.
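The sketch below illustrates the general idea of shrinking the token set between blocks; it uses a simple norm-based top-k heuristic as a stand-in importance score and is not the paper's actual token-halting mechanism.

```python
# Sketch of progressive token reduction between transformer blocks: score
# each patch token, keep the top fraction, and carry fewer tokens forward.
import torch


def prune_tokens(tokens: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """tokens: (B, N, C) with the class token at index 0, which is always kept."""
    cls_tok, patches = tokens[:, :1], tokens[:, 1:]
    scores = patches.norm(dim=-1)                      # stand-in importance score
    k = max(1, int(patches.shape[1] * keep_ratio))
    idx = scores.topk(k, dim=1).indices                # (B, k) kept positions
    idx = idx.unsqueeze(-1).expand(-1, -1, patches.shape[-1])
    kept = patches.gather(1, idx)                      # (B, k, C)
    return torch.cat([cls_tok, kept], dim=1)


# Usage: shrink the token set after every few blocks so deeper blocks run on
# progressively fewer tokens.
x = torch.randn(2, 197, 384)
for ratio in (0.7, 0.7, 0.7):       # e.g. after blocks 3, 6, 9
    x = prune_tokens(x, ratio)
print(x.shape)                      # roughly (2, 1 + 196 * 0.7**3, 384)
```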
arXiv Detail & Related papers (2021-12-14T18:56:07Z)
- Efficient Vision Transformers via Fine-Grained Manifold Distillation [96.50513363752836]
Vision transformer architectures have shown extraordinary performance on many computer vision tasks.
Although network performance is boosted, transformers often require more computational resources.
We propose to extract useful information from the teacher transformer through the relationships between images and their divided patches.
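As a hedged illustration of relation-based (manifold-style) distillation, the sketch below matches pairwise patch-to-patch similarity structure between a frozen teacher and a smaller student; the paper's fine-grained formulation differs in its details, and all names here are hypothetical.

```python
# Sketch of relational (manifold-style) distillation: instead of matching raw
# features, match the pairwise similarity structure of patch features between
# a frozen teacher ViT and a smaller student.
import torch
import torch.nn.functional as F


def similarity_matrix(feats: torch.Tensor) -> torch.Tensor:
    """feats: (B, N, C) patch features -> (B, N, N) cosine similarities."""
    feats = F.normalize(feats, dim=-1)
    return feats @ feats.transpose(1, 2)


def manifold_distill_loss(student_feats: torch.Tensor,
                          teacher_feats: torch.Tensor) -> torch.Tensor:
    # Relations are width-free, so teacher and student dimensions may differ.
    return F.mse_loss(similarity_matrix(student_feats),
                      similarity_matrix(teacher_feats.detach()))


# Usage: teacher width 768, student width 384; both split the image into the
# same 196 patches, so the (196 x 196) relation matrices are comparable.
loss = manifold_distill_loss(torch.randn(2, 196, 384), torch.randn(2, 196, 768))
```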
arXiv Detail & Related papers (2021-07-03T08:28:34Z)
- Improve Vision Transformers Training by Suppressing Over-smoothing [28.171262066145612]
Introducing the transformer structure into computer vision tasks holds the promise of yielding a better speed-accuracy trade-off than traditional convolution networks.
However, directly training vanilla transformers on vision tasks has been shown to yield unstable and sub-optimal results.
Recent works propose to modify transformer structures by incorporating convolutional layers to improve the performance on vision tasks.
arXiv Detail & Related papers (2021-04-26T17:43:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.