AdaViT: Adaptive Tokens for Efficient Vision Transformer
- URL: http://arxiv.org/abs/2112.07658v1
- Date: Tue, 14 Dec 2021 18:56:07 GMT
- Title: AdaViT: Adaptive Tokens for Efficient Vision Transformer
- Authors: Hongxu Yin, Arash Vahdat, Jose Alvarez, Arun Mallya, Jan Kautz, Pavlo
Molchanov
- Abstract summary: We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens in vision transformers that are processed in the network as inference proceeds.
- Score: 91.88404546243113
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce AdaViT, a method that adaptively adjusts the inference cost of
vision transformer (ViT) for images of different complexity. AdaViT achieves
this by automatically reducing the number of tokens in vision transformers that
are processed in the network as inference proceeds. We reformulate Adaptive
Computation Time (ACT) for this task, extending halting to discard redundant
spatial tokens. The appealing architectural properties of vision transformers
enable our adaptive token reduction mechanism to speed up inference without
modifying the network architecture or inference hardware. We demonstrate that
AdaViT requires no extra parameters or sub-network for halting, as we base the
learning of adaptive halting on the original network parameters. We further
introduce distributional prior regularization that stabilizes training compared
to prior ACT approaches. On the image classification task (ImageNet1K), we show
that our proposed AdaViT yields high efficacy in filtering informative spatial
features and cutting down on the overall compute. The proposed method improves
the throughput of DeiT-Tiny by 62% and DeiT-Small by 38% with only 0.3%
accuracy drop, outperforming prior art by a large margin.
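As a rough illustration of the token-halting idea described above, the following PyTorch-style sketch accumulates a per-token halting score across transformer blocks and stops updating a token once that score crosses 1 - eps; the score is read from the token embedding itself, so no extra halting parameters are introduced. This is a minimal sketch under stated assumptions, not the authors' implementation: the constants gamma, beta, and eps and the function adaptive_token_forward are illustrative, halted tokens are merely frozen rather than dropped (a real implementation would remove them from the sequence to obtain the reported throughput gains), and the distributional prior regularization mentioned in the abstract is omitted.

```python
# Minimal, illustrative sketch of ACT-style adaptive token halting (not the
# authors' code). The halting score is read from the first embedding channel
# of each token, so no extra parameters or sub-network are needed for halting.
import torch


def adaptive_token_forward(blocks, x, eps=0.01, gamma=5.0, beta=-10.0):
    """Run transformer `blocks` over tokens `x` of shape (B, N, D).

    Returns halting-weighted token states and a mean ponder cost that can be
    added to the training loss to encourage early halting.
    """
    B, N, _ = x.shape
    cum_halt = x.new_zeros(B, N)    # accumulated halting probability per token
    remainder = x.new_ones(B, N)    # probability mass left when a token halts
    ponder = x.new_zeros(B, N)      # ponder cost: number of updates + remainder
    active = torch.ones(B, N, dtype=torch.bool, device=x.device)
    out = torch.zeros_like(x)       # halting-weighted output states

    for block in blocks:
        # Freeze halted tokens; a real implementation would drop them from the
        # sequence (and from attention) to realize the actual speedup.
        x = torch.where(active.unsqueeze(-1), block(x), x)

        # Halting score taken from the first embedding channel of each token.
        h = torch.sigmoid(gamma * x[..., 0] + beta)
        h = torch.where(active, h, torch.zeros_like(h))

        halting_now = active & (cum_halt + h >= 1.0 - eps)
        still_active = active & ~halting_now

        # Weight a token's state by h while it keeps running and by its
        # leftover remainder at the block where it halts (ACT bookkeeping).
        weight = torch.where(halting_now, remainder,
                             torch.where(still_active, h, torch.zeros_like(h)))
        out = out + weight.unsqueeze(-1) * x

        ponder = ponder + active.float()
        ponder = ponder + torch.where(halting_now, remainder, torch.zeros_like(h))
        remainder = torch.where(still_active, remainder - h, remainder)
        cum_halt = cum_halt + h
        active = still_active

    # Tokens that never crossed the threshold contribute their leftover mass.
    leftover = torch.where(active, remainder, torch.zeros_like(remainder))
    out = out + leftover.unsqueeze(-1) * x
    return out, ponder.mean()
```

In such a setup, blocks would be a ViT's list of transformer blocks and x its patch embeddings, and the returned ponder term would be added to the classification loss to trade accuracy against compute, in the spirit of ACT.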
Related papers
- Sparse-Tuning: Adapting Vision Transformers with Efficient Fine-tuning and Inference [14.030836300221756]
Sparse-Tuning is a novel PEFT method that accounts for the information redundancy in images and videos.
Sparse-Tuning minimizes the quantity of tokens processed at each layer, leading to a quadratic reduction in computational and memory overhead.
Our results show that our Sparse-Tuning reduces GFLOPs to 62%-70% of the original ViT-B while achieving state-of-the-art performance.
arXiv Detail & Related papers (2024-05-23T15:34:53Z) - Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation [67.13876021157887]
Dynamic Tuning (DyT) is a novel approach to improve both parameter and inference efficiency for ViT adaptation.
DyT achieves superior performance compared to existing PEFT methods while evoking only 71% of their FLOPs on the VTAB-1K benchmark.
arXiv Detail & Related papers (2024-03-18T14:05:52Z) - LF-ViT: Reducing Spatial Redundancy in Vision Transformer for Efficient Image Recognition [9.727093171296678]
Vision Transformer (ViT) excels in accuracy when handling high-resolution images.
It confronts the challenge of significant spatial redundancy, leading to increased computational and memory requirements.
We present the Localization and Focus Vision Transformer (LF-ViT).
It operates by strategically curtailing computational demands without impinging on performance.
arXiv Detail & Related papers (2024-01-08T01:32:49Z) - TPC-ViT: Token Propagation Controller for Efficient Vision Transformer [6.341420717393898]
Vision transformers (ViTs) have achieved promising results on a variety of computer vision tasks.
Previous approaches that employ gradual token reduction to address this challenge assume that token redundancy in one layer implies redundancy in all the following layers.
We propose a novel token propagation controller (TPC) that incorporates two different token-distributions.
arXiv Detail & Related papers (2024-01-03T00:10:33Z) - CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z) - Vicinity Vision Transformer [53.43198716947792]
We present a Vicinity Attention that introduces a locality bias to vision transformers with linear complexity.
Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
arXiv Detail & Related papers (2022-06-21T17:33:53Z) - AdaViT: Adaptive Vision Transformers for Efficient Image Recognition [78.07924262215181]
We introduce AdaViT, an adaptive framework that learns to derive usage policies on which patches, self-attention heads and transformer blocks to use.
Our method obtains more than 2x improvement on efficiency compared to state-of-the-art vision transformers with only 0.8% drop of accuracy.
arXiv Detail & Related papers (2021-11-30T18:57:02Z) - Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT performs 3.8% higher than the vanilla ViT in terms of top-1 accuracy.
arXiv Detail & Related papers (2021-08-03T18:04:31Z)