SaiT: Sparse Vision Transformers through Adaptive Token Pruning
- URL: http://arxiv.org/abs/2210.05832v1
- Date: Tue, 11 Oct 2022 23:26:42 GMT
- Title: SaiT: Sparse Vision Transformers through Adaptive Token Pruning
- Authors: Ling Li, David Thorsley, Joseph Hassoun
- Abstract summary: Sparse adaptive image Transformer (SaiT) offers varying levels of model acceleration by merely changing the token sparsity on the fly.
SaiT reduces complexity (FLOPs) by 39%-43% and increases throughput by 67%-91% with less than 0.5% accuracy loss.
- Score: 5.1477382898520485
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While vision transformers have achieved impressive results, effectively and
efficiently accelerating these models can further boost performance. In this
work, we propose a dense/sparse training framework to obtain a unified model,
enabling weight sharing across various token densities. Thus, one model offers a
range of accuracy and throughput tradeoffs for different applications. Moreover,
we introduce adaptive token pruning to optimize patch token sparsity based
on the input image. In addition, we investigate knowledge distillation to
enhance token selection capability in the early transformer modules. Sparse
adaptive image Transformer (SaiT) offers varying levels of model acceleration
by merely changing the token sparsity on the fly. Specifically, SaiT reduces
computational complexity (FLOPs) by 39%-43% and increases throughput by
67%-91% with less than 0.5% accuracy loss across various vision transformer
models. Meanwhile, the same model also provides a zero-accuracy-drop option
by skipping the sparsification step. SaiT achieves better accuracy-computation
tradeoffs than state-of-the-art transformer and convolutional models.
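For illustration, here is a minimal sketch of the token-pruning mechanism described above: patch tokens are ranked by an importance score and only the top fraction is kept, with the keep ratio adjustable at inference time. Scoring tokens by the attention they receive from the [CLS] token is an assumption for this sketch, not necessarily the scoring SaiT uses.

```python
# Illustrative sketch of adaptive token pruning (not the SaiT code).
# Assumption: token importance is scored by the attention the [CLS] token
# pays to each patch token, a common heuristic in token-pruning work.
import torch

def prune_tokens(tokens: torch.Tensor, cls_attn: torch.Tensor,
                 keep_ratio: float) -> torch.Tensor:
    """Keep the top `keep_ratio` fraction of patch tokens per image.

    tokens:   (B, 1 + N, D) -- the [CLS] token followed by N patch tokens
    cls_attn: (B, N)        -- attention from [CLS] to each patch token
    """
    b, n_plus_1, d = tokens.shape
    n = n_plus_1 - 1
    k = max(1, int(n * keep_ratio))                 # number of tokens to keep
    idx = cls_attn.topk(k, dim=1).indices           # (B, k) most-attended patches
    idx = idx.unsqueeze(-1).expand(-1, -1, d)       # (B, k, D) for gather
    kept = torch.gather(tokens[:, 1:], 1, idx)      # select the scored patches
    return torch.cat([tokens[:, :1], kept], dim=1)  # re-attach [CLS]

# keep_ratio = 1.0 skips sparsification entirely (the zero-accuracy-drop
# option); lowering it trades accuracy for throughput with the same weights.
tokens = torch.randn(2, 197, 384)   # e.g. 196 patch tokens + [CLS]
cls_attn = torch.rand(2, 196)
print(prune_tokens(tokens, cls_attn, keep_ratio=0.6).shape)  # (2, 118, 384)
```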
Related papers
- Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation [73.31524865643709]
We present a plug-and-play pruning-and-recovering framework, called Hourglass Tokenizer (HoT), for efficient transformer-based 3D pose estimation from videos.
Our HoT begins by pruning the pose tokens of redundant frames and ends by recovering full-length tokens, leaving only a few pose tokens in the intermediate transformer blocks.
Our method achieves both high efficiency and estimation accuracy compared to the original VPT models.
arXiv Detail & Related papers (2023-11-20T18:59:51Z) - CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z) - Transformers For Recognition In Overhead Imagery: A Reality Check [0.0]
We compare the impact of adding transformer structures to state-of-the-art segmentation models for overhead imagery.
Our results suggest that transformers provide consistent, but modest, performance improvements.
arXiv Detail & Related papers (2022-10-23T02:17:31Z) - AdaViT: Adaptive Vision Transformers for Efficient Image Recognition [78.07924262215181]
We introduce AdaViT, an adaptive framework that learns to derive usage policies on which patches, self-attention heads and transformer blocks to use.
Our method obtains more than a 2x improvement in efficiency over state-of-the-art vision transformers, with only a 0.8% drop in accuracy.
arXiv Detail & Related papers (2021-11-30T18:57:02Z) - ATS: Adaptive Token Sampling For Efficient Vision Transformers [33.297806854292155]
We introduce a differentiable parameter-free Adaptive Token Sampling (ATS) module, which can be plugged into any existing vision transformer architecture.
ATS empowers vision transformers by scoring and adaptively sampling significant tokens.
Our evaluations show that the proposed module improves the state-of-the-art by reducing the computational cost (GFLOPs) by 37% while preserving the accuracy.
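A simplified sketch of the score-and-sample idea follows, under the assumption that token indices are drawn by inverse-transform sampling over the normalized score distribution, so the number of unique kept tokens adapts to each image; the exact ATS scoring (derived from attention weights) is omitted here.

```python
# Sketch in the spirit of score-and-sample token selection (illustrative,
# simplified; not the ATS implementation).
import torch

def adaptive_sample(scores: torch.Tensor, n_samples: int) -> torch.Tensor:
    """Inverse-transform sampling of token indices for one image.

    scores: (N,) nonnegative significance scores. Duplicate draws collapse
    via `unique`, so images whose score mass concentrates on a few tokens
    keep fewer tokens than cluttered ones.
    """
    probs = scores / scores.sum()
    cdf = torch.cumsum(probs, dim=0)                  # monotone in [0, 1]
    quantiles = torch.linspace(0.0, 1.0 - 1e-6, n_samples)
    idx = torch.searchsorted(cdf, quantiles)
    idx = idx.clamp(max=scores.numel() - 1)           # guard fp round-off
    return torch.unique(idx)

scores = torch.rand(196)                              # stand-in token scores
kept = adaptive_sample(scores, n_samples=98)
print(kept.numel(), "tokens kept")                    # varies per image
```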
arXiv Detail & Related papers (2021-11-30T18:56:57Z) - Efficient Vision Transformers via Fine-Grained Manifold Distillation [96.50513363752836]
Vision transformer architectures have shown extraordinary performance on many computer vision tasks.
Although they boost network performance, transformers often require more computational resources.
We propose to extract useful information from the teacher transformer through the relationships between images and their divided patches.
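One way to read this is as matching the patch-to-patch similarity structure of teacher and student. The sketch below assumes a simple pairwise-similarity loss, an illustration rather than the paper's exact formulation.

```python
# Hypothetical patch-level relation distillation loss (illustrative only).
import torch
import torch.nn.functional as F

def manifold_loss(student: torch.Tensor, teacher: torch.Tensor) -> torch.Tensor:
    """student: (B, N, D_s) and teacher: (B, N, D_t) patch embeddings.

    The (B*N, B*N) similarity matrices have the same shape even when the
    embedding widths differ, so no projection head is needed here.
    """
    s = F.normalize(student.flatten(0, 1), dim=-1)    # (B*N, D_s), unit norm
    t = F.normalize(teacher.flatten(0, 1), dim=-1)    # (B*N, D_t), unit norm
    return F.mse_loss(s @ s.T, t @ t.T)               # match the two manifolds

# Narrow student distilled from a wider teacher (widths are assumptions).
loss = manifold_loss(torch.randn(4, 196, 192), torch.randn(4, 196, 384))
print(loss.item())
```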
arXiv Detail & Related papers (2021-07-03T08:28:34Z) - DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification [134.9393799043401]
We propose a dynamic token sparsification framework to prune redundant tokens based on the input.
By hierarchically pruning 66% of the input tokens, our method reduces FLOPs by 31%-37% and improves throughput by over 40%.
DynamicViT models can achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet.
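A minimal sketch of a differentiable keep/drop decision in this spirit, assuming a small per-token MLP whose hard binary mask is trained with straight-through Gumbel-Softmax (module and names here are illustrative).

```python
# Sketch of a differentiable token keep/drop head (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenDecision(nn.Module):
    """Per-token keep/drop head (hypothetical; names are illustrative)."""

    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 2))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (B, N, D) patch tokens -> (B, N) hard {0, 1} keep mask."""
        logits = self.mlp(x)  # (B, N, 2): keep vs. drop logits
        # Straight-through Gumbel-Softmax: hard 0/1 in the forward pass,
        # soft gradients in the backward pass, so pruning trains end to end.
        return F.gumbel_softmax(logits, tau=1.0, hard=True)[..., 0]

decide = TokenDecision(384)
mask = decide(torch.randn(2, 196, 384))
print(mask.sum(dim=1))  # kept-token counts differ per image
```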
arXiv Detail & Related papers (2021-06-03T17:57:41Z) - Efficient pre-training objectives for Transformers [84.64393460397471]
We study several efficient pre-training objectives for Transformers-based models.
We show that eliminating the MASK token and computing the loss over the whole output are essential choices for improving performance.
arXiv Detail & Related papers (2021-04-20T00:09:37Z) - AxFormer: Accuracy-driven Approximation of Transformers for Faster, Smaller and more Accurate NLP Models [4.247712017691596]
AxFormer is a framework that applies accuracy-driven approximations to create optimized transformer models for a given downstream task.
Our experiments show that AxFormer models are up to 4.5% more accurate, while also being up to 2.5X faster and up to 3.2X smaller than conventional fine-tuned models.
arXiv Detail & Related papers (2020-10-07T23:29:34Z)