Making Vision Transformers Efficient from A Token Sparsification View
- URL: http://arxiv.org/abs/2303.08685v2
- Date: Thu, 30 Mar 2023 11:56:29 GMT
- Title: Making Vision Transformers Efficient from A Token Sparsification View
- Authors: Shuning Chang, Pichao Wang, Ming Lin, Fan Wang, David Junhao Zhang,
Rong Jin, Mike Zheng Shou
- Abstract summary: We propose a novel Semantic Token ViT (STViT) for efficient global and local vision transformers.
Our method can achieve competitive results compared to the original networks in object detection and instance segmentation, with over 30% FLOPs reduction for the backbone.
In addition, we design an STViT-R(ecover) network to restore the detailed spatial information based on the STViT, making it work for downstream tasks.
- Score: 26.42498120556985
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The quadratic computational complexity to the number of tokens limits the
practical applications of Vision Transformers (ViTs). Several works propose to
prune redundant tokens to achieve efficient ViTs. However, these methods
generally suffer from (i) dramatic accuracy drops, (ii) application difficulty
in the local vision transformer, and (iii) non-general-purpose networks for
downstream tasks. In this work, we propose a novel Semantic Token ViT (STViT)
for efficient global and local vision transformers, which can also be revised
to serve as a backbone for downstream tasks. The semantic tokens represent
cluster centers, and they are initialized by pooling image tokens in space and
recovered by attention, which can adaptively represent global or local semantic
information. Due to the cluster properties, a few semantic tokens can attain
the same effect as vast image tokens, for both global and local vision
transformers. For instance, only 16 semantic tokens on DeiT-(Tiny,Small,Base)
can achieve the same accuracy with more than 100% inference speed improvement
and nearly 60% FLOPs reduction; on Swin-(Tiny,Small,Base), we can employ 16
semantic tokens in each window to further speed it up by around 20% with a
slight accuracy increase. Besides great success in image classification, we also
extend our method to video recognition. In addition, we design an
STViT-R(ecover) network to restore the detailed spatial information based on
the STViT, making it work for downstream tasks, something previous token
sparsification methods cannot do. Experiments demonstrate that our method can
achieve competitive results compared to the original networks in object
detection and instance segmentation, with over 30% FLOPs reduction for the
backbone. Code is available at http://github.com/changsn/STViT-R
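The core mechanism described in the abstract, semantic tokens that are initialized by spatially pooling image tokens and then recovered by attention so that they behave like adaptive cluster centers, can be sketched roughly as below. This is a minimal illustration under assumed names and hyperparameters (a SemanticTokenModule with 16 tokens from a 4x4 pooling grid), not the authors' implementation; see the linked repository for the official code.

```python
# Minimal sketch of the semantic-token idea: pool image tokens in space to
# initialize a small set of semantic tokens, then refine ("recover") them by
# cross-attention over all image tokens. Names and sizes are illustrative
# assumptions, not the STViT-R reference code.
import torch
import torch.nn as nn


class SemanticTokenModule(nn.Module):
    def __init__(self, dim: int, num_semantic_tokens: int = 16, num_heads: int = 4):
        super().__init__()
        self.num_semantic_tokens = num_semantic_tokens
        # Cross-attention: semantic tokens are the queries, image tokens the keys/values.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, image_tokens: torch.Tensor) -> torch.Tensor:
        # image_tokens: (B, N, C) with N = H * W tokens on a square grid.
        B, N, C = image_tokens.shape
        H = W = int(N ** 0.5)
        grid = int(self.num_semantic_tokens ** 0.5)
        # Initialization: pool image tokens in space down to a grid x grid map.
        x = image_tokens.transpose(1, 2).reshape(B, C, H, W)
        semantic = nn.functional.adaptive_avg_pool2d(x, grid)   # (B, C, grid, grid)
        semantic = semantic.flatten(2).transpose(1, 2)          # (B, S, C), S = grid * grid
        # Recovery: each semantic token attends to all image tokens, so it can
        # adaptively represent global or local semantic information.
        kv = self.norm_kv(image_tokens)
        refined, _ = self.attn(self.norm_q(semantic), kv, kv)
        return semantic + refined                                # (B, S, C)


# Later transformer blocks run on S << N semantic tokens, which is where the
# FLOPs reduction reported above comes from.
tokens = torch.randn(2, 196, 384)              # e.g. a DeiT-Small feature map: 14x14 tokens, dim 384
print(SemanticTokenModule(384)(tokens).shape)  # torch.Size([2, 16, 384])
```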
Related papers
- TCFormer: Visual Recognition via Token Clustering Transformer [79.24723479088097]
We propose the Token Clustering Transformer (TCFormer), which generates dynamic vision tokens based on semantic meaning.
Our dynamic tokens possess two crucial characteristics: (1) representing image regions with similar semantic meanings using the same vision token, even if those regions are not adjacent, and (2) concentrating on regions with valuable details and representing them using fine tokens.
arXiv Detail & Related papers (2024-07-16T02:26:18Z) - GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation [30.343504537684755]
Vision Transformers (ViTs) have revolutionized the field of computer vision, yet their deployments on resource-constrained devices remain challenging.
To expedite ViTs, token pruning and token merging approaches have been developed, which aim at reducing the number of tokens involved in computation.
We introduce a novel Graph-based Token Propagation (GTP) method to resolve the challenge of balancing model efficiency and information preservation for efficient ViTs.
arXiv Detail & Related papers (2023-11-06T11:14:19Z) - AiluRus: A Scalable ViT Framework for Dense Prediction [95.1313839257891]
Vision transformers (ViTs) have emerged as a prevalent architecture for vision tasks owing to their impressive performance.
We propose to apply adaptive resolution for different regions in the image according to their importance.
We evaluate our proposed method on three different datasets and observe promising performance.
arXiv Detail & Related papers (2023-11-02T12:48:43Z) - Vision Transformer with Super Token Sampling [93.70963123497327]
Vision transformer has achieved impressive performance for many vision tasks.
However, it may suffer from high redundancy when capturing local features in shallow layers.
Super tokens attempt to provide a semantically meaningful tessellation of visual content.
arXiv Detail & Related papers (2022-11-21T03:48:13Z) - Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT performs 3.8% higher than the vanilla ViT in terms of top-1 accuracy.
arXiv Detail & Related papers (2021-08-03T18:04:31Z) - DynamicViT: Efficient Vision Transformers with Dynamic Token
Sparsification [134.9393799043401]
We propose a dynamic token sparsification framework to prune redundant tokens based on the input.
By hierarchically pruning 66% of the input tokens, our method reduces FLOPs by 31%-37% and improves throughput by over 40%.
DynamicViT models can achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet; a minimal sketch of this style of input-dependent pruning appears after this list.
arXiv Detail & Related papers (2021-06-03T17:57:41Z) - Tokens-to-Token ViT: Training Vision Transformers from Scratch on
ImageNet [128.96032932640364]
We propose a new Tokens-to-Token Vision Transformer (T2T-ViT) to solve vision tasks.
T2T-ViT reduces the parameter count and MACs of vanilla ViT by half, while achieving more than 2.5% improvement when trained from scratch on ImageNet.
For example, T2T-ViT with ResNet50 comparable size can achieve 80.7% top-1 accuracy on ImageNet.
arXiv Detail & Related papers (2021-01-28T13:25:28Z)
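For contrast with the clustering view taken by STViT, the input-dependent token pruning referenced in the DynamicViT entry above can be illustrated along the following lines. The scoring head, keep ratio, and class name are assumptions made for this sketch; the paper's actual prediction module and its end-to-end training scheme differ.

```python
# Minimal sketch of input-dependent token pruning: a small scoring head rates
# every token and only the top-k tokens are kept for the next block. This is
# an illustration of the general technique, not the DynamicViT reference code.
import torch
import torch.nn as nn


class TokenPruner(nn.Module):
    def __init__(self, dim: int, keep_ratio: float = 0.7):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.score = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 1))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, C); keep the int(keep_ratio * N) highest-scoring tokens.
        B, N, C = tokens.shape
        k = max(1, int(N * self.keep_ratio))
        scores = self.score(tokens).squeeze(-1)              # (B, N)
        keep_idx = scores.topk(k, dim=1).indices             # (B, k)
        keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, C)  # (B, k, C)
        return tokens.gather(1, keep_idx)                    # (B, k, C)


pruned = TokenPruner(384)(torch.randn(2, 196, 384))
print(pruned.shape)  # torch.Size([2, 137, 384])
```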