Self-slimmed Vision Transformer
- URL: http://arxiv.org/abs/2111.12624v1
- Date: Wed, 24 Nov 2021 16:48:57 GMT
- Title: Self-slimmed Vision Transformer
- Authors: Zhuofan Zong, Kunchang Li, Guanglu Song, Yali Wang, Yu Qiao, Biao
Leng, Yu Liu
- Abstract summary: Vision transformers (ViTs) have become popular architectures and have outperformed convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
- Score: 52.67243496139175
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision transformers (ViTs) have become popular architectures and have
outperformed convolutional neural networks (CNNs) on various vision tasks.
However, such powerful transformers bring a heavy computational burden, and the
essential barrier behind it is the exhaustive token-to-token comparison. To
alleviate this, we delve deeply into the model properties of ViTs and observe
that they exhibit sparse attention with high token similarity. This intuitively
suggests a feasible, structure-agnostic dimension, the token number, along which to reduce
the computational cost. Based on this exploration, we propose a generic
self-slimmed learning approach for vanilla ViTs, namely SiT. Specifically, we
first design a novel Token Slimming Module (TSM), which can boost the inference
efficiency of ViTs through dynamic token aggregation. Unlike hard token
dropping, our TSM softly integrates redundant tokens into fewer informative
ones, dynamically zooming visual attention without cutting off
discriminative token relations in the images. Furthermore, we introduce a
concise Dense Knowledge Distillation (DKD) framework, which densely transfers
unorganized token information in a flexible auto-encoder manner. Due to the
similar structure between teacher and student, our framework can effectively
leverage structure knowledge for better convergence. Finally, we conduct
extensive experiments to evaluate our SiT. The results demonstrate that our method can
speed up ViTs by 1.7x with negligible accuracy drop, and even speed up ViTs by
3.6x while maintaining 97% of their performance. Surprisingly, by simply arming
LV-ViT with our SiT, we achieve new state-of-the-art performance on ImageNet,
surpassing all the CNNs and ViTs in the recent literature.
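
The abstract names two components: the Token Slimming Module (TSM), which softly aggregates redundant tokens into fewer informative ones instead of hard-dropping them, and the Dense Knowledge Distillation (DKD) framework, which densely transfers token-level knowledge from an unpruned teacher to the slimmed student. The sketch below is a minimal PyTorch illustration of that idea, not the paper's implementation: the scoring head, layer sizes, and the pooling-based teacher-student alignment are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenSlimmingModule(nn.Module):
    """Minimal sketch of soft token slimming: N input tokens are softly
    aggregated into M < N slimmed tokens via a learned weight matrix,
    rather than hard-dropping tokens (hypothetical layer sizes)."""

    def __init__(self, dim: int, num_out_tokens: int):
        super().__init__()
        # Small scoring head: for each input token, predict its contribution
        # to each of the M slimmed tokens.
        self.score = nn.Sequential(
            nn.Linear(dim, dim // 2),
            nn.GELU(),
            nn.Linear(dim // 2, num_out_tokens),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, dim) patch tokens
        weights = self.score(x).transpose(1, 2)  # (batch, M, N)
        # Softmax over the input-token axis: every slimmed token becomes a
        # convex combination of all input tokens (soft aggregation).
        weights = weights.softmax(dim=-1)
        return weights @ x                       # (batch, M, dim)


def dense_token_distillation_loss(student_tokens: torch.Tensor,
                                  teacher_tokens: torch.Tensor) -> torch.Tensor:
    """Sketch of a dense, token-wise distillation objective aligning the
    student's fewer tokens with the teacher's tokens. The pooling-based
    alignment below is a placeholder, not the paper's exact DKD design."""
    b, m, d = student_tokens.shape
    # Pool teacher tokens down to the student's token count before matching.
    pooled = F.adaptive_avg_pool1d(teacher_tokens.transpose(1, 2), m).transpose(1, 2)
    return F.mse_loss(student_tokens, pooled)


if __name__ == "__main__":
    tsm = TokenSlimmingModule(dim=384, num_out_tokens=98)
    x = torch.randn(2, 196, 384)        # 196 patch tokens from a 224x224 image
    slim = tsm(x)                       # -> (2, 98, 384)
    teacher = torch.randn(2, 196, 384)  # tokens from an unpruned teacher ViT
    loss = dense_token_distillation_loss(slim, teacher)
    print(slim.shape, loss.item())
```

The design choice this sketch reflects is the one the abstract emphasizes: each slimmed token is a convex combination of all input tokens, so discriminative token relations are never cut off outright, in contrast to hard token dropping.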
Related papers
- Fibottention: Inceptive Visual Representation Learning with Diverse Attention Across Heads [10.169639612525643]
Visual perception tasks are predominantly solved by ViTs.
Despite their effectiveness, ViTs encounter a computational bottleneck due to the complexity of computing self-attention.
We propose the Fibottention architecture, which is built upon an approximation of self-attention.
arXiv Detail & Related papers (2024-06-27T17:59:40Z) - DeiT-LT Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets [30.178427266135756]
Vision Transformer (ViT) has emerged as a prominent architecture for various computer vision tasks.
ViT requires a large amount of data for pre-training.
We introduce DeiT-LT to tackle the problem of training ViTs from scratch on long-tailed datasets.
arXiv Detail & Related papers (2024-04-03T17:58:21Z) - SERE: Exploring Feature Self-relation for Self-supervised Transformer [79.5769147071757]
Vision transformers (ViT) have strong representation ability with spatial self-attention and channel-level feedforward networks.
Recent works reveal that self-supervised learning helps unleash the great potential of ViT.
We observe that relational modeling on spatial and channel dimensions distinguishes ViT from other networks.
arXiv Detail & Related papers (2022-06-10T15:25:00Z) - Auto-scaling Vision Transformers without Training [84.34662535276898]
We propose As-ViT, an auto-scaling framework for Vision Transformers (ViTs) without training.
As-ViT automatically discovers and scales up ViTs in an efficient and principled manner.
As a unified framework, As-ViT achieves strong performance on classification and detection.
arXiv Detail & Related papers (2022-02-24T06:30:55Z) - SPViT: Enabling Faster Vision Transformers via Soft Token Pruning [38.10083471492964]
Pruning, a traditional model compression paradigm for hardware efficiency, has been widely applied in various DNN structures.
We propose a computation-aware soft pruning framework that can be set up on vanilla Transformers with both flat and CNN-type structures.
Our framework significantly reduces the computation cost of ViTs while maintaining comparable performance on image classification.
arXiv Detail & Related papers (2021-12-27T20:15:25Z) - On Improving Adversarial Transferability of Vision Transformers [97.17154635766578]
Vision transformers (ViTs) process input images as sequences of patches via self-attention.
We study the adversarial feature space of ViT models and their transferability.
We introduce two novel strategies specific to the architecture of ViT models.
arXiv Detail & Related papers (2021-06-08T08:20:38Z) - Emerging Properties in Self-Supervised Vision Transformers [57.36837447500544]
We show that self-supervised learning provides Vision Transformers (ViTs) with new properties that stand out compared to convolutional networks (convnets).
We implement our findings into a simple self-supervised method, called DINO, which we interpret as a form of self-distillation with no labels.
We show the synergy between DINO and ViTs by achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.
arXiv Detail & Related papers (2021-04-29T12:28:51Z) - DeepViT: Towards Deeper Vision Transformer [92.04063170357426]
Vision transformers (ViTs) have been successfully applied in image classification tasks recently.
We show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when they are scaled deeper.
We propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity.
arXiv Detail & Related papers (2021-03-22T14:32:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.