LightViT: Towards Light-Weight Convolution-Free Vision Transformers
- URL: http://arxiv.org/abs/2207.05557v1
- Date: Tue, 12 Jul 2022 14:27:57 GMT
- Title: LightViT: Towards Light-Weight Convolution-Free Vision Transformers
- Authors: Tao Huang, Lang Huang, Shan You, Fei Wang, Chen Qian, Chang Xu
- Abstract summary: Vision transformers (ViTs) are usually considered to be less light-weight than convolutional neural networks (CNNs).
We present LightViT as a new family of light-weight ViTs that achieves a better accuracy-efficiency balance with pure transformer blocks, without convolution.
Experiments show that our model achieves significant improvements on image classification, object detection, and semantic segmentation tasks.
- Score: 43.48734363817069
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision transformers (ViTs) are usually considered to be less light-weight
than convolutional neural networks (CNNs) due to the lack of inductive bias.
Recent works thus resort to convolutions as a plug-and-play module and embed
them in various ViT counterparts. In this paper, we argue that the
convolutional kernels perform information aggregation to connect all tokens;
however, they would actually be unnecessary for light-weight ViTs if this
explicit aggregation could function in a more homogeneous way. Inspired by
this, we present LightViT as a new family of light-weight ViTs that achieve a
better accuracy-efficiency balance with pure transformer blocks, without
convolution. Concretely, we introduce a global yet efficient aggregation scheme
into both self-attention and feed-forward network (FFN) of ViTs, where
additional learnable tokens are introduced to capture global dependencies; and
bi-dimensional channel and spatial attentions are imposed over token
embeddings. Experiments show that our model achieves significant improvements
on image classification, object detection, and semantic segmentation tasks. For
example, our LightViT-T achieves 78.7% accuracy on ImageNet with only 0.7G
FLOPs, outperforming PVTv2-B0 by 8.2% while being 11% faster on GPU. Code is
available at https://github.com/hunto/LightViT.
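To make the aggregation scheme above more concrete, here is a minimal PyTorch-style sketch of the idea as described in the abstract: a small set of learnable global tokens is mixed into self-attention so information can be aggregated across all image tokens, and a simple bi-dimensional (channel and spatial) gating is applied inside the FFN. This is an illustrative sketch under assumptions, not the authors' implementation; the module names (GlobalTokenAttention, BiDimFFN, LightBlock), token counts, and dimensions are hypothetical, and the official code at https://github.com/hunto/LightViT should be consulted for the actual architecture.
```python
# Hypothetical sketch (not the official LightViT code) of the idea in the
# abstract: learnable "global tokens" act as an aggregation path in attention,
# and channel/spatial gating is applied over token embeddings in the FFN.
import torch
import torch.nn as nn


class GlobalTokenAttention(nn.Module):
    """Image tokens attend together with a few learnable global tokens."""

    def __init__(self, dim: int, num_heads: int = 4, num_global_tokens: int = 8):
        super().__init__()
        self.global_tokens = nn.Parameter(torch.zeros(1, num_global_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b = x.size(0)
        g = self.global_tokens.expand(b, -1, -1)
        # Concatenate global tokens with image tokens so attention can route
        # information through them.
        z = torch.cat([g, x], dim=1)
        z, _ = self.attn(z, z, z)
        # Drop the global tokens again; they only serve as an aggregation path.
        return z[:, g.size(1):, :]


class BiDimFFN(nn.Module):
    """FFN with simple channel and spatial gating over token embeddings."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(dim, hidden), nn.Linear(hidden, dim)
        self.act = nn.GELU()
        self.channel_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.spatial_gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        c = self.channel_gate(x.mean(dim=1, keepdim=True))  # (B, 1, C) channel weights
        s = self.spatial_gate(x)                             # (B, N, 1) per-token weights
        return self.fc2(self.act(self.fc1(x * c * s)))


class LightBlock(nn.Module):
    """Pre-norm transformer block combining the two components above."""

    def __init__(self, dim: int = 192, num_heads: int = 4):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = GlobalTokenAttention(dim, num_heads)
        self.ffn = BiDimFFN(dim, hidden=dim * 4)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.norm1(x))
        return x + self.ffn(self.norm2(x))


if __name__ == "__main__":
    tokens = torch.randn(2, 196, 192)   # batch of 14x14 patch embeddings
    print(LightBlock()(tokens).shape)   # torch.Size([2, 196, 192])
```
Note that this sketch runs standard full attention over the concatenated tokens purely to show the information flow; the actual LightViT design uses the global tokens to keep aggregation cheap, which is where the reported FLOPs savings come from.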
Related papers
- Depth-Wise Convolutions in Vision Transformers for Efficient Training on Small Datasets [11.95214938154427]
Vision Transformer (ViT) captures global information by dividing images into patches.
ViT lacks inductive bias when trained on image or video datasets.
We present a lightweight Depth-Wise Convolution module as a shortcut in ViT models.
arXiv Detail & Related papers (2024-07-28T04:23:40Z)
- CAIT: Triple-Win Compression towards High Accuracy, Fast Inference, and Favorable Transferability For ViTs [79.54107547233625]
Vision Transformers (ViTs) have emerged as state-of-the-art models for various vision tasks.
We propose a joint compression method for ViTs that offers both high accuracy and fast inference speed.
Our proposed method can achieve state-of-the-art performance across various ViTs.
arXiv Detail & Related papers (2023-09-27T16:12:07Z)
- Making Vision Transformers Efficient from A Token Sparsification View [26.42498120556985]
We propose a novel Semantic Token ViT (STViT) for efficient global and local vision transformers.
Our method achieves results competitive with the original networks on object detection and instance segmentation, with an over 30% reduction in backbone FLOPs.
In addition, we design an STViT-R(ecover) network that restores the detailed spatial information on top of STViT, making it applicable to downstream tasks.
arXiv Detail & Related papers (2023-03-15T15:12:36Z)
- Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets [91.25055890980084]
There still remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets.
We propose Dynamic Hybrid Vision Transformer (DHVT) as the solution to enhance the two inductive biases.
Our DHVT achieves state-of-the-art performance with a lightweight model: 85.68% on CIFAR-100 with 22.8M parameters and 82.3% on ImageNet-1K with 24.0M parameters.
arXiv Detail & Related papers (2022-10-12T06:54:39Z)
- EIT: Efficiently Lead Inductive Biases to ViT [17.66805405320505]
Vision Transformer (ViT) depends on properties similar to the inductive bias inherent in Convolutional Neural Networks.
We propose an architecture called Efficiently lead Inductive biases to ViT (EIT), which can effectively lead the inductive biases to both phases of ViT.
On four popular small-scale datasets, EIT improves accuracy over ViT by 12.6% on average with fewer parameters and FLOPs.
arXiv Detail & Related papers (2022-03-14T14:01:17Z)
- Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice [111.47461527901318]
Vision Transformer (ViT) has recently demonstrated promise in computer vision problems.
ViT's performance saturates quickly as depth increases, due to the observed attention collapse or patch uniformity.
We propose two techniques to mitigate the undesirable low-pass limitation.
arXiv Detail & Related papers (2022-03-09T23:55:24Z)
- Self-slimmed Vision Transformer [52.67243496139175]
Vision transformers (ViTs) have become popular architectures and have outperformed convolutional neural networks (CNNs) on various vision tasks.
We propose a generic self-slimmed learning approach for vanilla ViTs, namely SiT.
Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs.
arXiv Detail & Related papers (2021-11-24T16:48:57Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)