EIT: Efficiently Lead Inductive Biases to ViT
- URL: http://arxiv.org/abs/2203.07116v1
- Date: Mon, 14 Mar 2022 14:01:17 GMT
- Title: EIT: Efficiently Lead Inductive Biases to ViT
- Authors: Rui Xia, Jingchao Wang, Chao Xue, Boyu Deng, Fang Wang
- Abstract summary: Vision Transformer (ViT) depends on properties similar to the inductive bias inherent in Convolutional Neural Networks.
We propose an architecture called Efficiently lead Inductive biases to ViT (EIT), which can effectively lead the inductive biases to both phases of ViT.
On four popular small-scale datasets, EIT improves accuracy over ViT by 12.6% on average, with fewer parameters and FLOPs.
- Score: 17.66805405320505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformer (ViT) depends on properties similar to the inductive bias
inherent in Convolutional Neural Networks to perform better on non-ultra-large
scale datasets. In this paper, we propose an architecture called Efficiently
lead Inductive biases to ViT (EIT), which can effectively lead the inductive
biases to both phases of ViT. In the Patches Projection phase, a convolutional
max-pooling structure is used to produce overlapping patches. In the
Transformer Encoder phase, we design a novel inductive bias introduction
structure called decreasing convolution, which is introduced parallel to the
multi-headed attention module, in which the embedding's different channels are
processed separately. On four popular small-scale datasets, compared with
ViT, EIT has an accuracy improvement of 12.6% on average with fewer parameters
and FLOPs. Compared with ResNet, EIT exhibits higher accuracy with only 17.7%
of the parameters and fewer FLOPs. Finally, ablation studies show that EIT is
efficient and does not require position embedding. Code is coming soon:
https://github.com/MrHaiPi/EIT
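The two phases described in the abstract can be sketched in PyTorch. This is a minimal illustration only: the kernel sizes, strides, embedding dimension, and the use of a depth-wise convolution for the per-channel parallel branch are all assumptions, not the paper's exact configuration.

```python
# Hedged sketch of the two EIT phases from the abstract. All module names,
# kernel sizes, and channel splits are illustrative assumptions.
import torch
import torch.nn as nn

class OverlappingPatchProjection(nn.Module):
    """Patches Projection phase: a convolution + max-pooling stack that
    yields overlapping patches instead of ViT's non-overlapping split."""
    def __init__(self, in_ch=3, embed_dim=192):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_ch, embed_dim, kernel_size=7, stride=2, padding=3),
            # kernel larger than stride -> receptive fields overlap
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x):                     # (B, C, H, W)
        x = self.proj(x)                      # (B, D, H/4, W/4)
        return x.flatten(2).transpose(1, 2)   # (B, N, D) token sequence

class ParallelConvAttentionBlock(nn.Module):
    """Transformer Encoder phase: a depth-wise convolution branch run in
    parallel with multi-headed attention, so each embedding channel is
    processed by its own kernel (groups == dim)."""
    def __init__(self, dim=192, heads=3, seq_hw=16):
        super().__init__()
        self.seq_hw = seq_hw
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x):                     # (B, N, D), N = seq_hw ** 2
        y = self.norm(x)
        attn_out, _ = self.attn(y, y, y)
        B, N, D = x.shape
        img = y.transpose(1, 2).reshape(B, D, self.seq_hw, self.seq_hw)
        conv_out = self.conv(img).flatten(2).transpose(1, 2)
        return x + attn_out + conv_out        # parallel fusion of both branches

tokens = OverlappingPatchProjection()(torch.randn(1, 3, 64, 64))
out = ParallelConvAttentionBlock(seq_hw=16)(tokens)
print(out.shape)  # torch.Size([1, 256, 192])
```

Because the convolution branch carries spatial locality, the token order is fixed by the feature-map layout, which is consistent with the abstract's finding that EIT does not require position embedding.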
Related papers
- Sparse-Tuning: Adapting Vision Transformers with Efficient Fine-tuning and Inference [12.088764810907968]
Sparse-Tuning is a novel tuning paradigm that substantially enhances both fine-tuning and inference efficiency for pre-trained ViT models.
Sparse-Tuning efficiently fine-tunes the pre-trained ViT by sparsely preserving the informative tokens and merging redundant ones.
arXiv Detail & Related papers (2024-05-23T15:34:53Z)
- Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation [67.13876021157887]
Dynamic Tuning (DyT) is a novel approach to improve both parameter and inference efficiency for ViT adaptation.
DyT achieves comparable or even superior performance compared to existing PEFT methods.
arXiv Detail & Related papers (2024-03-18T14:05:52Z)
- Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer [56.87383229709899]
We develop an information rectification module (IRM) and a distribution guided distillation scheme for fully quantized vision transformers (Q-ViT)
Our method achieves a much better performance than the prior arts.
arXiv Detail & Related papers (2022-10-13T04:00:29Z)
- Towards Flexible Inductive Bias via Progressive Reparameterization Scheduling [25.76814731638375]
There are two de facto standard architectures in computer vision: Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs)
We show that these approaches overlook that the optimal inductive bias also changes as the target data scale changes.
The more convolution-like inductive bias a model includes, the smaller the data scale at which the ViT-like model outperforms ResNet.
arXiv Detail & Related papers (2022-10-04T04:20:20Z)
- LightViT: Towards Light-Weight Convolution-Free Vision Transformers [43.48734363817069]
Vision transformers (ViTs) are usually considered to be less light-weight than convolutional neural networks (CNNs)
We present LightViT as a new family of light-weight ViTs to achieve better accuracy-efficiency balance upon the pure transformer blocks without convolution.
Experiments show that our model achieves significant improvements on image classification, object detection, and semantic segmentation tasks.
arXiv Detail & Related papers (2022-07-12T14:27:57Z)
- Global Vision Transformer Pruning with Hessian-Aware Saliency [93.33895899995224]
This work challenges the common design philosophy of the Vision Transformer (ViT) model with uniform dimension across all the stacked blocks in a model stage.
We derive a novel Hessian-based structural pruning criteria comparable across all layers and structures, with latency-aware regularization for direct latency reduction.
Performing iterative pruning on the DeiT-Base model leads to a new architecture family called NViT (Novel ViT), with a novel parameter redistribution that utilizes parameters more efficiently.
arXiv Detail & Related papers (2021-10-10T18:04:59Z)
- When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations [111.44860506703307]
Vision Transformers (ViTs) and MLPs signal further efforts to replace hand-wired features or inductive biases with general-purpose neural architectures.
This paper investigates ViTs and MLP-Mixers through the lens of loss geometry, intending to improve the models' data efficiency at training and generalization at inference.
We show that the improved robustness is attributed to sparser active neurons in the first few layers.
The resultant ViTs outperform ResNets of similar size and throughput when trained from scratch on ImageNet without large-scale pretraining or strong data augmentations.
arXiv Detail & Related papers (2021-06-03T02:08:03Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
- DeepViT: Towards Deeper Vision Transformer [92.04063170357426]
Vision transformers (ViTs) have been successfully applied in image classification tasks recently.
We show that, unlike convolutional neural networks (CNNs) that can be improved by stacking more convolutional layers, the performance of ViTs saturates fast when scaled deeper.
We propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity.
arXiv Detail & Related papers (2021-03-22T14:32:07Z)
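The Re-attention idea summarized in the DeepViT entry can be sketched roughly as follows: a learnable head-mixing matrix regenerates the attention maps before they are applied to the values. This is a hedged sketch; the mixing-matrix shape, normalization choice, and all dimensions are assumptions rather than the paper's exact design.

```python
# Hedged sketch of head-mixing re-attention; hyperparameters are illustrative.
import torch
import torch.nn as nn

class ReAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        # Learnable matrix that mixes attention maps across heads,
        # initialized to the identity (plain attention at start).
        self.theta = nn.Parameter(torch.eye(heads))
        self.norm = nn.BatchNorm2d(heads)  # normalize regenerated maps

    def forward(self, x):                               # (B, N, D)
        B, N, D = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.heads, D // self.heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)            # each (B, H, N, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        # Re-generate attention maps by mixing across the head dimension.
        attn = torch.einsum('hg,bgnm->bhnm', self.theta, attn)
        attn = self.norm(attn)
        out = attn @ v                                  # (B, H, N, d)
        return out.transpose(1, 2).reshape(B, N, D)

y = ReAttention()(torch.randn(2, 16, 64))
print(y.shape)  # torch.Size([2, 16, 64])
```

Mixing maps across heads counters the tendency of deep ViT layers to collapse to near-identical attention patterns, which is the diversity problem the summary refers to.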
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences of their use.