EIT: Efficiently Lead Inductive Biases to ViT
- URL: http://arxiv.org/abs/2203.07116v1
- Date: Mon, 14 Mar 2022 14:01:17 GMT
- Title: EIT: Efficiently Lead Inductive Biases to ViT
- Authors: Rui Xia, Jingchao Wang, Chao Xue, Boyu Deng, Fang Wang
- Abstract summary: Vision Transformer (ViT) depends on properties similar to the inductive bias inherent in Convolutional Neural Networks to perform well on datasets that are not ultra-large.
The authors propose an architecture called Efficiently lead Inductive biases to ViT (EIT), which introduces inductive biases into both phases of ViT (Patches Projection and Transformer Encoder).
On four popular small-scale datasets, EIT improves accuracy over ViT by 12.6% on average while using fewer parameters and FLOPs.
- Score: 17.66805405320505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformer (ViT) depends on properties similar to the inductive bias
inherent in Convolutional Neural Networks to perform better on non-ultra-large
scale datasets. In this paper, we propose an architecture called Efficiently
lead Inductive biases to ViT (EIT), which can effectively lead the inductive
biases to both phases of ViT. In the Patches Projection phase, a convolutional
max-pooling structure is used to produce overlapping patches. In the
Transformer Encoder phase, we design a novel inductive bias introduction
structure called decreasing convolution, which is introduced parallel to the
multi-headed attention module, by which the embedding's different channels are
processed respectively. On four popular small-scale datasets, compared with
ViT, EIT has an accuracy improvement of 12.6% on average with fewer parameters
and FLOPs. Compared with ResNet, EIT exhibits higher accuracy with only 17.7% of
the parameters and fewer FLOPs. Finally, ablation studies show that EIT is
efficient and does not require position embedding. Code is coming soon:
https://github.com/MrHaiPi/EIT
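The abstract states that the Patches Projection phase uses a convolutional max-pooling structure to produce overlapping patches. The sketch below only illustrates what "overlapping" means in terms of window arithmetic; it is not the authors' implementation. The image size, kernel sizes, strides, and padding are hypothetical values chosen for illustration and are not taken from the paper. It uses the standard output-size formula for convolution and pooling windows, floor((n + 2p - k) / s) + 1.

```python
def conv_out_size(size, kernel, stride, padding=0):
    """Output length of a convolution or pooling window along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def patch_grid(image_size, stages):
    """Apply a sequence of (kernel, stride, padding) stages to one spatial axis."""
    size = image_size
    for kernel, stride, padding in stages:
        size = conv_out_size(size, kernel, stride, padding)
    return size


def receptive_field(stages):
    """Input pixels seen by one output token, and the step between neighbouring tokens."""
    rf, jump = 1, 1
    for kernel, stride, _ in stages:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf, jump


# Hypothetical settings, NOT taken from the paper.
image_size = 32
vit_stages = [(4, 4, 0)]                 # plain ViT: 4x4 patches, stride 4
overlap_stages = [(3, 2, 1), (3, 2, 1)]  # 3x3 conv, stride 2, then 3x3 max-pool, stride 2

vit_tokens = patch_grid(image_size, vit_stages) ** 2
overlap_tokens = patch_grid(image_size, overlap_stages) ** 2
vit_rf, vit_stride = receptive_field(vit_stages)
overlap_rf, overlap_stride = receptive_field(overlap_stages)

print("ViT tokens:", vit_tokens, "window:", vit_rf, "step:", vit_stride)
print("Conv+pool tokens:", overlap_tokens, "window:", overlap_rf, "step:", overlap_stride)
```

With these assumed settings both variants yield the same number of tokens, but each conv-plus-pool token covers a window wider than the step between neighbouring tokens, so adjacent tokens share input pixels. In the plain ViT split the window equals the step, so patches do not overlap.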
Related papers
- CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction [77.8576094863446]
We propose a new deCoupled duAl-interactive lineaR attEntion (CARE) mechanism.
We first propose an asymmetrical feature decoupling strategy that asymmetrically decouples the learning process for local inductive bias and long-range dependencies.
By adopting a decoupled learning way and fully exploiting complementarity across features, our method can achieve both high efficiency and accuracy.
arXiv Detail & Related papers (2024-11-25T07:56:13Z)
- Visual Fourier Prompt Tuning [63.66866445034855]
We propose the Visual Fourier Prompt Tuning (VFPT) method as a general and effective solution for adapting large-scale transformer-based models.
Our approach incorporates the Fast Fourier Transform into prompt embeddings and harmoniously considers both spatial and frequency domain information.
Our results demonstrate that our approach outperforms current state-of-the-art baselines on two benchmarks.
arXiv Detail & Related papers (2024-11-02T18:18:35Z)
- Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation [67.13876021157887]
Dynamic Tuning (DyT) is a novel approach to improve both parameter and inference efficiency for ViT adaptation.
DyT achieves superior performance compared to existing PEFT methods while using only 71% of their FLOPs on the VTAB-1K benchmark.
arXiv Detail & Related papers (2024-03-18T14:05:52Z)
- Towards Flexible Inductive Bias via Progressive Reparameterization Scheduling [25.76814731638375]
There are two de facto standard architectures in computer vision: Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs).
We show these approaches overlook that the optimal inductive bias also changes according to the target data scale changes.
The more convolution-like inductive bias the model includes, the smaller the data scale at which the ViT-like model outperforms ResNet.
arXiv Detail & Related papers (2022-10-04T04:20:20Z)
- LightViT: Towards Light-Weight Convolution-Free Vision Transformers [43.48734363817069]
Vision transformers (ViTs) are usually considered to be less light-weight than convolutional neural networks (CNNs).
We present LightViT as a new family of light-weight ViTs to achieve better accuracy-efficiency balance upon the pure transformer blocks without convolution.
Experiments show that our model achieves significant improvements on image classification, object detection, and semantic segmentation tasks.
arXiv Detail & Related papers (2022-07-12T14:27:57Z)
- Global Vision Transformer Pruning with Hessian-Aware Saliency [93.33895899995224]
This work challenges the common design philosophy of the Vision Transformer (ViT) model with uniform dimension across all the stacked blocks in a model stage.
We derive novel Hessian-based structural pruning criteria comparable across all layers and structures, with latency-aware regularization for direct latency reduction.
Performing iterative pruning on the DeiT-Base model leads to a new architecture family called NViT (Novel ViT), with a novel parameter redistribution that utilizes parameters more efficiently.
arXiv Detail & Related papers (2021-10-10T18:04:59Z)
- When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations [111.44860506703307]
Vision Transformers (ViTs) and MLPs signal efforts on replacing hand-wired features or inductive biases with general-purpose neural architectures.
This paper investigates ViTs and MLP-Mixers from the lens of loss geometry, intending to improve the models' data efficiency at training and inference.
We show that the improved robustness can be attributed to sparser active neurons in the first few layers.
The resultant ViTs outperform ResNets of similar size and throughput when trained from scratch on ImageNet without large-scale pretraining or strong data augmentations.
arXiv Detail & Related papers (2021-06-03T02:08:03Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.