Making Vision Transformers Truly Shift-Equivariant
- URL: http://arxiv.org/abs/2305.16316v2
- Date: Tue, 28 Nov 2023 22:47:52 GMT
- Title: Making Vision Transformers Truly Shift-Equivariant
- Authors: Renan A. Rojas-Gomez, Teck-Yian Lim, Minh N. Do, Raymond A. Yeh
- Abstract summary: Vision Transformers (ViTs) have become one of the go-to deep net architectures for computer vision.
We introduce novel data-adaptive designs for each of the modules in ViTs, such as tokenization, self-attention, patch merging, and positional encoding.
We evaluate the proposed adaptive models on image classification and semantic segmentation tasks.
- Score: 20.61570323513044
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For computer vision, Vision Transformers (ViTs) have become one of the go-to
deep net architectures. Despite being inspired by Convolutional Neural Networks
(CNNs), ViTs' outputs remain sensitive to small spatial shifts in the input,
i.e., they are not shift-invariant. To address this shortcoming, we introduce novel
data-adaptive designs for each of the modules in ViTs, such as tokenization,
self-attention, patch merging, and positional encoding. With our proposed
modules, we achieve true shift-equivariance on four well-established ViTs,
namely, Swin, SwinV2, CvT, and MViTv2. Empirically, we evaluate the proposed
adaptive models on image classification and semantic segmentation tasks. These
models achieve competitive performance across three different datasets while
maintaining 100% shift consistency.
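As a concrete reading of the "100% shift consistency" figure above, the snippet below is a minimal sketch (not the authors' evaluation code) that estimates shift consistency of an image classifier under random circular shifts; the shift range, sampling scheme, and function name are illustrative choices.

```python
import torch

def shift_consistency(model, images, max_shift=8, n_pairs=100):
    """Fraction of pairs of random circular shifts of the same image
    on which `model` predicts the same class (1.0 = 100% consistency)."""
    model.eval()
    consistent = 0
    with torch.no_grad():
        for _ in range(n_pairs):
            idx = torch.randint(len(images), (1,)).item()
            x = images[idx].unsqueeze(0)                      # (1, C, H, W)
            preds = []
            for _ in range(2):                                # two independent circular shifts
                dy, dx = torch.randint(-max_shift, max_shift + 1, (2,)).tolist()
                x_shifted = torch.roll(x, shifts=(dy, dx), dims=(-2, -1))
                preds.append(model(x_shifted).argmax(dim=-1).item())
            consistent += int(preds[0] == preds[1])
    return consistent / n_pairs
```

A truly shift-equivariant backbone followed by global pooling scores 1.0 under circular shifts, whereas standard ViTs generally do not, which is the gap the paper targets.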
Related papers
- Reviving Shift Equivariance in Vision Transformers [12.720600348466498]
We propose an adaptive polyphase anchoring algorithm that can be seamlessly integrated into vision transformer models.
Our algorithms enable ViT and its variants, such as Twins, to achieve 100% consistency with respect to input shifts.
arXiv Detail & Related papers (2023-06-13T00:13:11Z)
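The adaptive polyphase anchoring mentioned above can be read, loosely, as choosing the patch-grid offset from the input itself, so that a shifted image selects a correspondingly shifted grid. Below is a minimal sketch under that reading, assuming non-overlapping patches and an L2 token-energy selection rule; it is not the authors' implementation.

```python
import torch
import torch.nn as nn

class AnchoredPatchEmbed(nn.Module):
    """Non-overlapping patch embedding with a data-adaptive grid anchor.

    All p*p candidate grid offsets are evaluated and, per sample, the offset
    whose tokens have the largest L2 energy is kept. Under circular shifts the
    winning offset moves with the input, so tokenization commutes with integer
    shifts (up to a shift of the token map)."""

    def __init__(self, in_ch=3, embed_dim=96, patch=4):
        super().__init__()
        self.patch = patch
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, x):                              # x: (B, C, H, W), H and W divisible by patch
        candidates, energies = [], []
        for dy in range(self.patch):
            for dx in range(self.patch):
                shifted = torch.roll(x, shifts=(-dy, -dx), dims=(-2, -1))
                tok = self.proj(shifted)               # (B, D, H/p, W/p)
                candidates.append(tok)
                energies.append(tok.flatten(1).norm(dim=1))   # (B,) token energy per sample
        candidates = torch.stack(candidates)           # (p*p, B, D, H/p, W/p)
        best = torch.stack(energies).argmax(dim=0)     # (B,) winning offset per sample
        return candidates[best, torch.arange(x.size(0), device=x.device)]   # (B, D, H/p, W/p)
```

On its own this only makes the tokenizer shift-equivariant; the main paper above also adapts self-attention, patch merging, and positional encoding so the property holds end to end.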
- $E(2)$-Equivariant Vision Transformer [11.94180035256023]
Vision Transformer (ViT) has achieved remarkable performance in computer vision.
However, the positional encoding in ViT makes it substantially harder to learn the intrinsic equivariance in the data.
We design a Group Equivariant Vision Transformer (GE-ViT) via a novel, effective positional encoding operator.
arXiv Detail & Related papers (2023-06-11T16:48:03Z)
- Holistically Explainable Vision Transformers [136.27303006772294]
We propose B-cos transformers, which inherently provide holistic explanations for their decisions.
Specifically, we formulate each model component - such as the multi-layer perceptrons, attention layers, and the tokenisation module - to be dynamic linear.
We apply our proposed design to Vision Transformers (ViTs) and show that the resulting models, dubbed Bcos-ViTs, are highly interpretable and perform competitively with baseline ViTs.
arXiv Detail & Related papers (2023-01-20T16:45:34Z)
- Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets [91.25055890980084]
There still remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets.
We propose Dynamic Hybrid Vision Transformer (DHVT) as the solution to enhance the two inductive biases.
Our DHVT achieves state-of-the-art performance with lightweight models: 85.68% on CIFAR-100 with 22.8M parameters and 82.3% on ImageNet-1K with 24.0M parameters.
arXiv Detail & Related papers (2022-10-12T06:54:39Z)
- AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens processed in the network as inference proceeds.
arXiv Detail & Related papers (2021-12-14T18:56:07Z)
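One way to read "reducing the number of tokens as inference proceeds" is a per-token halting score accumulated across blocks, with halted tokens masked out of later attention. The block below is a loose, hypothetical sketch under that assumption, not the AdaViT reference code; the halting head, threshold, and masking rule are illustrative.

```python
import torch
import torch.nn as nn

class TokenHaltingBlock(nn.Module):
    """Transformer block with a per-token halting head (illustrative only).

    Tokens whose cumulative halting score exceeds `threshold` are masked out
    of subsequent attention and left unchanged, so easy tokens stop being
    processed early and the effective inference cost shrinks."""

    def __init__(self, dim=192, heads=3, threshold=0.99):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.halt = nn.Linear(dim, 1)
        self.threshold = threshold

    def forward(self, x, cum_halt):
        # x: (B, N, D); cum_halt: (B, N) cumulative halting probability.
        # Assumes at least one token per sample (e.g., the class token) stays active.
        active = (cum_halt < self.threshold).float()            # (B, N)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, key_padding_mask=(active == 0))
        x = x + attn_out * active.unsqueeze(-1)                 # halted tokens are not updated
        x = x + self.mlp(self.norm2(x)) * active.unsqueeze(-1)
        cum_halt = cum_halt + torch.sigmoid(self.halt(x)).squeeze(-1) * active
        return x, cum_halt
```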
- Discrete Representations Strengthen Vision Transformer Robustness [43.821734467553554]
Vision Transformer (ViT) is emerging as the state-of-the-art architecture for image recognition.
We present a simple and effective architecture modification to ViT's input layer by adding discrete tokens produced by a vector-quantized encoder.
Experimental results demonstrate that adding discrete representations to four architecture variants strengthens ViT robustness by up to 12% across seven ImageNet robustness benchmarks.
arXiv Detail & Related papers (2021-11-20T01:49:56Z)
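A minimal sketch of the input-layer modification described above, assuming a frozen, pretrained vector-quantized encoder that yields one discrete code index per patch; the encoder interface and the choice to sum (rather than concatenate) discrete and continuous tokens are assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class DiscreteTokenEmbed(nn.Module):
    """Combine continuous patch embeddings with embeddings of VQ code indices.

    `vq_encoder` is a placeholder for a frozen, pretrained vector-quantized
    image encoder returning (B, N) integer code indices, one per patch."""

    def __init__(self, vq_encoder, vocab_size, embed_dim=192, patch=16, in_ch=3):
        super().__init__()
        self.vq_encoder = vq_encoder
        self.patch_embed = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)
        self.code_embed = nn.Embedding(vocab_size, embed_dim)

    def forward(self, x):                                       # x: (B, C, H, W)
        cont = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, D) continuous tokens
        with torch.no_grad():                                   # the discrete encoder stays frozen
            codes = self.vq_encoder(x)                          # (B, N) code indices (assumed shape)
        disc = self.code_embed(codes)                           # (B, N, D) discrete-token embeddings
        return cont + disc                                      # combined input tokens for the ViT
```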
- On Improving Adversarial Transferability of Vision Transformers [97.17154635766578]
Vision transformers (ViTs) process input images as sequences of patches via self-attention.
We study the adversarial feature space of ViT models and their transferability.
We introduce two novel strategies specific to the architecture of ViT models.
arXiv Detail & Related papers (2021-06-08T08:20:38Z)
- ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias [76.16156833138038]
We propose a novel Vision Transformer Advanced by Exploring intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
In each transformer layer, ViTAE has a convolution block in parallel to the multi-head self-attention module, whose features are fused and fed into the feed-forward network.
arXiv Detail & Related papers (2021-06-07T05:31:06Z)
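A rough sketch of the layer layout described above: a depthwise convolution branch runs in parallel with multi-head self-attention on the same token map, the two outputs are fused by summation, and the fused features pass through the feed-forward network. Dimensions and the fusion rule are illustrative assumptions rather than the ViTAE reference design.

```python
import torch
import torch.nn as nn

class ParallelConvAttnBlock(nn.Module):
    """Transformer block with a convolution branch parallel to self-attention."""

    def __init__(self, dim=192, heads=3, grid=14):
        super().__init__()
        self.grid = grid                                   # tokens form a grid x grid map
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = nn.Sequential(                         # local (convolutional) branch
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),
            nn.BatchNorm2d(dim),
            nn.GELU(),
        )
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                                  # x: (B, N, D) with N == grid * grid
        b, n, d = x.shape
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)                   # global branch
        conv_in = h.transpose(1, 2).reshape(b, d, self.grid, self.grid)
        conv_out = self.conv(conv_in).flatten(2).transpose(1, 2)  # back to (B, N, D)
        x = x + attn_out + conv_out                        # fuse both branches
        return x + self.ffn(self.norm2(x))                 # feed-forward on the fused features
```

Summing keeps the token count unchanged, so in this sketch the block can stand in for a standard transformer layer.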
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
- CvT: Introducing Convolutions to Vision Transformers [44.74550305869089]
Convolutional vision Transformer (CvT) improves Vision Transformer (ViT) in performance and efficiency.
The new architecture introduces convolutions into ViT to yield the best of both designs.
arXiv Detail & Related papers (2021-03-29T17:58:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.