Reviving Shift Equivariance in Vision Transformers
- URL: http://arxiv.org/abs/2306.07470v1
- Date: Tue, 13 Jun 2023 00:13:11 GMT
- Title: Reviving Shift Equivariance in Vision Transformers
- Authors: Peijian Ding, Davit Soselia, Thomas Armstrong, Jiahao Su, and Furong
Huang
- Abstract summary: We propose an adaptive polyphase anchoring algorithm that can be seamlessly integrated into vision transformer models.
Our algorithms enable ViT and its variants, such as Twins, to achieve 100% consistency with respect to input shift.
- Score: 12.720600348466498
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Shift equivariance is a fundamental principle that governs how we perceive
the world - our recognition of an object remains invariant with respect to
shifts. Transformers have gained immense popularity due to their effectiveness
in both language and vision tasks. While the self-attention operator in vision
transformers (ViT) is permutation-equivariant and thus shift-equivariant, patch
embedding, positional encoding, and subsampled attention in ViT variants can
disrupt this property, resulting in inconsistent predictions even under small
shift perturbations. Although there is a growing trend in incorporating the
inductive bias of convolutional neural networks (CNNs) into vision
transformers, it does not fully address the issue. We propose an adaptive
polyphase anchoring algorithm that can be seamlessly integrated into vision
transformer models to ensure shift-equivariance in patch embedding and
subsampled attention modules, such as window attention and global subsampled
attention. Furthermore, we utilize depth-wise convolution to encode positional
information. Our algorithms enable ViT and its variants, such as Twins, to
achieve 100% consistency with respect to input shift and to remain robust to
cropping, flipping, and affine transformations, whereas the original models
lose 20 percentage points of accuracy on average when the input is shifted by
just a few pixels, with Twins dropping from 80.57% to 62.40%.
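The abstract describes two mechanisms: an adaptive polyphase anchoring step that aligns the input to the patch grid before any strided operation, so that the selected offset moves together with the image content, and a depth-wise convolution that encodes positional information in a shift-equivariant way. The following is a minimal PyTorch sketch of both ideas, not the authors' implementation: the anchor-selection criterion (maximum l2 energy over patch-grid offsets) and the module names are assumptions, borrowed from the adaptive polyphase sampling line of work the paper builds on.
```python
import torch
import torch.nn as nn


def polyphase_anchor(x: torch.Tensor, patch: int) -> torch.Tensor:
    """Roll each image so that the patch-grid offset with the largest
    l2 energy lands at (0, 0) before the strided patch embedding.

    x: (B, C, H, W) image batch; patch: patch size (= stride of the embedding).
    Because the chosen offset is a function of the content itself, it moves
    together with the image under translation, which is what restores
    shift-equivariance up to a patch-level shift.
    """
    best, best_energy = None, None
    for dy in range(patch):
        for dx in range(patch):
            shifted = torch.roll(x, shifts=(-dy, -dx), dims=(2, 3))
            # Energy of the polyphase component selected by this offset.
            energy = shifted[:, :, ::patch, ::patch].pow(2).sum(dim=(1, 2, 3))
            if best is None:
                best, best_energy = shifted.clone(), energy
            else:
                mask = energy > best_energy  # (B,) per-sample choice
                best_energy = torch.where(mask, energy, best_energy)
                best[mask] = shifted[mask]
    return best


class DepthwiseConvPosEnc(nn.Module):
    """Positional information from a 3x3 depth-wise convolution over the
    token grid (a conditional-positional-encoding-style module), which is
    shift-equivariant by construction, unlike fixed absolute embeddings."""

    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (B, N, C) with N = h * w token embeddings.
        b, n, c = tokens.shape
        grid = tokens.transpose(1, 2).reshape(b, c, h, w)
        return tokens + self.dwconv(grid).flatten(2).transpose(1, 2)
```
Under these assumptions, an integer pixel shift of the input changes only which offset carries the maximal energy, so the anchored tensor, and hence the strided patch embedding, changes by at most a patch-level shift that the permutation-equivariant attention layers tolerate; combined with the convolutional positional encoding, this is what the consistency claims above rely on.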
Related papers
- Multi-Dimensional Hyena for Spatial Inductive Bias [69.3021852589771]
We present a data-efficient vision transformer that does not rely on self-attention.
Instead, it employs a novel generalization to multiple axes of the very recent Hyena layer.
We show that a hybrid approach that is based on Hyena N-D for the first layers in ViT, followed by layers that incorporate conventional attention, consistently boosts the performance of various vision transformer architectures.
arXiv Detail & Related papers (2023-09-24T10:22:35Z)
- 2-D SSM: A General Spatial Layer for Visual Transformers [79.4957965474334]
A central objective in computer vision is to design models with appropriate 2-D inductive bias.
We leverage an expressive variation of the multidimensional State Space Model.
Our approach introduces efficient parameterization, accelerated computation, and a suitable normalization scheme.
arXiv Detail & Related papers (2023-06-11T09:41:37Z)
- Making Vision Transformers Truly Shift-Equivariant [20.61570323513044]
Vision Transformers (ViTs) have become one of the go-to deep net architectures for computer vision.
We introduce novel data-adaptive designs for each of the modules in ViTs, such as tokenization, self-attention, patch merging, and positional encoding.
We evaluate the proposed adaptive models on image classification and semantic segmentation tasks.
arXiv Detail & Related papers (2023-05-25T17:59:40Z)
- Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks [126.33843752332139]
We introduce Group-wise Transformation towards a universal yet lightweight Transformer for vision-and-language tasks, termed LW-Transformer.
We apply LW-Transformer to a set of Transformer-based networks, and quantitatively measure them on three vision-and-language tasks and six benchmark datasets.
Experimental results show that while saving a large number of parameters and computations, LW-Transformer achieves very competitive performance against the original Transformer networks for vision-and-language tasks.
arXiv Detail & Related papers (2022-04-16T11:30:26Z)
- Iwin: Human-Object Interaction Detection via Transformer with Irregular Windows [57.00864538284686]
Iwin Transformer is a hierarchical Transformer which progressively performs token representation learning and token agglomeration within irregular windows.
The effectiveness and efficiency of Iwin Transformer are verified on the two standard HOI detection benchmark datasets.
arXiv Detail & Related papers (2022-03-20T12:04:50Z)
- AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens in vision transformers that are processed in the network as inference proceeds.
arXiv Detail & Related papers (2021-12-14T18:56:07Z)
- Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation [29.08732248577141]
We investigate the robustness of vision transformers (ViTs) through the lens of their special patch-based architectural structure.
We find that ViTs are surprisingly insensitive to patch-based transformations, even when the transformation largely destroys the original semantics.
We show that patch-based negative augmentation consistently improves robustness of ViTs across a wide set of ImageNet based robustness benchmarks.
arXiv Detail & Related papers (2021-10-15T04:53:18Z)
- Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT performs 3.8% higher than the vanilla ViT in terms of top-1 accuracy.
arXiv Detail & Related papers (2021-08-03T18:04:31Z)
- Glance-and-Gaze Vision Transformer [13.77016463781053]
We propose a new vision Transformer, named Glance-and-Gaze Transformer (GG-Transformer).
It is motivated by the Glance and Gaze behavior of human beings when recognizing objects in natural scenes.
We empirically demonstrate our method achieves consistently superior performance over previous state-of-the-art Transformers.
arXiv Detail & Related papers (2021-06-04T06:13:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.