$E(2)$-Equivariant Vision Transformer
- URL: http://arxiv.org/abs/2306.06722v3
- Date: Fri, 7 Jul 2023 06:59:26 GMT
- Title: $E(2)$-Equivariant Vision Transformer
- Authors: Renjun Xu and Kaifan Yang and Ke Liu and Fengxiang He
- Abstract summary: Vision Transformer (ViT) has achieved remarkable performance in computer vision.
However, positional encoding in ViT makes it substantially difficult to learn the intrinsic equivariance in data.
We design a Group Equivariant Vision Transformer (GE-ViT) via a novel, effective positional encoding operator.
- Score: 11.94180035256023
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformer (ViT) has achieved remarkable performance in computer
vision. However, positional encoding in ViT makes it substantially difficult to
learn the intrinsic equivariance in data. Initial attempts have been made at
designing equivariant ViTs, but this paper proves that they are defective in some cases.
To address this issue, we design a Group Equivariant Vision Transformer
(GE-ViT) via a novel, effective positional encoding operator. We prove that
GE-ViT meets all the theoretical requirements of an equivariant neural network.
Comprehensive experiments are conducted on standard benchmark datasets,
demonstrating that GE-ViT significantly outperforms non-equivariant
self-attention networks. The code is available at
https://github.com/ZJUCDSYangKaifan/GEVit.
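To make the equivariance claim concrete: a layer is equivariant when applying a group action to the input and then the layer gives the same result as applying the layer and then the group action. The sketch below checks exactly that property for a toy self-attention layer whose positional bias depends only on the distance between patch centres; it is an illustration of the requirement, not the positional encoding operator proposed in the paper, and all sizes and helper names are assumptions.

```python
# Minimal sketch (not the GE-ViT operator from the paper): self-attention whose
# positional bias depends only on the distance between patch centres, so the
# layer commutes with 90-degree rotations of the patch grid. Sizes and names
# (rot90 permutation, bucketed distance bias) are illustrative assumptions.
import torch

torch.manual_seed(0)
H = W = 4          # 4x4 patch grid
D = 8              # feature dimension per token
N = H * W

# Patch-centre coordinates and a pairwise-distance positional bias.
coords = torch.stack(
    torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij"), -1
).reshape(N, 2).float()
dist = torch.cdist(coords, coords)                       # (N, N), rotation-invariant geometry
bias_table = torch.randn(int(dist.max().item() * 10) + 1)
bias = bias_table[(dist * 10).round().long().clamp(max=bias_table.numel() - 1)]

Wq, Wk, Wv = (torch.randn(D, D) * 0.1 for _ in range(3))

def attention(x):                                        # x: (N, D) tokens
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    logits = q @ k.T / D**0.5 + bias                     # content term + distance-only bias
    return torch.softmax(logits, dim=-1) @ v

# Permutation of token indices induced by a 90-degree rotation of the grid.
idx = torch.arange(N).reshape(H, W)
perm = torch.rot90(idx, 1, (0, 1)).reshape(-1)

x = torch.randn(N, D)
rotate_after = attention(x)[perm]                        # apply the layer, then rotate
rotate_before = attention(x[perm])                       # rotate, then apply the layer
print(torch.allclose(rotate_after, rotate_before, atol=1e-5))  # True: equivariance holds
```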
Related papers
- ViTs are Everywhere: A Comprehensive Study Showcasing Vision Transformers in Different Domain [0.0]
Vision Transformers (ViTs) are becoming increasingly popular and dominant solutions for many vision problems.
ViTs can overcome several possible difficulties of convolutional neural networks (CNNs).
arXiv Detail & Related papers (2023-10-09T12:31:30Z)
- PriViT: Vision Transformers for Fast Private Inference [55.36478271911595]
The Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models for computer vision applications.
ViTs are ill-suited for private inference using secure multi-party computation protocols, due to the large number of non-polynomial operations.
We propose PriViT, an algorithm to selectively "Taylorize" nonlinearities in ViTs while maintaining their prediction accuracy.
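As a rough illustration of what "Taylorizing" a nonlinearity means in practice, the sketch below swaps every GELU in a model for a low-degree polynomial, which is the kind of MPC-friendly substitution the summary refers to. It is not PriViT's selective algorithm, and the quadratic coefficients are illustrative placeholders rather than a fitted approximation.

```python
# Illustrative sketch only: PriViT selects *which* nonlinearities to replace and how;
# this just shows the basic substitution of GELU by a low-degree polynomial.
# The quadratic coefficients are placeholders, not PriViT's values.
import torch
import torch.nn as nn

class PolyGELU(nn.Module):
    """Quadratic stand-in for GELU: x -> a*x^2 + b*x + c (polynomials are cheap in MPC)."""
    def __init__(self, a=0.125, b=0.5, c=0.0):
        super().__init__()
        self.a, self.b, self.c = a, b, c

    def forward(self, x):
        return self.a * x * x + self.b * x + self.c

def taylorize(module: nn.Module) -> nn.Module:
    """Recursively replace every nn.GELU in a model with PolyGELU."""
    for name, child in module.named_children():
        if isinstance(child, nn.GELU):
            setattr(module, name, PolyGELU())
        else:
            taylorize(child)
    return module

mlp = nn.Sequential(nn.Linear(16, 32), nn.GELU(), nn.Linear(32, 16))
print(taylorize(mlp))   # the GELU is now a PolyGELU
```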
arXiv Detail & Related papers (2023-10-06T21:45:05Z)
- Multi-Dimensional Hyena for Spatial Inductive Bias [69.3021852589771]
We present a data-efficient vision transformer that does not rely on self-attention.
Instead, it employs a novel generalization of the recent Hyena layer to multiple axes.
We show that a hybrid approach that is based on Hyena N-D for the first layers in ViT, followed by layers that incorporate conventional attention, consistently boosts the performance of various vision transformer architectures.
arXiv Detail & Related papers (2023-09-24T10:22:35Z)
- Making Vision Transformers Truly Shift-Equivariant [20.61570323513044]
Vision Transformers (ViTs) have become one of the go-to deep net architectures for computer vision.
We introduce novel data-adaptive designs for each of the modules in ViTs, such as tokenization, self-attention, patch merging, and positional encoding.
We evaluate the proposed adaptive models on image classification and semantic segmentation tasks.
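For context on why standard ViTs are not truly shift-equivariant, the minimal check below shows that a vanilla strided patch tokenizer commutes with shifts by whole patches but breaks for sub-patch shifts; it illustrates the problem the data-adaptive designs target, not the paper's solution, and the sizes are arbitrary.

```python
# Minimal check (not the paper's data-adaptive modules): a standard strided patch
# embedding is equivariant to shifts by whole patches but not to sub-patch shifts.
import torch
import torch.nn as nn

patch = 4
embed = nn.Conv2d(3, 8, kernel_size=patch, stride=patch)  # vanilla ViT tokenizer
x = torch.randn(1, 3, 32, 32)

def tokens(img):
    return embed(img)                                      # (1, 8, 8, 8) token grid

# Circular shift by one full patch: the token grid is the same set of tokens, rolled.
full = torch.roll(x, shifts=patch, dims=-1)
print(torch.allclose(tokens(full), torch.roll(tokens(x), shifts=1, dims=-1), atol=1e-5))  # True

# Circular shift by one pixel: patch windows no longer line up, so equivariance is lost.
sub = torch.roll(x, shifts=1, dims=-1)
print(torch.allclose(tokens(sub), torch.roll(tokens(x), shifts=1, dims=-1), atol=1e-5))   # False
```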
arXiv Detail & Related papers (2023-05-25T17:59:40Z)
- Self-Distilled Vision Transformer for Domain Generalization [58.76055100157651]
Vision transformers (ViTs) are challenging the supremacy of CNNs on standard benchmarks.
We propose a simple domain generalization (DG) approach for ViTs, coined self-distillation for ViTs.
We empirically demonstrate notable performance gains with different DG baselines and various ViT backbones on five challenging datasets.
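A hedged sketch of one common way to realize self-distillation inside a ViT follows: intermediate blocks are trained to match the softened prediction of the final block via a KL term. This only illustrates the idea in the summary, not necessarily the paper's exact objective; the block stack, shared head, and temperature are assumed placeholders.

```python
# Sketch of self-distillation across depth: earlier blocks mimic the final block's
# softened prediction. Illustration only, not the paper's exact loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

depth, dim, n_cls, T = 4, 32, 10, 3.0
blocks = nn.ModuleList(
    [nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True) for _ in range(depth)]
)
head = nn.Linear(dim, n_cls)                     # shared classification head

def forward_with_self_distillation(tokens):     # tokens: (B, N, dim), tokens[:, 0] = class token
    intermediate_logits = []
    for blk in blocks:
        tokens = blk(tokens)
        intermediate_logits.append(head(tokens[:, 0]))
    final_logits = intermediate_logits[-1]
    # Distill the final (teacher) prediction into every earlier (student) block.
    distill = sum(
        F.kl_div(F.log_softmax(l / T, -1), F.softmax(final_logits.detach() / T, -1),
                 reduction="batchmean")
        for l in intermediate_logits[:-1]
    ) / (depth - 1)
    return final_logits, distill

logits, self_distill_loss = forward_with_self_distillation(torch.randn(2, 17, dim))
print(logits.shape, float(self_distill_loss))
```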
arXiv Detail & Related papers (2022-07-25T17:57:05Z)
- Vision Transformer Adapter for Dense Predictions [57.590511173416445]
Vision Transformer (ViT) achieves inferior performance on dense prediction tasks because it lacks image-related priors.
We propose a Vision Transformer Adapter (ViT-Adapter) which can remedy the defects of ViT and achieve comparable performance to vision-specific models.
We verify the effectiveness of our ViT-Adapter on multiple downstream tasks, including object detection, instance segmentation, and semantic segmentation.
arXiv Detail & Related papers (2022-05-17T17:59:11Z)
- Discrete Representations Strengthen Vision Transformer Robustness [43.821734467553554]
Vision Transformer (ViT) is emerging as the state-of-the-art architecture for image recognition.
We present a simple and effective architecture modification to ViT's input layer by adding discrete tokens produced by a vector-quantized encoder.
Experimental results demonstrate that adding discrete representations to four architecture variants strengthens ViT robustness by up to 12% across seven ImageNet robustness benchmarks.
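To illustrate the discrete-token idea, the sketch below snaps continuous patch features to their nearest codebook entry and feeds the resulting code embeddings to the transformer alongside the continuous ones. The paper uses a pretrained vector-quantized encoder; the random codebook, the concatenation strategy, and the sizes here are assumptions.

```python
# Sketch of discrete tokens: features are quantized to their nearest codebook entry.
# The random codebook and the way the two streams are combined are assumptions.
import torch

B, N, D, K = 2, 16, 32, 512                      # batch, patches, dim, codebook size
codebook = torch.randn(K, D)                     # stand-in for a learned VQ codebook
patch_feats = torch.randn(B, N, D)               # continuous patch features

# Nearest-neighbour quantization: index of the closest code for every patch.
dists = torch.cdist(patch_feats.reshape(-1, D), codebook)   # (B*N, K)
codes = dists.argmin(dim=-1).reshape(B, N)                  # discrete token ids
discrete_embed = codebook[codes]                            # (B, N, D) quantized features

# One simple way to combine the two streams before the transformer encoder.
tokens = torch.cat([patch_feats, discrete_embed], dim=1)    # (B, 2N, D)
print(codes.shape, tokens.shape)
```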
arXiv Detail & Related papers (2021-11-20T01:49:56Z)
- Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT achieves 3.8% higher top-1 accuracy than the vanilla ViT.
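A rough sketch of the iterative sampling idea (not PS-ViT's exact module): start from a regular grid of sampling points, sample features there, and let a small head predict offsets that move the points over a few refinement steps. The offset head, step size, and number of iterations are assumptions.

```python
# Sketch of progressive sampling: refine sampling locations over a few iterations.
# The offset head, step size, and iteration count are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

C, H, W, n = 32, 28, 28, 7                      # feature map and a 7x7 sampling grid
feat = torch.randn(1, C, H, W)
offset_head = nn.Linear(C, 2)                   # predicts (dx, dy) per sampled token

# Regular initial grid in normalized [-1, 1] coordinates, shape (1, n, n, 2).
ys, xs = torch.meshgrid(torch.linspace(-1, 1, n), torch.linspace(-1, 1, n), indexing="ij")
grid = torch.stack([xs, ys], dim=-1).unsqueeze(0)

for _ in range(3):                              # a few progressive refinement steps
    tokens = F.grid_sample(feat, grid, align_corners=True)      # (1, C, n, n)
    offsets = offset_head(tokens.permute(0, 2, 3, 1))           # (1, n, n, 2)
    grid = (grid + 0.1 * torch.tanh(offsets)).clamp(-1, 1)      # nudge sampling points

print(tokens.shape, grid.shape)
```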
arXiv Detail & Related papers (2021-08-03T18:04:31Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
- CvT: Introducing Convolutions to Vision Transformers [44.74550305869089]
Convolutional vision Transformer (CvT) improves Vision Transformer (ViT) in performance and efficiency.
The new architecture introduces convolutions into ViT to yield the best of both designs.
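The sketch below mirrors the convolutional token embedding idea only in spirit: an overlapping strided convolution produces the token sequence instead of non-overlapping patch slicing. Kernel size, stride, and dimensions are illustrative rather than the paper's configuration.

```python
# Simplified sketch of convolutions in ViT: overlapping strided convolution as the
# tokenizer. Kernel, stride, and dims are illustrative, not CvT's configuration.
import torch
import torch.nn as nn

class ConvTokenEmbedding(nn.Module):
    def __init__(self, in_ch=3, dim=64, kernel=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=kernel, stride=stride, padding=kernel // 2)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img):                       # (B, C, H, W)
        feat = self.proj(img)                     # (B, dim, H', W') overlapping "patches"
        tokens = feat.flatten(2).transpose(1, 2)  # (B, H'*W', dim) sequence for attention
        return self.norm(tokens)

tokens = ConvTokenEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)                               # torch.Size([1, 3136, 64])
```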
arXiv Detail & Related papers (2021-03-29T17:58:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.