Vanilla Group Equivariant Vision Transformer: Simple and Effective
- URL: http://arxiv.org/abs/2602.08047v1
- Date: Sun, 08 Feb 2026 16:32:48 GMT
- Title: Vanilla Group Equivariant Vision Transformer: Simple and Effective
- Authors: Jiahong Fu, Qi Xie, Deyu Meng, Zongben Xu
- Abstract summary: We propose a framework that renders key ViT components, including patch embedding, self-attention, positional encodings, and Down/Up-Sampling, equivariant. Our equivariant ViTs consistently improve performance and data efficiency across a wide spectrum of vision tasks.
- Score: 74.55314825243444
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Incorporating symmetry priors as inductive biases to design equivariant Vision Transformers (ViTs) has emerged as a promising avenue for enhancing their performance. However, existing equivariant ViTs often struggle to balance performance with equivariance, primarily due to the challenge of achieving holistic equivariant modifications across the diverse modules in ViTs, particularly in harmonizing the Self-Attention mechanism with Patch Embedding. To address this, we propose a straightforward framework that systematically renders key ViT components, including patch embedding, self-attention, positional encodings, and Down/Up-Sampling, equivariant, thereby constructing ViTs with guaranteed equivariance. The resulting architecture serves as a plug-and-play replacement that is both theoretically grounded and practically versatile, scaling seamlessly even to Swin Transformers. Extensive experiments demonstrate that our equivariant ViTs consistently improve performance and data efficiency across a wide spectrum of vision tasks.
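To make the abstract's recipe concrete, below is a minimal, illustrative sketch (not the authors' code) of two of the listed components for the simplest non-trivial case, the C4 rotation group: a patch embedding lifted to the group by sharing one projection across the four rotated copies of the input, and self-attention over the lifted tokens without absolute positional encodings, which is equivariant because attention is permutation-equivariant. The module names, shapes, and the use of PyTorch are assumptions chosen for illustration; the check at the end numerically verifies the claimed transformation law.

```python
import torch
import torch.nn as nn


class C4LiftedPatchEmbed(nn.Module):
    """Patch embedding lifted to the C4 rotation group (illustrative sketch).

    One shared projection is applied to all four 90-degree rotations of the
    input and each result is rotated back to a common frame, so rotating the
    image rotates the feature maps and cyclically shifts the group axis.
    """

    def __init__(self, in_ch=3, dim=64, patch=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                                   # x: (B, C, H, W)
        feats = []
        for k in range(4):                                   # the four elements of C4
            xr = torch.rot90(x, k, dims=(2, 3))
            f = self.proj(xr)                                # shared weights for every rotation
            feats.append(torch.rot90(f, -k, dims=(2, 3)))    # rotate back to a common frame
        return torch.stack(feats, dim=1)                     # (B, 4, dim, H/patch, W/patch)


class GroupTokenAttention(nn.Module):
    """Multi-head self-attention over (group x spatial) tokens.

    With no absolute positional encoding, attention is permutation-equivariant,
    so the token permutation induced by a rotation carries through unchanged.
    """

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens):                               # tokens: (B, N, dim)
        out, _ = self.attn(tokens, tokens, tokens)
        return out


def equivariance_check():
    torch.manual_seed(0)
    embed, attn = C4LiftedPatchEmbed(), GroupTokenAttention()
    x = torch.randn(1, 3, 32, 32)

    f = embed(x)                                             # (1, 4, 64, 8, 8)
    f_rot = embed(torch.rot90(x, 1, dims=(2, 3)))            # rotate the image by 90 degrees

    # A 90-degree input rotation should rotate every feature map and cyclically
    # shift the group axis (regular representation of C4).
    expected = torch.roll(torch.rot90(f, 1, dims=(3, 4)), shifts=-1, dims=1)
    print("patch-embedding equivariance error:", (f_rot - expected).abs().max().item())

    # Attention without positional encoding: permuting the tokens (as a rotation
    # would) permutes the output the same way, up to floating-point error.
    tok = f.flatten(3).permute(0, 1, 3, 2).reshape(1, -1, 64)  # (1, 4*64, 64)
    perm = torch.randperm(tok.shape[1])
    err = (attn(tok)[:, perm] - attn(tok[:, perm])).abs().max().item()
    print("attention permutation-equivariance error:", err)


if __name__ == "__main__":
    equivariance_check()
```

In this construction a 90-degree rotation of the input rotates each feature map and cyclically shifts the group axis, so the first check prints an error of essentially zero; the same style of argument would have to be repeated for the positional encodings and Down/Up-Sampling modules the abstract mentions, which this sketch does not cover.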
Related papers
- Equi-ViT: Rotational Equivariant Vision Transformer for Robust Histopathology Analysis [4.388994056961038]
We propose Equi-ViT, which integrates an equivariant convolution kernel into the patch embedding stage of a ViT architecture. We show that Equi-ViT achieves superior rotation-consistent patch embeddings and stable classification performance across image orientations.
arXiv Detail & Related papers (2026-01-14T04:03:20Z) - Slicing Vision Transformer for Flexible Inference [79.35046907288518]
We propose a general framework, named Scala, to enable a single network to represent multiple smaller ViTs. Scala achieves an average improvement of 1.6% on ImageNet-1K with fewer parameters.
arXiv Detail & Related papers (2024-12-06T05:31:42Z) - Multi-Dimensional Hyena for Spatial Inductive Bias [69.3021852589771]
We present a data-efficient vision transformer that does not rely on self-attention.
Instead, it employs a novel multi-axis generalization of the recent Hyena layer.
We show that a hybrid design that uses Hyena N-D for the first layers of a ViT, followed by layers with conventional attention, consistently boosts the performance of various vision transformer architectures.
arXiv Detail & Related papers (2023-09-24T10:22:35Z) - $E(2)$-Equivariant Vision Transformer [11.94180035256023]
Vision Transformer (ViT) has achieved remarkable performance in computer vision.
The positional encoding in ViT makes it substantially more difficult to learn the intrinsic equivariance in the data.
We design a Group Equivariant Vision Transformer (GE-ViT) via a novel, effective positional encoding operator.
arXiv Detail & Related papers (2023-06-11T16:48:03Z) - Making Vision Transformers Truly Shift-Equivariant [20.61570323513044]
Vision Transformers (ViTs) have become one of the go-to deep net architectures for computer vision.
We introduce novel data-adaptive designs for each of the modules in ViTs, such as tokenization, self-attention, patch merging, and positional encoding.
We evaluate the proposed adaptive models on image classification and semantic segmentation tasks.
arXiv Detail & Related papers (2023-05-25T17:59:40Z) - Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks [126.33843752332139]
We introduce Group-wise Transformation towards a universal yet lightweight Transformer for vision-and-language tasks, termed LW-Transformer.
We apply LW-Transformer to a set of Transformer-based networks, and quantitatively measure them on three vision-and-language tasks and six benchmark datasets.
Experimental results show that, while saving a large number of parameters and computations, LW-Transformer achieves highly competitive performance compared with the original Transformer networks on vision-and-language tasks.
arXiv Detail & Related papers (2022-04-16T11:30:26Z) - Adaptive Transformers for Robust Few-shot Cross-domain Face Anti-spoofing [71.06718651013965]
We present adaptive Vision Transformers (ViTs) for robust cross-domain face anti-spoofing.
We adopt ViT as a backbone to exploit its strength to account for long-range dependencies among pixels.
Experiments on several benchmark datasets show that the proposed models achieve both robust and competitive performance.
arXiv Detail & Related papers (2022-03-23T03:37:44Z) - Global Vision Transformer Pruning with Hessian-Aware Saliency [93.33895899995224]
This work challenges the common design philosophy of the Vision Transformer (ViT), which keeps a uniform dimension across all the stacked blocks in a model stage.
We derive a novel Hessian-based structural pruning criterion that is comparable across all layers and structures, together with latency-aware regularization for direct latency reduction.
Performing iterative pruning on the DeiT-Base model leads to a new architecture family called NViT (Novel ViT), with a novel parameter redistribution that utilizes parameters more efficiently.
arXiv Detail & Related papers (2021-10-10T18:04:59Z) - Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)