Equi-ViT: Rotational Equivariant Vision Transformer for Robust Histopathology Analysis
- URL: http://arxiv.org/abs/2601.09130v1
- Date: Wed, 14 Jan 2026 04:03:20 GMT
- Title: Equi-ViT: Rotational Equivariant Vision Transformer for Robust Histopathology Analysis
- Authors: Fuyao Chen, Yuexi Du, Eléonore V. Lieffrig, Nicha C. Dvornek, John A. Onofrey
- Abstract summary: We propose Equi-ViT, which integrates an equivariant convolution kernel into the patch embedding stage of a ViT architecture. We show that Equi-ViT achieves superior rotation-consistent patch embeddings and stable classification performance across image orientations.
- Score: 4.388994056961038
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers (ViTs) have gained rapid adoption in computational pathology for their ability to model long-range dependencies through self-attention, addressing the limitations of convolutional neural networks that excel at local pattern capture but struggle with global contextual reasoning. Recent pathology-specific foundation models have further advanced performance by leveraging large-scale pretraining. However, standard ViTs remain inherently non-equivariant to transformations such as rotations and reflections, which are ubiquitous variations in histopathology imaging. To address this limitation, we propose Equi-ViT, which integrates an equivariant convolution kernel into the patch embedding stage of a ViT architecture, imparting built-in rotational equivariance to learned representations. Equi-ViT achieves superior rotation-consistent patch embeddings and stable classification performance across image orientations. Our results on a public colorectal cancer dataset demonstrate that incorporating equivariant patch embedding enhances data efficiency and robustness, suggesting that equivariant transformers could potentially serve as more generalizable backbones for the application of ViT in histopathology, such as digital pathology foundation models.
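The abstract describes the core idea (an equivariant convolution kernel in the patch embedding stage) but gives no implementation details, so the following is a minimal, hypothetical PyTorch sketch of one way to realize it: a single shared kernel applied at all four 90° rotations (the cyclic group C4), with responses averaged over the group. The class name C4EquivariantPatchEmbed and all hyperparameters are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch (not the authors' implementation): a C4
# rotation-equivariant patch embedding for a ViT. One shared kernel is
# applied at all four 90-degree rotations; averaging over the group makes
# each patch response invariant to the internal rotation of its patch,
# while the token grid as a whole rotates with the input.
import torch
import torch.nn as nn
import torch.nn.functional as F


class C4EquivariantPatchEmbed(nn.Module):
    def __init__(self, in_ch: int = 3, embed_dim: int = 768, patch: int = 16):
        super().__init__()
        self.patch = patch
        # One weight tensor shared by the four rotated kernel copies.
        self.weight = nn.Parameter(
            torch.randn(embed_dim, in_ch, patch, patch) * 0.02)
        self.bias = nn.Parameter(torch.zeros(embed_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Correlate with the kernel rotated by 0/90/180/270 degrees and
        # average over the group dimension.
        outs = [F.conv2d(x, torch.rot90(self.weight, k, dims=(2, 3)),
                         self.bias, stride=self.patch) for k in range(4)]
        y = torch.stack(outs).mean(dim=0)      # (B, D, H/p, W/p)
        return y.flatten(2).transpose(1, 2)    # (B, N, D) token sequence


# Sanity check of the equivariance claim: rotating a 224x224 image by 90
# degrees should rotate the 14x14 token grid but leave each token's
# embedding numerically unchanged.
embed = C4EquivariantPatchEmbed(embed_dim=64)
x = torch.randn(1, 3, 224, 224)
tokens = embed(x)                                    # (1, 196, 64)
tokens_rot = embed(torch.rot90(x, 1, dims=(2, 3)))
grid = tokens_rot.transpose(1, 2).reshape(1, 64, 14, 14)
grid = torch.rot90(grid, -1, dims=(2, 3))            # undo grid rotation
assert torch.allclose(tokens, grid.flatten(2).transpose(1, 2), atol=1e-4)
```

Averaging over the group yields rotation-invariant channels; an alternative that preserves orientation information is to stack the four responses (the regular representation of C4), as in group-equivariant CNNs and the SE(2) framework cited below. The abstract does not state which variant Equi-ViT uses.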
Related papers
- Vanilla Group Equivariant Vision Transformer: Simple and Effective [74.55314825243444]
We propose a framework that renders key ViT components, including patch embedding, self-attention, positional encodings, and down/up-sampling, equivariant. Our equivariant ViTs consistently improve performance and data efficiency across a wide spectrum of vision tasks.
arXiv Detail & Related papers (2026-02-08T16:32:48Z) - Interpreting vision transformers via residual replacement model [8.97847158738423]
How do vision transformers (ViTs) represent and process the world? This paper addresses this long-standing question through the first systematic analysis of 6.6K features across all layers. We introduce the residual replacement model, which replaces ViT computations with interpretable features in the residual stream.
arXiv Detail & Related papers (2025-09-22T07:00:57Z) - Set Transformer Architectures and Synthetic Data Generation for Flow-Guided Nanoscale Localization [13.521075124606973]
Flow-guided localization (FGL) enables the identification of spatial regions within the human body that contain an event of diagnostic interest. Existing FGL solutions rely on graph models with fixed topologies or handcrafted features, which limit their adaptability to anatomical variability and hinder scalability. Our formulation treats nanodevices' circulation time reports as unordered sets, enabling permutation-invariant, variable-length input processing without relying on spatial priors.
arXiv Detail & Related papers (2025-08-22T08:22:25Z) - Embedding Radiomics into Vision Transformers for Multimodal Medical Image Classification [10.627136212959396]
Vision Transformers (ViTs) offer a powerful alternative to convolutional models by modeling long-range dependencies through self-attention. We propose the Radiomics-Embedded Vision Transformer (RE-ViT), which combines radiomic features with data-driven visual embeddings within a ViT backbone.
arXiv Detail & Related papers (2025-04-15T06:55:58Z) - Generalized Face Forgery Detection via Adaptive Learning for Pre-trained Vision Transformer [54.32283739486781]
We present a Forgery-aware Adaptive Vision Transformer (FA-ViT) under the adaptive learning paradigm.
FA-ViT achieves 93.83% and 78.32% AUC scores on Celeb-DF and DFDC datasets in the cross-dataset evaluation.
arXiv Detail & Related papers (2023-09-20T06:51:11Z) - Adaptive Transformers for Robust Few-shot Cross-domain Face Anti-spoofing [71.06718651013965]
We present adaptive vision transformers (ViT) for robust cross-domain face antispoofing.
We adopt ViT as a backbone to exploit its strength to account for long-range dependencies among pixels.
Experiments on several benchmark datasets show that the proposed models achieve both robust and competitive performance.
arXiv Detail & Related papers (2022-03-23T03:37:44Z) - Topographic VAEs learn Equivariant Capsules [84.33745072274942]
We introduce the Topographic VAE: a novel method for efficiently training deep generative models with topographically organized latent variables.
We show that such a model indeed learns to organize its activations according to salient characteristics such as digit class, width, and style on MNIST.
We demonstrate approximate equivariance to complex transformations, expanding upon the capabilities of existing group equivariant neural networks.
arXiv Detail & Related papers (2021-09-03T09:25:57Z) - Intriguing Properties of Vision Transformers [114.28522466830374]
Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems.
We systematically study this question via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN).
We show that the effective features of ViTs are due to the flexible and dynamic receptive fields made possible by the self-attention mechanism.
arXiv Detail & Related papers (2021-05-21T17:59:18Z) - Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z) - Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from "Vision-friendly Transformer".
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z) - Roto-Translation Equivariant Convolutional Networks: Application to Histopathology Image Analysis [11.568329857588099]
We propose a framework to encode the geometric structure of the special Euclidean motion group SE(2) in convolutional networks.
We show that a consistent increase in performance can be achieved when using the proposed framework. (A minimal sketch of the lifting layer behind this idea appears after this list.)
arXiv Detail & Related papers (2020-02-20T13:44:29Z)
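Since two entries here (Equi-ViT itself and the SE(2) paper above) hinge on rotated copies of a shared convolution kernel, the following is a minimal, hypothetical sketch of the "lifting" layer that group-equivariant CNNs of this kind typically start from, with rotations discretized to 90° steps so that grid rotation is exact. The name SE2LiftingConv and all parameters are illustrative assumptions, not code from either paper.

```python
# Hypothetical sketch of a lifting layer for a discretized SE(2) group
# CNN: the image is correlated with N rotated copies of one kernel, so
# the output gains an orientation axis. Rotating the input then rotates
# the output spatially and cyclically shifts it along that axis.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SE2LiftingConv(nn.Module):
    def __init__(self, in_ch: int = 3, out_ch: int = 16,
                 ksize: int = 5, n_rot: int = 4):
        super().__init__()
        # Only exact 90-degree grid rotations in this sketch; finer
        # angular resolutions would require interpolating the kernels.
        assert n_rot in (1, 2, 4), "only exact grid rotations supported"
        self.n_rot = n_rot
        self.pad = ksize // 2
        self.weight = nn.Parameter(
            torch.randn(out_ch, in_ch, ksize, ksize) * 0.05)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Correlate with each rotated kernel copy and stack along a new
        # orientation dimension instead of pooling it away.
        outs = [F.conv2d(x, torch.rot90(self.weight, k, dims=(2, 3)),
                         padding=self.pad) for k in range(self.n_rot)]
        return torch.stack(outs, dim=2)  # (B, C_out, n_rot, H, W)
```

Later group-convolution layers would then mix the orientation axis as well; pooling over it at the end of the network yields rotation-invariant predictions.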
This list is automatically generated from the titles and abstracts of the papers on this site.