VariViT: A Vision Transformer for Variable Image Sizes
- URL: http://arxiv.org/abs/2602.14615v1
- Date: Mon, 16 Feb 2026 10:20:46 GMT
- Title: VariViT: A Vision Transformer for Variable Image Sizes
- Authors: Aswathi Varma, Suprosanna Shit, Chinmay Prabhakar, Daniel Scholz, Hongwei Bran Li, Bjoern Menze, Daniel Rueckert, Benedikt Wiestler
- Abstract summary: Vision Transformers (ViTs) have emerged as the state-of-the-art architecture in representation learning. ViTs split images into fixed-size patches, constraining them to a predefined size and necessitating pre-processing steps like resizing, padding, or cropping. We propose VariViT, an improved ViT model crafted to handle variable image sizes while maintaining a consistent patch size.
- Score: 19.721932776618964
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers (ViTs) have emerged as the state-of-the-art architecture in representation learning, leveraging self-attention mechanisms to excel in various tasks. ViTs split images into fixed-size patches, constraining them to a predefined size and necessitating pre-processing steps like resizing, padding, or cropping. This poses challenges in medical imaging, particularly with irregularly shaped structures like tumors. A fixed bounding box crop size produces input images with highly variable foreground-to-background ratios. Resizing medical images can degrade information and introduce artefacts, impacting diagnosis. Hence, tailoring variable-sized crops to regions of interest can enhance feature representation capabilities. Moreover, large images are computationally expensive, and smaller sizes risk information loss, presenting a computation-accuracy tradeoff. We propose VariViT, an improved ViT model crafted to handle variable image sizes while maintaining a consistent patch size. VariViT employs a novel positional embedding resizing scheme for a variable number of patches. We also implement a new batching strategy within VariViT to reduce computational complexity, resulting in faster training and inference times. In our evaluations on two 3D brain MRI datasets, VariViT surpasses vanilla ViTs and ResNet in glioma genotype prediction and brain tumor classification. It achieves F1-scores of 75.5% and 76.3%, respectively, learning more discriminative features. Our proposed batching strategy reduces computation time by up to 30% compared to conventional architectures. These findings underscore the efficacy of VariViT in image representation learning. Our code can be found here: https://github.com/Aswathi-Varma/varivit
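The two mechanisms the abstract names lend themselves to a short sketch. The PyTorch code below is a minimal illustration, not the authors' implementation (the function names, the choice of trilinear interpolation, and the shape-bucketing heuristic are all assumptions): a positional-embedding grid learned for one patch count is interpolated to another, and variable-sized crops are grouped by shape so each batch stacks without padding.

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, src_grid, tgt_grid):
    """pos_embed: (1, D*H*W, C), learned for a source 3D patch grid (D, H, W).
    Returns the embedding trilinearly interpolated to a target grid."""
    _, n, c = pos_embed.shape
    d, h, w = src_grid
    assert n == d * h * w, "embedding length must match the source grid"
    grid = pos_embed.reshape(1, d, h, w, c).permute(0, 4, 1, 2, 3)   # (1, C, D, H, W)
    grid = F.interpolate(grid, size=tuple(tgt_grid), mode="trilinear",
                         align_corners=False)
    return grid.permute(0, 2, 3, 4, 1).reshape(1, -1, c)             # (1, D'*H'*W', C)

def bucket_by_shape(volumes):
    """Group variable-sized crops by shape so each batch stacks without padding."""
    buckets = {}
    for v in volumes:
        buckets.setdefault(tuple(v.shape), []).append(v)
    return [torch.stack(group) for group in buckets.values()]
```

With bucketing, a batch only ever contains same-shaped crops, so no tokens are wasted on padding; the embedding resize then lets a single backbone serve every bucket.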
Related papers
- Effect of Patch Size on Fine-Tuning Vision Transformers in Two-Dimensional and Three-Dimensional Medical Image Classification [0.7916799079378047]
We evaluate how different patch sizes affect ViT classification performance. We fine-tune ViT models and observe consistent improvements in classification performance with smaller patch sizes. Our results indicate improvements in balanced accuracy of up to 12.78% for 2D datasets and up to 23.78% for 3D datasets.
arXiv Detail & Related papers (2026-02-20T21:07:59Z)
- Accelerating Vision Transformers with Adaptive Patch Sizes [58.48800204993534]
Vision Transformers partition input images into uniformly sized patches regardless of their content. We present Adaptive Patch Transformers (APT), which addresses this by using multiple different patch sizes within the same image. APT achieves a drastic speedup in ViT inference and training, increasing throughput by 40% on ViT-L and 50% on ViT-H.
arXiv Detail & Related papers (2025-10-20T20:37:11Z)
- Embedding Radiomics into Vision Transformers for Multimodal Medical Image Classification [10.627136212959396]
Vision Transformers (ViTs) offer a powerful alternative to convolutional models by modeling long-range dependencies through self-attention. We propose the Radiomics-Embedded Vision Transformer (RE-ViT), which combines radiomic features with data-driven visual embeddings within a ViT backbone.
arXiv Detail & Related papers (2025-04-15T06:55:58Z)
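One plausible reading of the RE-ViT fusion above, sketched in PyTorch (the linear projection and token-level concatenation are assumptions, not the paper's exact design):

```python
import torch
import torch.nn as nn

class RadiomicsTokenFusion(nn.Module):
    """Prepends a projected radiomics token to the ViT patch tokens."""
    def __init__(self, n_radiomic_feats: int, embed_dim: int):
        super().__init__()
        # Project hand-crafted radiomic features into the token space.
        self.proj = nn.Linear(n_radiomic_feats, embed_dim)

    def forward(self, patch_tokens, radiomic_feats):
        # patch_tokens: (B, N, C); radiomic_feats: (B, F)
        radiomics_token = self.proj(radiomic_feats).unsqueeze(1)  # (B, 1, C)
        # Self-attention in later blocks can then mix both modalities.
        return torch.cat([radiomics_token, patch_tokens], dim=1)
```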
- AMIGO: Sparse Multi-Modal Graph Transformer with Shared-Context Processing for Representation Learning of Giga-pixel Images [53.29794593104923]
We present a novel concept of shared-context processing for whole slide histopathology images.
AMIGO uses the cellular graph within the tissue to provide a single representation for a patient.
We show that our model is strongly robust to missing information, achieving the same performance with as little as 20% of the data.
arXiv Detail & Related papers (2023-03-01T23:37:45Z)
- Delving into Masked Autoencoders for Multi-Label Thorax Disease Classification [16.635426201975587]
Vision Transformers (ViTs) have shown inferior performance to convolutional neural networks (CNNs) on medical tasks due to their data-hungry nature and the lack of annotated medical data.
In this paper, we pre-train ViTs on 266,340 chest X-rays using Masked Autoencoders (MAE) which reconstruct missing pixels from a small part of each image.
The results show that our pre-trained ViT performs comparably (sometimes better) to the state-of-the-art CNN (DenseNet-121) for multi-label thorax disease classification.
arXiv Detail & Related papers (2022-10-23T20:14:57Z)
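The masking step at the heart of MAE pre-training is compact enough to show; this is a generic sketch of the technique, not the paper's code, and the 25% keep ratio simply mirrors the usual MAE default.

```python
import torch

def random_masking(tokens: torch.Tensor, keep_ratio: float = 0.25):
    """tokens: (B, N, C). Keeps a random subset per sample for the encoder;
    a decoder later reconstructs the pixels of the masked patches."""
    b, n, _ = tokens.shape
    n_keep = max(1, int(n * keep_ratio))
    order = torch.rand(b, n, device=tokens.device).argsort(dim=1)  # random permutation
    keep_idx, mask_idx = order[:, :n_keep], order[:, n_keep:]
    kept = tokens.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
    return kept, mask_idx
```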
- Accurate Image Restoration with Attention Retractable Transformer [50.05204240159985]
We propose the Attention Retractable Transformer (ART) for image restoration.
ART presents both dense and sparse attention modules in the network.
We conduct extensive experiments on image super-resolution, denoising, and JPEG compression artifact reduction tasks.
arXiv Detail & Related papers (2022-10-04T07:35:01Z)
- Conviformers: Convolutionally guided Vision Transformer [5.964436882344729]
We present an in-depth analysis and describe the critical components for developing a system for the fine-grained categorization of plants from herbarium sheets.
We introduce a convolutional transformer architecture called Conviformer which, unlike the popular Vision Transformer (ViT), can handle higher-resolution images without exploding memory and computational cost.
With our simple yet effective approach, we achieved SoTA on the Herbarium 202x and iNaturalist 2019 datasets.
arXiv Detail & Related papers (2022-08-17T13:09:24Z)
- PatchDropout: Economizing Vision Transformers Using Patch Dropout [9.243684409949436]
We show that standard ViT models can be efficiently trained at high resolution by randomly dropping input image patches.
We observe a 5-fold savings in computation and memory using PatchDropout, along with a boost in performance.
arXiv Detail & Related papers (2022-08-10T14:08:55Z)
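PatchDropout itself is almost a one-liner; the sketch below captures the idea (names and the 0.5 keep probability are illustrative): training on a random subset of patch tokens cuts the quadratic attention cost on the rest.

```python
import torch

def patch_dropout(tokens: torch.Tensor, keep_prob: float = 0.5, training: bool = True):
    """tokens: (B, N, C); keeps a random ~keep_prob fraction of tokens per sample."""
    if not training or keep_prob >= 1.0:
        return tokens
    b, n, c = tokens.shape
    n_keep = max(1, int(n * keep_prob))
    # Random subset per sample, gathered back into a dense tensor.
    idx = torch.rand(b, n, device=tokens.device).argsort(dim=1)[:, :n_keep]
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, c))
```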
- DeepViT: Towards Deeper Vision Transformer [92.04063170357426]
Vision transformers (ViTs) have recently been applied successfully to image classification tasks.
We show that, unlike convolutional neural networks (CNNs), which can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when they are scaled deeper.
We propose a simple yet effective method, named Re-attention, to re-generate the attention maps to increase their diversity.
arXiv Detail & Related papers (2021-03-22T14:32:07Z)
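Re-attention as described above can be sketched as a learnable mixing of attention maps across heads; treat the normalization choice and initialization here as assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn as nn

class ReAttention(nn.Module):
    """Mixes a block's H attention maps across heads to restore their diversity."""
    def __init__(self, num_heads: int):
        super().__init__()
        # Identity init: the block starts out as plain attention.
        self.theta = nn.Parameter(torch.eye(num_heads))
        self.norm = nn.BatchNorm2d(num_heads)  # heads act as channels

    def forward(self, attn: torch.Tensor) -> torch.Tensor:
        # attn: (B, H, N, N) softmax attention maps.
        mixed = torch.einsum("hk,bknm->bhnm", self.theta, attn)
        return self.norm(mixed)
```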
- Scalable Visual Transformers with Hierarchical Pooling [61.05787583247392]
We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length.
This makes it possible to scale depth, width, resolution, and patch size without introducing extra computational complexity.
Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
arXiv Detail & Related papers (2021-03-19T03:55:58Z)
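The pooling step HVT applies between stages has a direct 1D analogue of CNN spatial pooling; the kernel size and pooling type below are illustrative choices, not the paper's exact configuration.

```python
import torch.nn as nn

class TokenPool(nn.Module):
    """Roughly halves the token sequence length between transformer stages."""
    def __init__(self):
        super().__init__()
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, tokens):  # tokens: (B, N, C)
        # MaxPool1d expects (B, C, N), so pool over the token axis.
        return self.pool(tokens.transpose(1, 2)).transpose(1, 2)  # (B, ~N/2, C)
```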
- Medical Transformer: Gated Axial-Attention for Medical Image Segmentation [73.98974074534497]
We study the feasibility of using Transformer-based network architectures for medical image segmentation tasks.
We propose a Gated Axial-Attention model which extends the existing architectures by introducing an additional control mechanism in the self-attention module.
To train the model effectively on medical images, we propose a Local-Global training strategy (LoGo) which further improves the performance.
arXiv Detail & Related papers (2021-02-21T18:35:14Z)
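A heavily simplified sketch of the gated axial idea: attention runs along the height axis and then the width axis, and learnable gates control how much each pass contributes. The paper places its gates on the positional terms inside the attention; gating the axial outputs, as here, is a simplification for illustration only.

```python
import torch
import torch.nn as nn

class GatedAxialBlock(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Zero-init gates: the block starts as the identity mapping.
        self.row_gate = nn.Parameter(torch.zeros(1))
        self.col_gate = nn.Parameter(torch.zeros(1))

    def forward(self, x):  # x: (B, H, W, C)
        b, h, w, c = x.shape
        rows = x.reshape(b * h, w, c)  # each row attends along the width axis
        x = x + self.row_gate * self.row_attn(rows, rows, rows)[0].reshape(b, h, w, c)
        cols = x.permute(0, 2, 1, 3).reshape(b * w, h, c)  # attend along height
        cols = self.col_attn(cols, cols, cols)[0].reshape(b, w, h, c).permute(0, 2, 1, 3)
        return x + self.col_gate * cols
```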