Effect of Patch Size on Fine-Tuning Vision Transformers in Two-Dimensional and Three-Dimensional Medical Image Classification
- URL: http://arxiv.org/abs/2602.18614v1
- Date: Fri, 20 Feb 2026 21:07:59 GMT
- Title: Effect of Patch Size on Fine-Tuning Vision Transformers in Two-Dimensional and Three-Dimensional Medical Image Classification
- Authors: Massoud Dehghan, Ramona Woitek, Amirreza Mahbod
- Abstract summary: We evaluate how different patch sizes affect ViT classification performance. We fine-tune ViT models and observe consistent improvements in classification performance with smaller patch sizes. Our results indicate improvements in balanced accuracy of up to 12.78% for 2D datasets and up to 23.78% for 3D datasets.
- Score: 0.7916799079378047
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers (ViTs) and their variants have become state-of-the-art in many computer vision tasks and are widely used as backbones in large-scale vision and vision-language foundation models. While substantial research has focused on architectural improvements, the impact of patch size, a crucial initial design choice in ViTs, remains underexplored, particularly in medical domains where both two-dimensional (2D) and three-dimensional (3D) imaging modalities exist. In this study, using 12 medical imaging datasets from various imaging modalities (including seven 2D and five 3D datasets), we conduct a thorough evaluation of how different patch sizes affect ViT classification performance. Using a single graphics processing unit (GPU) and a range of patch sizes (1, 2, 4, 7, 14, 28), we fine-tune ViT models and observe consistent improvements in classification performance with smaller patch sizes (1, 2, and 4), which achieve the best results across nearly all datasets. More specifically, our results indicate improvements in balanced accuracy of up to 12.78% for 2D datasets (patch size 2 vs. 28) and up to 23.78% for 3D datasets (patch size 1 vs. 14), at the cost of increased computation. Moreover, by applying a straightforward ensemble strategy that fuses the predictions of the models trained with patch sizes 1, 2, and 4, we demonstrate a further boost in performance in most cases, especially for the 2D datasets. Our implementation is publicly available on GitHub: https://github.com/HealMaDe/MedViT
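To make the two moving parts of this setup concrete, the sketch below illustrates, in plain PyTorch with hypothetical module and function names (none of this is taken from the authors' released code), how the token count grows quadratically as the patch size shrinks and how a simple ensemble can fuse the softmax predictions of the models fine-tuned with patch sizes 1, 2, and 4.

```python
import torch
import torch.nn as nn

# Token count grows quadratically as the patch size shrinks:
# a 28x28 input yields (28 / p)^2 patch tokens, e.g. 784 tokens at p=1 vs. 1 token at p=28.
def num_tokens(img_size: int, patch_size: int) -> int:
    assert img_size % patch_size == 0
    return (img_size // patch_size) ** 2

# Minimal ViT patch embedding: a strided convolution mapping each p x p patch to an embedding.
class PatchEmbed(nn.Module):
    def __init__(self, patch_size: int = 2, in_chans: int = 1, embed_dim: int = 768):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, D, H/p, W/p)
        return x.flatten(2).transpose(1, 2)    # (B, N, D) token sequence

# Simple ensemble: average the class probabilities of models fine-tuned with patch sizes 1, 2, 4.
@torch.no_grad()
def ensemble_predict(models, images):
    probs = [torch.softmax(m(images), dim=-1) for m in models]   # each (B, num_classes)
    return torch.stack(probs).mean(dim=0).argmax(dim=-1)         # fused class prediction
```

Whether the released implementation averages probabilities, averages logits, or uses a different fusion rule is not stated in the abstract; averaging class probabilities is just one straightforward instantiation of such an ensemble.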
Related papers
- VariViT: A Vision Transformer for Variable Image Sizes [19.721932776618964]
Vision Transformers (ViTs) have emerged as the state-of-the-art architecture in representation learning. ViTs split images into fixed-size patches, constraining them to a predefined size and necessitating pre-processing steps like resizing, padding, or cropping. We propose VariViT, an improved ViT model crafted to handle variable image sizes while maintaining a consistent patch size.
arXiv Detail & Related papers (2026-02-16T10:20:46Z)
- DART: Differentiable Dynamic Adaptive Region Tokenizer for Vision Foundation Models [45.12546316524245]
We introduce DART, a fully differentiable Dynamic Region Adaptive Tokenizer. DART employs learnable region scores and quantile-based partitioning to create content-aware patches of varying sizes. The impact of this approach is profound: DART-Small matches the performance of DeiT-Base with nearly double the inference speed.
arXiv Detail & Related papers (2025-06-12T06:25:37Z)
- Tri-Plane Mamba: Efficiently Adapting Segment Anything Model for 3D Medical Images [16.55283939924806]
General networks for 3D medical image segmentation have recently undergone extensive exploration.
The Segment Anything Model (SAM) has achieved superior performance in 2D medical image segmentation tasks.
We present two major innovations: 1) multi-scale 3D convolutional adapters, optimized for efficiently processing local depth-level information, and 2) a tri-plane mamba module, engineered to capture long-range depth-level representation.
arXiv Detail & Related papers (2024-09-13T02:37:13Z)
- Implicit-Zoo: A Large-Scale Dataset of Neural Implicit Functions for 2D Images and 3D Scenes [65.22070581594426]
"Implicit-Zoo" is a large-scale dataset requiring thousands of GPU training days to facilitate research and development in this field.
We showcase two immediate benefits as it enables to: (1) learn token locations for transformer models; (2) directly regress 3D cameras poses of 2D images with respect to NeRF models.
This in turn leads to an improved performance in all three task of image classification, semantic segmentation, and 3D pose regression, thereby unlocking new avenues for research.
arXiv Detail & Related papers (2024-06-25T10:20:44Z)
- Diff3Dformer: Leveraging Slice Sequence Diffusion for Enhanced 3D CT Classification with Transformer Networks [5.806035963947936]
We propose a Diffusion-based 3D Vision Transformer (Diff3Dformer) to aggregate repetitive information within 3D CT scans.
Our method exhibits improved performance on two small datasets of 3D lung CT scans of different sizes.
arXiv Detail & Related papers (2024-06-24T23:23:18Z)
- Leveraging Large-Scale Pretrained Vision Foundation Models for Label-Efficient 3D Point Cloud Segmentation [67.07112533415116]
We present a novel framework that adapts various foundational models for the 3D point cloud segmentation task.
Our approach involves making initial predictions of 2D semantic masks using different large vision models.
To generate robust 3D semantic pseudo labels, we introduce a semantic label fusion strategy that effectively combines all the results via voting.
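As a minimal sketch of what such a voting-based fusion could look like (hypothetical array names and shapes, assuming each foundation model's 2D masks have already been projected to per-point labels; the paper's actual strategy may weight or filter the votes differently):

```python
import numpy as np

def fuse_labels_by_voting(per_model_labels: np.ndarray, num_classes: int) -> np.ndarray:
    """per_model_labels: (num_models, num_points) integer labels, one row per vision model."""
    num_points = per_model_labels.shape[1]
    votes = np.zeros((num_points, num_classes), dtype=np.int64)
    for labels in per_model_labels:                 # accumulate one vote per model per point
        votes[np.arange(num_points), labels] += 1
    return votes.argmax(axis=1)                     # majority-vote pseudo label for each 3D point
```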
arXiv Detail & Related papers (2023-11-03T15:41:15Z)
- Video Pretraining Advances 3D Deep Learning on Chest CT Tasks [63.879848037679224]
Pretraining on large natural image classification datasets has aided model development on data-scarce 2D medical tasks.
These 2D models have been surpassed by 3D models on 3D computer vision benchmarks.
We show video pretraining for 3D models can enable higher performance on smaller datasets for 3D medical tasks.
arXiv Detail & Related papers (2023-04-02T14:46:58Z)
- Adapting Pre-trained Vision Transformers from 2D to 3D through Weight Inflation Improves Medical Image Segmentation [19.693778706169752]
We use a weight inflation strategy to adapt pre-trained Transformers from 2D to 3D, retaining the benefit of both transfer learning and depth information.
Our approach achieves state-of-the-art performance across a broad range of 3D medical image datasets.
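As an illustration of the general weight-inflation idea (a plain-PyTorch sketch under my own assumptions, not the authors' exact recipe), a pre-trained 2D patch-embedding convolution can be replicated along a new depth axis and rescaled so that a depth-constant input yields the same response as in 2D:

```python
import torch
import torch.nn as nn

def inflate_patch_embed(conv2d: nn.Conv2d, depth: int) -> nn.Conv3d:
    """Inflate a 2D patch-embedding conv into 3D by replicating its kernel over depth."""
    conv3d = nn.Conv3d(conv2d.in_channels, conv2d.out_channels,
                       kernel_size=(depth, *conv2d.kernel_size),
                       stride=(depth, *conv2d.stride),
                       bias=conv2d.bias is not None)
    with torch.no_grad():
        # Repeat the 2D kernel along depth and divide by depth so that a volume that is
        # constant along depth produces the same activations as the original 2D layer.
        w2d = conv2d.weight                                        # (out, in, kH, kW)
        conv3d.weight.copy_(w2d.unsqueeze(2).repeat(1, 1, depth, 1, 1) / depth)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d
```

An alternative is "centered" inflation, where the 2D kernel occupies a single depth slice and the remaining slices are zero; this summary does not say which variant the paper adopts.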
arXiv Detail & Related papers (2023-02-08T19:38:13Z)
- MedMNIST v2: A Large-Scale Lightweight Benchmark for 2D and 3D Biomedical Image Classification [59.10015984688104]
MedMNIST v2 is a large-scale MNIST-like dataset collection of standardized biomedical images.
The resulting dataset consists of 708,069 2D images and 10,214 3D images in total.
arXiv Detail & Related papers (2021-10-27T22:02:04Z)
- Efficient Self-supervised Vision Transformers for Representation Learning [86.57557009109411]
We show that multi-stage architectures with sparse self-attentions can significantly reduce modeling complexity.
We propose a new pre-training task of region matching which allows the model to capture fine-grained region dependencies.
Our results show that combining the two techniques, EsViT achieves 81.3% top-1 on the ImageNet linear probe evaluation.
arXiv Detail & Related papers (2021-06-17T19:57:33Z)
- Synthetic Training for Monocular Human Mesh Recovery [100.38109761268639]
This paper aims to estimate the 3D meshes of multiple body parts with large scale differences from a single RGB image.
The main challenge is the lack of training data with complete 3D annotations of all body parts in 2D images.
We propose a depth-to-scale (D2S) projection to incorporate the depth difference into the projection function to derive per-joint scale variants.
arXiv Detail & Related papers (2020-10-27T03:31:35Z)