Accelerating Vision Transformers with Adaptive Patch Sizes
- URL: http://arxiv.org/abs/2510.18091v1
- Date: Mon, 20 Oct 2025 20:37:11 GMT
- Title: Accelerating Vision Transformers with Adaptive Patch Sizes
- Authors: Rohan Choudhury, JungEun Kim, Jinhyung Park, Eunho Yang, László A. Jeni, Kris M. Kitani
- Abstract summary: Vision Transformers partition input images into uniformly sized patches regardless of their content. We present Adaptive Patch Transformers (APT), which addresses this by using multiple different patch sizes within the same image. APT achieves a drastic speedup in ViT inference and training, increasing throughput by 40% on ViT-L and 50% on ViT-H.
- Score: 58.48800204993534
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers (ViTs) partition input images into uniformly sized patches regardless of their content, resulting in long input sequence lengths for high-resolution images. We present Adaptive Patch Transformers (APT), which addresses this by using multiple different patch sizes within the same image. APT reduces the total number of input tokens by allocating larger patch sizes in more homogeneous areas and smaller patches in more complex ones. APT achieves a drastic speedup in ViT inference and training, increasing throughput by 40% on ViT-L and 50% on ViT-H while maintaining downstream performance, and can be applied to a previously fine-tuned ViT, converging in as little as 1 epoch. It also significantly reduces training and inference time without loss of performance in high-resolution dense visual tasks, achieving up to 30% faster training and inference in visual QA, object detection, and semantic segmentation.
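The allocation rule lends itself to a short sketch. Below is a minimal, illustrative quadtree-style patchifier in PyTorch that keeps one coarse token for low-variance regions and splits high-variance regions into fine patches; the variance threshold, the two patch sizes, and the downsample-to-shared-size trick are assumptions for illustration, not APT's exact mechanism.

```python
import torch
import torch.nn.functional as F

def adaptive_patchify(img: torch.Tensor, small: int = 16, large: int = 32,
                      var_thresh: float = 0.01) -> torch.Tensor:
    """Illustrative content-adaptive patching (not APT's exact rule).

    img: (C, H, W) with H and W divisible by `large`.
    Homogeneous large x large regions become one coarse token; complex
    regions are split into four small x small patches. Coarse patches are
    resized to `small` so every token shares one embedding layer.
    Returns (num_tokens, C * small * small); token count varies per image.
    """
    C, H, W = img.shape
    tokens = []
    for y in range(0, H, large):
        for x in range(0, W, large):
            block = img[:, y:y + large, x:x + large]
            if block.var() < var_thresh:   # homogeneous: one coarse token
                down = F.interpolate(block[None], size=(small, small),
                                     mode="bilinear", align_corners=False)[0]
                tokens.append(down.reshape(-1))
            else:                          # complex: four fine tokens
                fine = (block.unfold(1, small, small).unfold(2, small, small)
                        .permute(1, 2, 0, 3, 4).reshape(-1, C, small, small))
                tokens.extend(p.reshape(-1) for p in fine)
    return torch.stack(tokens)
```

Because the token count now varies per image, batching and position embeddings need care; mixed-size patches typically carry scale-aware position information.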
Related papers
- Charm: The Missing Piece in ViT fine-tuning for Image Aesthetic Assessment [36.633379840639314]
Vision transformers (ViTs) are typically trained on small, fixed-size images obtained through downscaling or cropping. We introduce Charm, a novel tokenization approach that preserves Composition, High-resolution, Aspect Ratio, and Multi-scale information simultaneously. Charm improves ViT performance and generalizability for image aesthetic assessment.
arXiv Detail & Related papers (2025-04-03T12:19:04Z)
- No Token Left Behind: Efficient Vision Transformer via Dynamic Token Idling [55.203866875294516]
Vision Transformers (ViTs) have demonstrated outstanding performance in computer vision tasks.
Various token pruning techniques have been introduced to alleviate the high computational burden of ViTs.
We propose IdleViT, a dynamic token-idle-based method that achieves an excellent trade-off between performance and efficiency.
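A minimal sketch of the idling idea follows; the importance score here is a stand-in (token norm), whereas IdleViT derives scores from CLS attention, and this version is illustrative rather than the official implementation.

```python
import torch
import torch.nn as nn

class IdleTokenBlock(nn.Module):
    """Sketch of dynamic token idling. Each layer fully processes only the
    top-k tokens; idle tokens bypass the block unchanged and stay available
    to later layers, unlike pruning, which discards them for good."""

    def __init__(self, block: nn.Module, keep_ratio: float = 0.7):
        super().__init__()
        self.block = block              # any (B, N, D) -> (B, N, D) ViT block
        self.keep_ratio = keep_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, D = x.shape
        k = max(1, int(N * self.keep_ratio))
        scores = x.norm(dim=-1)                       # toy importance score
        idx = scores.topk(k, dim=1).indices           # (B, k)
        gather_idx = idx.unsqueeze(-1).expand(-1, -1, D)
        active = torch.gather(x, 1, gather_idx)       # compute on k tokens only
        active = self.block(active)
        out = x.clone()                               # idle tokens pass through
        out.scatter_(1, gather_idx, active)
        return out
```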
arXiv Detail & Related papers (2023-10-09T12:10:41Z)
- CAIT: Triple-Win Compression towards High Accuracy, Fast Inference, and Favorable Transferability For ViTs [89.79139531731637]
Vision Transformers (ViTs) have emerged as state-of-the-art models for various vision tasks. We propose a joint compression method for ViTs that achieves a harmonious blend of high accuracy, fast inference speed, and favorable transferability to downstream tasks.
arXiv Detail & Related papers (2023-09-27T16:12:07Z)
- FlexiViT: One Model for All Patch Sizes [100.52574011880571]
Vision Transformers convert images to sequences by slicing them into patches.
The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost.
We show that simply randomizing the patch size at training time leads to a single set of weights that performs well across a wide range of patch sizes.
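A sketch of the training trick, assuming a PyTorch patch-embedding kernel: sample a patch size each step and resize the kernel to match. FlexiViT resizes with a pseudo-inverse (PI-resize) operator; plain bilinear resizing is used below for brevity.

```python
import random
import torch
import torch.nn.functional as F

def embed_with_random_patch_size(img, weight, bias, sizes=(8, 16, 32)):
    """FlexiViT-style flexible patch embedding (bilinear stand-in for PI-resize).

    img:    (B, 3, H, W), H and W divisible by every entry of `sizes`
    weight: (D, 3, p0, p0) patch-embedding kernel at the base patch size
    bias:   (D,)
    """
    p = random.choice(sizes)                       # new patch size this step
    w = F.interpolate(weight, size=(p, p), mode="bilinear", align_corners=False)
    tokens = F.conv2d(img, w, bias, stride=p)      # (B, D, H/p, W/p)
    return tokens.flatten(2).transpose(1, 2)       # (B, N, D); N varies with p
```

Position embeddings must be resized to the new token grid as well, which the sketch omits.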
arXiv Detail & Related papers (2022-12-15T18:18:38Z)
- Accelerating Vision Transformer Training via a Patch Sampling Schedule [0.685316573653194]
We introduce the notion of a Patch Sampling Schedule (PSS).
PSS varies the number of Vision Transformer (ViT) patches used per batch during training.
We observe that training with a PSS makes a ViT more robust to a wider patch sampling range during inference.
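One way to realize such a schedule, sketched below with an assumed linear ramp (the paper studies several schedules), is to keep a step-dependent random subset of patch tokens per batch.

```python
import torch

def sample_patches(tokens: torch.Tensor, step: int, total_steps: int,
                   lo: float = 0.5, hi: float = 1.0) -> torch.Tensor:
    """Illustrative Patch Sampling Schedule: the keep ratio ramps linearly
    from `lo` to `hi` over training. `tokens` is (B, N, D), CLS excluded."""
    keep = lo + (hi - lo) * step / total_steps
    n_keep = max(1, int(tokens.shape[1] * keep))
    idx = torch.randperm(tokens.shape[1], device=tokens.device)[:n_keep]
    return tokens[:, idx]                # same random subset across the batch
```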
arXiv Detail & Related papers (2022-08-19T19:16:46Z)
- Coarse-to-Fine Vision Transformer [83.45020063642235]
We propose a coarse-to-fine vision transformer (CF-ViT) to relieve computational burden while retaining performance.
Our proposed CF-ViT is motivated by two important observations in modern ViT models.
Our CF-ViT reduces the FLOPs of LV-ViT by 53% while achieving 2.01x higher throughput.
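The inference pattern can be sketched as a two-pass cascade; `model(x, patch=p)` is a hypothetical interface for running the ViT at patch size p, and the real CF-ViT reuses coarse features and re-tokenizes only the informative patches rather than the whole image.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def coarse_to_fine_infer(model, img, conf_thresh: float = 0.9):
    """Two-pass coarse-to-fine sketch for a single image, img: (1, 3, H, W)."""
    probs = F.softmax(model(img, patch=32), dim=-1)   # cheap coarse pass
    conf, pred = probs.max(dim=-1)
    if conf.item() >= conf_thresh:                    # confident: stop early
        return pred
    return model(img, patch=16).argmax(dim=-1)        # otherwise refine
```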
arXiv Detail & Related papers (2022-03-08T02:57:49Z)
- AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens in vision transformers that are processed in the network as inference proceeds.
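The mechanism is in the spirit of adaptive computation time: each token accumulates a halting score and stops being updated once it passes 1. A masked sketch follows; real speedups come from gathering the still-active tokens rather than masking, and the halting head here is an assumption.

```python
import torch
import torch.nn as nn

class HaltingBlock(nn.Module):
    """ACT-style token halting sketch: halted tokens are frozen so deeper
    layers effectively process fewer tokens."""

    def __init__(self, block: nn.Module, dim: int):
        super().__init__()
        self.block = block
        self.halt = nn.Linear(dim, 1)     # per-token halting score head

    def forward(self, x: torch.Tensor, cum_halt: torch.Tensor):
        # cum_halt: (B, N) running score; a token is active while it is < 1
        active = (cum_halt < 1.0).float().unsqueeze(-1)      # (B, N, 1)
        x = active * self.block(x) + (1.0 - active) * x      # freeze halted
        step = torch.sigmoid(self.halt(x)).squeeze(-1)       # (B, N)
        return x, cum_halt + step * active.squeeze(-1)
```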
arXiv Detail & Related papers (2021-12-14T18:56:07Z)
- A Unified Pruning Framework for Vision Transformers [40.7622551128182]
Vision transformers (ViTs) and their variants have achieved promising performance in various computer vision tasks.
We propose a unified framework, UP-ViTs, for the structural pruning of ViTs and their variants.
Our method prunes all ViT components while maintaining the consistency of the model structure.
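The consistency requirement means one keep-index for the shared embedding dimension must be applied everywhere that dimension appears, so residual connections still line up. A heavily simplified sketch under an assumed model interface:

```python
import torch

@torch.no_grad()
def prune_embed_dim(vit, keep_ratio: float = 0.75):
    """Illustrative consistency-preserving pruning, not UP-ViTs itself.
    Assumes a hypothetical `vit` exposing .layernorms, .linears_in (read the
    residual stream), and .linears_out (write it). Real code must also update
    module metadata (in_features/out_features) and attention-head shapes."""
    D = vit.layernorms[0].weight.shape[0]
    score = sum(ln.weight.abs() for ln in vit.layernorms)   # toy importance
    keep = score.topk(int(D * keep_ratio)).indices.sort().values
    for ln in vit.layernorms:
        ln.weight.data, ln.bias.data = ln.weight.data[keep], ln.bias.data[keep]
    for lin in vit.linears_in:
        lin.weight.data = lin.weight.data[:, keep]
    for lin in vit.linears_out:
        lin.weight.data = lin.weight.data[keep]
        if lin.bias is not None:
            lin.bias.data = lin.bias.data[keep]
```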
arXiv Detail & Related papers (2021-11-30T05:01:02Z)
- Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation [29.08732248577141]
We investigate the robustness of vision transformers (ViTs) through the lens of their special patch-based architectural structure.
We find that ViTs are surprisingly insensitive to patch-based transformations, even when the transformation largely destroys the original semantics.
We show that patch-based negative augmentation consistently improves robustness of ViTs across a wide set of ImageNet based robustness benchmarks.
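One such semantics-destroying transformation is patch shuffling, sketched below; the paper pairs views like this with a negative training objective (e.g., pushing predictions toward uniform), which the sketch leaves out.

```python
import torch

def patch_shuffle(img: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Randomly permute an image's patches: patch statistics survive, global
    semantics do not. img: (C, H, W), H and W divisible by `patch`."""
    C, H, W = img.shape
    gh, gw = H // patch, W // patch
    p = img.reshape(C, gh, patch, gw, patch).permute(1, 3, 0, 2, 4)
    p = p.reshape(gh * gw, C, patch, patch)[torch.randperm(gh * gw)]
    p = p.reshape(gh, gw, C, patch, patch).permute(2, 0, 3, 1, 4)
    return p.reshape(C, H, W)
```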
arXiv Detail & Related papers (2021-10-15T04:53:18Z)
- Scalable Visual Transformers with Hierarchical Pooling [61.05787583247392]
We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length.
This brings a great benefit: depth, width, resolution, and patch size can be scaled without introducing extra computational complexity.
Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
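The pooling step can be sketched as a 1D downsampling of the token sequence between stages; max pooling with stride 2 is shown as one plausible choice, halving the token count (and the quadratic attention cost) at each stage boundary.

```python
import torch
import torch.nn as nn

class TokenPool(nn.Module):
    """Hierarchical-pooling sketch: shrink the token axis between stages."""

    def __init__(self, kernel: int = 3, stride: int = 2):
        super().__init__()
        self.pool = nn.MaxPool1d(kernel, stride=stride, padding=kernel // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D) -> (B, ~N/2, D); pooling runs over the token axis
        return self.pool(x.transpose(1, 2)).transpose(1, 2)
```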
arXiv Detail & Related papers (2021-03-19T03:55:58Z)