PatchDropout: Economizing Vision Transformers Using Patch Dropout
- URL: http://arxiv.org/abs/2208.07220v1
- Date: Wed, 10 Aug 2022 14:08:55 GMT
- Title: PatchDropout: Economizing Vision Transformers Using Patch Dropout
- Authors: Yue Liu, Christos Matsoukas, Fredrik Strand, Hossein Azizpour, Kevin Smith
- Abstract summary: We show that standard ViT models can be efficiently trained at high resolution by randomly dropping input image patches.
We observe a 5 times savings in computation and memory using PatchDropout, along with a boost in performance.
- Score: 9.243684409949436
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision transformers have demonstrated the potential to outperform CNNs in a
variety of vision tasks. But the computational and memory requirements of these
models prohibit their use in many applications, especially those that depend on
high-resolution images, such as medical image classification. Efforts to train
ViTs more efficiently are overly complicated, necessitating architectural
changes or intricate training schemes. In this work, we show that standard ViT
models can be efficiently trained at high resolution by randomly dropping input
image patches. This simple approach, PatchDropout, reduces FLOPs and memory by
at least 50% in standard natural image datasets such as ImageNet, and those
savings only increase with image size. On CSAW, a high-resolution medical
dataset, we observe a 5 times savings in computation and memory using
PatchDropout, along with a boost in performance. For practitioners with a fixed
computational or memory budget, PatchDropout makes it possible to choose image
resolution, hyperparameters, or model size to get the most performance out of
their model.
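To make the core idea concrete, here is a minimal PyTorch-style sketch of random patch dropout. The function name, keep ratio, and placement (after positional embeddings, before the encoder) are illustrative assumptions, not the paper's exact implementation.

```python
import torch


def patch_dropout(tokens: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Randomly keep a subset of patch tokens (illustrative sketch).

    tokens: (batch, num_patches, dim) patch embeddings, assumed to already
    include positional information and to exclude any CLS token.
    Returns (batch, num_kept, dim); running the transformer encoder on the
    shorter sequence is where the FLOP and memory savings come from.
    """
    batch, num_patches, dim = tokens.shape
    num_keep = max(1, int(num_patches * keep_ratio))

    # Sample an independent random permutation of patches for each image
    # and keep the first `num_keep` positions.
    noise = torch.rand(batch, num_patches, device=tokens.device)
    keep_idx = noise.argsort(dim=1)[:, :num_keep]

    return tokens.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, dim))
```

In this reading, the dropout would be applied only during training; see the paper for the keep rates actually used and how evaluation is handled.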
Related papers
- ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models [77.59651787115546]
High-resolution Large Multimodal Models (LMMs) encounter the challenges of excessive visual tokens and quadratic visual complexity.
We propose ConvLLaVA, which employs ConvNeXt, a hierarchical backbone, as the visual encoder of LMM.
ConvLLaVA compresses high-resolution images into information-rich visual features, effectively preventing the generation of excessive visual tokens.
arXiv Detail & Related papers (2024-05-24T17:34:15Z)
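The "excessive visual tokens" problem can be illustrated with simple token-count arithmetic; the strides below are generic assumptions (a /14 ViT patchifier versus a hierarchical backbone with total stride 32), not numbers taken from the ConvLLaVA paper.

```python
def num_visual_tokens(height: int, width: int, stride: int) -> int:
    """Token count when each visual token covers a stride x stride region."""
    return (height // stride) * (width // stride)


# Illustrative comparison at a 1536 x 1536 input:
vit_tokens = num_visual_tokens(1536, 1536, stride=14)    # ViT-style patchify -> 11881 tokens
conv_tokens = num_visual_tokens(1536, 1536, stride=32)   # hierarchical backbone -> 2304 tokens
print(vit_tokens, conv_tokens)
```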
- MeLo: Low-rank Adaptation is Better than Fine-tuning for Medical Image Diagnosis [63.59184480010552]
Vision Transformers (ViT) have become much larger and less accessible to medical imaging communities.
MeLo (Medical image Low-rank adaptation) adopts low-rank adaptation instead of resource-demanding fine-tuning.
Our proposed method achieves comparable performance to fully fine-tuned ViT models on four distinct medical imaging datasets.
arXiv Detail & Related papers (2023-11-14T15:18:54Z)
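MeLo builds on low-rank adaptation (LoRA). The sketch below shows the generic LoRA pattern of a frozen linear layer plus a trainable low-rank update; the rank, scaling, and placement inside the ViT are assumptions here, not details from the MeLo paper.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # pretrained weights stay frozen
            p.requires_grad_(False)
        self.down = nn.Linear(base.in_features, rank, bias=False)   # A
        self.up = nn.Linear(rank, base.out_features, bias=False)    # B
        nn.init.zeros_(self.up.weight)        # adapter starts as an identity update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))
```

Only `down` and `up` receive gradients, which is what keeps the number of trainable parameters and the stored adapter checkpoint small.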
- Reversible Vision Transformers [74.3500977090597]
Reversible Vision Transformers are a memory-efficient architecture for visual recognition.
We adapt two popular models, namely Vision Transformer and Multiscale Vision Transformers, to reversible variants.
We find that the additional computational burden of recomputing activations is more than overcome for deeper models.
arXiv Detail & Related papers (2023-02-09T18:59:54Z)
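The memory saving in reversible architectures comes from reconstructing a block's inputs from its outputs during the backward pass instead of caching activations. Below is a generic two-stream additive coupling in that spirit; it is not the exact Reversible ViT formulation.

```python
import torch.nn as nn


class ReversibleBlock(nn.Module):
    """Two-stream additive coupling: inputs are recoverable from outputs,
    so intermediate activations need not be stored for backpropagation."""

    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f = f   # e.g. an attention sub-block
        self.g = g   # e.g. an MLP sub-block

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        # Used during the backward pass to recompute the inputs on the fly.
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2
```

The recomputation in `inverse` is the "additional computational burden" mentioned above; the finding is that for deeper models the memory savings outweigh it.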
- Patch Gradient Descent: Training Neural Networks on Very Large Images [13.969180905165533]
We propose Patch Gradient Descent (PatchGD) to train existing CNN architectures on large-scale images.
PatchGD is based on the hypothesis that instead of performing gradient-based updates on an entire image at once, it should be possible to achieve a good solution by performing model updates on only small parts of the image.
Our evaluation shows that PatchGD is much more stable and efficient than the standard gradient-descent method in handling large images.
arXiv Detail & Related papers (2023-01-31T18:04:35Z)
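As a rough illustration of that hypothesis, the sketch below updates a model from randomly sampled crops of one very large image rather than from the full image. This is a simplification: the actual PatchGD method additionally maintains a running encoding of the whole image across patch-level steps.

```python
import torch


def patchwise_updates(model, optimizer, loss_fn, image, label,
                      crop_size: int = 512, num_steps: int = 4):
    """Illustrative only: gradient steps computed on crops of one large image."""
    _, _, height, width = image.shape
    for _ in range(num_steps):
        top = torch.randint(0, height - crop_size + 1, (1,)).item()
        left = torch.randint(0, width - crop_size + 1, (1,)).item()
        crop = image[:, :, top:top + crop_size, left:left + crop_size]

        optimizer.zero_grad()
        loss = loss_fn(model(crop), label)   # only the crop is held in memory
        loss.backward()
        optimizer.step()
```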
- FlexiViT: One Model for All Patch Sizes [100.52574011880571]
Vision Transformers convert images to sequences by slicing them into patches.
The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost.
We show that simply randomizing the patch size at training time leads to a single set of weights that performs well across a wide range of patch sizes.
arXiv Detail & Related papers (2022-12-15T18:18:38Z)
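The training recipe summarized above amounts to sampling a patch size per training step. The sketch below shows only that sampling step; `embed_fn` is a hypothetical stand-in for FlexiViT's patch embedding, whose weights are resized to match the sampled patch size (the detail that makes a single set of weights work across patch sizes).

```python
import random

import torch
import torch.nn.functional as F


def tokens_with_random_patch_size(image: torch.Tensor, embed_fn,
                                  patch_sizes=(8, 16, 32)):
    """Sample a patch size, patchify, and embed (illustrative sketch).

    image: (batch, channels, height, width); `embed_fn(patches, p)` stands in
    for a patch embedding that adapts to the sampled patch size p.
    """
    p = random.choice(patch_sizes)
    # Resize so the (square) image divides evenly into p x p patches.
    side = max(p, (image.shape[-1] // p) * p)
    image = F.interpolate(image, size=(side, side), mode="bilinear",
                          align_corners=False)
    patches = image.unfold(2, p, p).unfold(3, p, p)   # (B, C, H/p, W/p, p, p)
    patches = patches.flatten(2, 3).flatten(-2)       # (B, C, N, p*p)
    return embed_fn(patches, p)
```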
- Iterative Patch Selection for High-Resolution Image Recognition [10.847032625429717]
We propose a simple method, Iterative Patch Selection (IPS), which decouples the memory usage from the input size.
IPS achieves this by selecting only the most salient patches, which are then aggregated into a global representation for image recognition.
Our method demonstrates strong performance and has wide applicability across different domains, training regimes and image sizes while using minimal accelerator memory.
arXiv Detail & Related papers (2022-10-24T07:55:57Z)
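A much-simplified sketch of the selection-then-aggregation pattern described above: patches are scored under `torch.no_grad()` so memory does not grow with the number of patches, and only the top-M selected patches take part in the differentiable pass. `score_fn` and `aggregate_fn` are hypothetical stand-ins, not the paper's exact modules.

```python
import torch


@torch.no_grad()
def select_salient_patches(patches: torch.Tensor, score_fn, m: int = 32) -> torch.Tensor:
    """Score every patch without tracking gradients and keep the M most salient.

    patches: (num_patches, channels, height, width)
    """
    scores = score_fn(patches)            # (num_patches,)
    top = scores.topk(min(m, scores.numel())).indices
    return patches[top]


def ips_style_forward(patches, score_fn, aggregate_fn, classifier, m: int = 32):
    # Memory-light selection pass, then a normal differentiable pass on M patches.
    selected = select_salient_patches(patches, score_fn, m)
    global_repr = aggregate_fn(selected)  # e.g. attention pooling to one vector
    return classifier(global_repr)
```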
- Swin Transformer V2: Scaling Up Capacity and Resolution [45.462916348268664]
We present techniques for scaling Swin Transformer up to 3 billion parameters and making it capable of training with images of up to 1,536×1,536 resolution.
By scaling up capacity and resolution, Swin Transformer sets new records on four representative vision benchmarks.
arXiv Detail & Related papers (2021-11-18T18:59:33Z)
- Restormer: Efficient Transformer for High-Resolution Image Restoration [118.9617735769827]
Convolutional neural networks (CNNs) perform well at learning generalizable image priors from large-scale data.
Transformers have shown significant performance gains on natural language and high-level vision tasks.
Our model, named Restoration Transformer (Restormer), achieves state-of-the-art results on several image restoration tasks.
arXiv Detail & Related papers (2021-11-18T18:59:10Z)
- Patch Slimming for Efficient Vision Transformers [107.21146699082819]
We study the efficiency problem of vision transformers by identifying redundant computation in given networks.
We present a novel patch slimming approach that discards useless patches in a top-down paradigm.
Experimental results on benchmark datasets demonstrate that the proposed method can significantly reduce the computational costs of vision transformers.
arXiv Detail & Related papers (2021-06-05T09:46:00Z)
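In contrast to the random dropping in PatchDropout, patch-slimming-style methods keep a scored subset of tokens per layer. A generic pruning helper makes the difference explicit; the mask here is assumed to be given (in the paper it is derived top-down, starting from the last layer).

```python
import torch


def prune_tokens(tokens: torch.Tensor, keep_mask: torch.Tensor) -> torch.Tensor:
    """Keep only the tokens marked as useful for a given layer.

    tokens:    (batch, num_tokens, dim)
    keep_mask: (num_tokens,) boolean mask, decided per layer by the method
               (an input here, unlike the random choice in PatchDropout).
    """
    return tokens[:, keep_mask, :]
```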
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.