PatchDropout: Economizing Vision Transformers Using Patch Dropout
- URL: http://arxiv.org/abs/2208.07220v1
- Date: Wed, 10 Aug 2022 14:08:55 GMT
- Title: PatchDropout: Economizing Vision Transformers Using Patch Dropout
- Authors: Yue Liu, Christos Matsoukas, Fredrik Strand, Hossein Azizpour, Kevin Smith
- Abstract summary: We show that standard ViT models can be efficiently trained at high resolution by randomly dropping input image patches.
We observe a 5 times savings in computation and memory using PatchDropout, along with a boost in performance.
- Score: 9.243684409949436
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision transformers have demonstrated the potential to outperform CNNs in a
variety of vision tasks. But the computational and memory requirements of these
models prohibit their use in many applications, especially those that depend on
high-resolution images, such as medical image classification. Efforts to train
ViTs more efficiently are overly complicated, necessitating architectural
changes or intricate training schemes. In this work, we show that standard ViT
models can be efficiently trained at high resolution by randomly dropping input
image patches. This simple approach, PatchDropout, reduces FLOPs and memory by
at least 50% in standard natural image datasets such as ImageNet, and those
savings only increase with image size. On CSAW, a high-resolution medical
dataset, we observe a 5 times savings in computation and memory using
PatchDropout, along with a boost in performance. For practitioners with a fixed
computational or memory budget, PatchDropout makes it possible to choose image
resolution, hyperparameters, or model size to get the most performance out of
their model.
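To make the core idea concrete, here is a minimal PyTorch-style sketch of random patch dropout. The function name, keep ratio, and placement (after positional embeddings, before the encoder) are illustrative assumptions, not the paper's exact implementation.

```python
import torch


def patch_dropout(tokens: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Randomly keep a subset of patch tokens (illustrative sketch).

    tokens: (batch, num_patches, dim) patch embeddings, assumed to already
    include positional information and to exclude any CLS token.
    Returns (batch, num_kept, dim); running the transformer encoder on the
    shorter sequence is where the FLOP and memory savings come from.
    """
    batch, num_patches, dim = tokens.shape
    num_keep = max(1, int(num_patches * keep_ratio))

    # Sample an independent random permutation of patches for each image
    # and keep the first `num_keep` positions.
    noise = torch.rand(batch, num_patches, device=tokens.device)
    keep_idx = noise.argsort(dim=1)[:, :num_keep]

    return tokens.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, dim))
```

In this reading, the dropout would be applied only during training; see the paper for the keep rates actually used and how evaluation is handled.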
Related papers
- ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models [77.59651787115546]
High-resolution Large Multimodal Models (LMMs) encounter the challenges of excessive visual tokens and quadratic visual complexity.
We propose ConvLLaVA, which employs ConvNeXt, a hierarchical backbone, as the visual encoder of LMM.
ConvLLaVA compresses high-resolution images into information-rich visual features, effectively preventing the generation of excessive visual tokens.
arXiv Detail & Related papers (2024-05-24T17:34:15Z)
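The "excessive visual tokens" problem can be illustrated with simple token-count arithmetic; the strides below are generic assumptions (a /14 ViT patchifier versus a hierarchical backbone with total stride 32), not numbers taken from the ConvLLaVA paper.

```python
def num_visual_tokens(height: int, width: int, stride: int) -> int:
    """Token count when each visual token covers a stride x stride region."""
    return (height // stride) * (width // stride)


# Illustrative comparison at a 1536 x 1536 input:
vit_tokens = num_visual_tokens(1536, 1536, stride=14)    # ViT-style patchify -> 11881 tokens
conv_tokens = num_visual_tokens(1536, 1536, stride=32)   # hierarchical backbone -> 2304 tokens
print(vit_tokens, conv_tokens)
```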
- MeLo: Low-rank Adaptation is Better than Fine-tuning for Medical Image Diagnosis [63.59184480010552]
Vision Transformers (ViT) have become much larger and less accessible to medical imaging communities.
MeLo (Medical image Low-rank adaptation) adopts low-rank adaptation instead of resource-demanding fine-tuning.
Our proposed method achieves comparable performance to fully fine-tuned ViT models on four distinct medical imaging datasets.
arXiv Detail & Related papers (2023-11-14T15:18:54Z)
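MeLo builds on low-rank adaptation (LoRA). The sketch below shows the generic LoRA pattern of a frozen linear layer plus a trainable low-rank update; the rank, scaling, and placement inside the ViT are assumptions here, not details from the MeLo paper.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # pretrained weights stay frozen
            p.requires_grad_(False)
        self.down = nn.Linear(base.in_features, rank, bias=False)   # A
        self.up = nn.Linear(rank, base.out_features, bias=False)    # B
        nn.init.zeros_(self.up.weight)        # adapter starts as an identity update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))
```

Only `down` and `up` receive gradients, which is what keeps the number of trainable parameters and the stored adapter checkpoint small.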
- Reversible Vision Transformers [74.3500977090597]
Reversible Vision Transformers are a memory-efficient architecture for visual recognition.
We adapt two popular models, namely Vision Transformer and Multiscale Vision Transformers, to reversible variants.
We find that the additional computational burden of recomputing activations is more than overcome for deeper models.
arXiv Detail & Related papers (2023-02-09T18:59:54Z)
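The memory saving in reversible architectures comes from reconstructing a block's inputs from its outputs during the backward pass instead of caching activations. Below is a generic two-stream additive coupling in that spirit; it is not the exact Reversible ViT formulation.

```python
import torch.nn as nn


class ReversibleBlock(nn.Module):
    """Two-stream additive coupling: inputs are recoverable from outputs,
    so intermediate activations need not be stored for backpropagation."""

    def __init__(self, f: nn.Module, g: nn.Module):
        super().__init__()
        self.f = f   # e.g. an attention sub-block
        self.g = g   # e.g. an MLP sub-block

    def forward(self, x1, x2):
        y1 = x1 + self.f(x2)
        y2 = x2 + self.g(y1)
        return y1, y2

    def inverse(self, y1, y2):
        # Used during the backward pass to recompute the inputs on the fly.
        x2 = y2 - self.g(y1)
        x1 = y1 - self.f(x2)
        return x1, x2
```

The recomputation in `inverse` is the "additional computational burden" mentioned above; the finding is that for deeper models the memory savings outweigh it.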
- Patch Gradient Descent: Training Neural Networks on Very Large Images [13.969180905165533]
We propose Patch Gradient Descent (PatchGD) to train existing CNN architectures on large-scale images.
PatchGD is based on the hypothesis that instead of performing gradient-based updates on an entire image at once, it should be possible to achieve a good solution by performing model updates on only small parts of the image.
Our evaluation shows that PatchGD is much more stable and efficient than the standard gradient-descent method in handling large images.
arXiv Detail & Related papers (2023-01-31T18:04:35Z)
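As a rough illustration of that hypothesis, the sketch below updates a model from randomly sampled crops of one very large image rather than from the full image. This is a simplification: the actual PatchGD method additionally maintains a running encoding of the whole image across patch-level steps.

```python
import torch


def patchwise_updates(model, optimizer, loss_fn, image, label,
                      crop_size: int = 512, num_steps: int = 4):
    """Illustrative only: gradient steps computed on crops of one large image."""
    _, _, height, width = image.shape
    for _ in range(num_steps):
        top = torch.randint(0, height - crop_size + 1, (1,)).item()
        left = torch.randint(0, width - crop_size + 1, (1,)).item()
        crop = image[:, :, top:top + crop_size, left:left + crop_size]

        optimizer.zero_grad()
        loss = loss_fn(model(crop), label)   # only the crop is held in memory
        loss.backward()
        optimizer.step()
```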
- FlexiViT: One Model for All Patch Sizes [100.52574011880571]
Vision Transformers convert images to sequences by slicing them into patches.
The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost.
We show that simply randomizing the patch size at training time leads to a single set of weights that performs well across a wide range of patch sizes.
arXiv Detail & Related papers (2022-12-15T18:18:38Z)
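The training recipe summarized above amounts to sampling a patch size per training step. The sketch below shows only that sampling step; `embed_fn` is a hypothetical stand-in for FlexiViT's patch embedding, whose weights are resized to match the sampled patch size (the detail that makes a single set of weights work across patch sizes).

```python
import random

import torch
import torch.nn.functional as F


def tokens_with_random_patch_size(image: torch.Tensor, embed_fn,
                                  patch_sizes=(8, 16, 32)):
    """Sample a patch size, patchify, and embed (illustrative sketch).

    image: (batch, channels, height, width); `embed_fn(patches, p)` stands in
    for a patch embedding that adapts to the sampled patch size p.
    """
    p = random.choice(patch_sizes)
    # Resize so the (square) image divides evenly into p x p patches.
    side = max(p, (image.shape[-1] // p) * p)
    image = F.interpolate(image, size=(side, side), mode="bilinear",
                          align_corners=False)
    patches = image.unfold(2, p, p).unfold(3, p, p)   # (B, C, H/p, W/p, p, p)
    patches = patches.flatten(2, 3).flatten(-2)       # (B, C, N, p*p)
    return embed_fn(patches, p)
```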
- Iterative Patch Selection for High-Resolution Image Recognition [10.847032625429717]
We propose a simple method, Iterative Patch Selection (IPS), which decouples the memory usage from the input size.
IPS achieves this by selecting only the most salient patches, which are then aggregated into a global representation for image recognition.
Our method demonstrates strong performance and has wide applicability across different domains, training regimes and image sizes while using minimal accelerator memory.
arXiv Detail & Related papers (2022-10-24T07:55:57Z)
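A much-simplified sketch of the selection-then-aggregation pattern described above: patches are scored under `torch.no_grad()` so memory does not grow with the number of patches, and only the top-M selected patches take part in the differentiable pass. `score_fn` and `aggregate_fn` are hypothetical stand-ins, not the paper's exact modules.

```python
import torch


@torch.no_grad()
def select_salient_patches(patches: torch.Tensor, score_fn, m: int = 32) -> torch.Tensor:
    """Score every patch without tracking gradients and keep the M most salient.

    patches: (num_patches, channels, height, width)
    """
    scores = score_fn(patches)            # (num_patches,)
    top = scores.topk(min(m, scores.numel())).indices
    return patches[top]


def ips_style_forward(patches, score_fn, aggregate_fn, classifier, m: int = 32):
    # Memory-light selection pass, then a normal differentiable pass on M patches.
    selected = select_salient_patches(patches, score_fn, m)
    global_repr = aggregate_fn(selected)  # e.g. attention pooling to one vector
    return classifier(global_repr)
```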
- Swin Transformer V2: Scaling Up Capacity and Resolution [45.462916348268664]
We present techniques for scaling Swin Transformer up to 3 billion parameters and making it capable of training with images of up to 1,536×1,536 resolution.
By scaling up capacity and resolution, Swin Transformer sets new records on four representative vision benchmarks.
arXiv Detail & Related papers (2021-11-18T18:59:33Z)
- Restormer: Efficient Transformer for High-Resolution Image Restoration [118.9617735769827]
Convolutional neural networks (CNNs) perform well at learning generalizable image priors from large-scale data.
Transformers have shown significant performance gains on natural language and high-level vision tasks.
Our model, named Restoration Transformer (Restormer), achieves state-of-the-art results on several image restoration tasks.
arXiv Detail & Related papers (2021-11-18T18:59:10Z)
- Patch Slimming for Efficient Vision Transformers [107.21146699082819]
We study the efficiency problem of vision transformers by identifying redundant computation in given networks.
We present a novel patch slimming approach that discards useless patches in a top-down paradigm.
Experimental results on benchmark datasets demonstrate that the proposed method can significantly reduce the computational costs of vision transformers.
arXiv Detail & Related papers (2021-06-05T09:46:00Z)
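In contrast to the random dropping in PatchDropout, patch-slimming-style methods keep a scored subset of tokens per layer. A generic pruning helper makes the difference explicit; the mask here is assumed to be given (in the paper it is derived top-down, starting from the last layer).

```python
import torch


def prune_tokens(tokens: torch.Tensor, keep_mask: torch.Tensor) -> torch.Tensor:
    """Keep only the tokens marked as useful for a given layer.

    tokens:    (batch, num_tokens, dim)
    keep_mask: (num_tokens,) boolean mask, decided per layer by the method
               (an input here, unlike the random choice in PatchDropout).
    """
    return tokens[:, keep_mask, :]
```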
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.