Accelerating Vision Transformer Training via a Patch Sampling Schedule
- URL: http://arxiv.org/abs/2208.09520v1
- Date: Fri, 19 Aug 2022 19:16:46 GMT
- Title: Accelerating Vision Transformer Training via a Patch Sampling Schedule
- Authors: Bradley McDanel, Chi Phuong Huynh
- Abstract summary: We introduce the notion of a Patch Sampling Schedule (PSS).
A PSS varies the number of Vision Transformer (ViT) patches used per batch during training.
We observe that training with a PSS makes a ViT more robust to a wider patch sampling range during inference.
- Score: 0.685316573653194
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce the notion of a Patch Sampling Schedule (PSS), which varies the
number of Vision Transformer (ViT) patches used per batch during training.
Since not all patches are equally important for most vision objectives (e.g.,
classification), we argue that less important patches can be used in fewer
training iterations, leading to shorter training time with minimal impact on
performance. Additionally, we observe that training with a PSS makes a ViT more
robust to a wider patch sampling range during inference. This allows for a
fine-grained, dynamic trade-off between throughput and accuracy during
inference. We evaluate using PSSs on ViTs for ImageNet both trained from
scratch and pre-trained using a reconstruction loss function. For the
pre-trained model, we achieve a 0.26% reduction in classification accuracy for
a 31% reduction in training time (from 25 to 17 hours) compared to using all
patches each iteration. Code, model checkpoints and logs are available at
https://github.com/BradMcDanel/pss.
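To make the schedule concrete, the sketch below shows one way a PSS could be wired into a training loop: a schedule maps training progress to the fraction of patch tokens kept, and each batch then keeps only a random subset of patches before the transformer blocks. This is a minimal illustrative sketch, not the authors' implementation (see the repository above for that); the linear ramp, the uniform-random patch selection, and the helper names `patch_embed` and `vit_blocks_and_head` are assumptions.

```python
# Minimal PSS sketch (illustrative only; not the authors' implementation).
import torch

def kept_fraction(epoch: int, total_epochs: int,
                  start: float = 0.5, end: float = 1.0) -> float:
    """Fraction of patch tokens to keep at a given epoch (assumed linear ramp)."""
    t = min(epoch / max(total_epochs - 1, 1), 1.0)
    return start + (end - start) * t

def sample_patches(patch_tokens: torch.Tensor, keep_frac: float) -> torch.Tensor:
    """Keep a random subset of patch tokens for this batch.

    patch_tokens: (B, N, D) patch embeddings, class token excluded.
    Returns (B, K, D) with K ~= keep_frac * N.
    """
    B, N, D = patch_tokens.shape
    k = max(1, int(round(keep_frac * N)))
    # Independent random permutation of patch indices per example.
    idx = torch.rand(B, N, device=patch_tokens.device).argsort(dim=1)[:, :k]
    return torch.gather(patch_tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D))

# Training-loop usage (sketch; `patch_embed` and `vit_blocks_and_head` are
# hypothetical stand-ins for the model's patch embedding and encoder/head):
#   frac = kept_fraction(epoch, num_epochs)
#   tokens = sample_patches(patch_embed(images), frac)  # fewer tokens -> faster step
#   logits = vit_blocks_and_head(tokens)
```

The inference-time throughput/accuracy trade-off described in the abstract corresponds to running the same sampling step with a fixed keep fraction; how the released code exposes this knob is not reproduced here.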
Related papers
- Patch-Level Training for Large Language Models [69.67438563485887]
This paper introduces patch-level training for Large Language Models (LLMs).
During patch-level training, we feed the language model shorter sequences of patches and train it to predict the next patch.
Following this, the model continues token-level training on the remaining training data to align with the inference mode.
arXiv Detail & Related papers (2024-07-17T15:48:39Z)
- PaPr: Training-Free One-Step Patch Pruning with Lightweight ConvNets for Faster Inference [11.112356346406365]
PaPr is a method for substantially pruning redundant patches with minimal accuracy loss using lightweight ConvNets.
It achieves significantly higher accuracy over state-of-the-art patch reduction methods with similar FLOP count reduction.
arXiv Detail & Related papers (2024-03-24T05:50:00Z)
- FlexiViT: One Model for All Patch Sizes [100.52574011880571]
Vision Transformers convert images to sequences by slicing them into patches.
The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost.
We show that simply randomizing the patch size at training time leads to a single set of weights that performs well across a wide range of patch sizes (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2022-12-15T18:18:38Z)
- Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training [110.79400526706081]
Vision transformers (ViTs) have recently obtained success in many applications, but their intensive computation and heavy memory usage limit their generalization.
Previous compression algorithms usually start from the pre-trained dense models and only focus on efficient inference.
This paper proposes an end-to-end efficient training framework from three sparse perspectives, dubbed Tri-Level E-ViT.
arXiv Detail & Related papers (2022-11-19T21:15:47Z)
- CowClip: Reducing CTR Prediction Model Training Time from 12 hours to 10 minutes on 1 GPU [14.764217935910988]
The click-through rate (CTR) prediction task is to predict whether a user will click on a recommended item.
One approach to increase the training speed is to apply large batch training.
We develop the adaptive Column-wise Clipping (CowClip) to stabilize the training process in a large batch size setting.
arXiv Detail & Related papers (2022-04-13T08:17:15Z)
- PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers [102.7922200135147]
This paper explores a better codebook for BERT pre-training of vision transformers.
In contrast to existing visual tokens, the discrete tokens in the NLP field are naturally highly semantic.
We demonstrate that the visual tokens generated by the proposed perceptual codebook do exhibit better semantic meanings.
arXiv Detail & Related papers (2021-11-24T18:59:58Z)
- Knowing When to Quit: Selective Cascaded Regression with Patch Attention for Real-Time Face Alignment [0.0]
We show that frontal faces with neutral expressions converge faster than faces with extreme poses or expressions.
We offer a multi-scale, patch-based, lightweight feature extractor with a fine-grained local patch attention module.
Our model runs in real-time on a mobile device GPU, with 95 Mega Multiply-Add (MMA) operations, outperforming all state-of-the-art methods under 1000 MMA.
arXiv Detail & Related papers (2021-08-01T06:51:47Z)
- Patch Slimming for Efficient Vision Transformers [107.21146699082819]
We study the efficiency problem of vision transformers by identifying redundant computation in given networks.
We present a novel patch slimming approach that discards useless patches in a top-down paradigm.
Experimental results on benchmark datasets demonstrate that the proposed method can significantly reduce the computational costs of vision transformers.
arXiv Detail & Related papers (2021-06-05T09:46:00Z)
- Unsupervised Visual Representation Learning by Tracking Patches in Video [88.56860674483752]
We propose to use tracking as a proxy task for a computer vision system to learn visual representations.
Modelled on the Catch game played by children, we design a Catch-the-Patch (CtP) game for a 3D-CNN model to learn visual representations.
arXiv Detail & Related papers (2021-05-06T09:46:42Z)
- Update Frequently, Update Fast: Retraining Semantic Parsing Systems in a Fraction of Time [11.035461657669096]
We show that it is possible to match the performance of a model trained from scratch in less than 10% of the time via fine-tuning.
We demonstrate the effectiveness of our method on multiple splits of the Facebook TOP and SNIPS datasets.
arXiv Detail & Related papers (2020-10-15T16:37:41Z)
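The sketch below, referenced from the FlexiViT entry above, illustrates the general idea of training one set of weights with a randomized patch size. It is not the FlexiViT method itself: the class name `FlexiblePatchEmbed`, the bilinear resizing of the projection kernel, and the choice of patch sizes are assumptions made for illustration; the paper's actual weight-resizing scheme is not reproduced here.

```python
# Toy illustration of patch-size randomization with a single set of weights.
import torch
import torch.nn.functional as F

class FlexiblePatchEmbed(torch.nn.Module):
    """Toy patch embedding whose projection kernel is resized per patch size."""
    def __init__(self, base_patch: int = 16, dim: int = 384, in_chans: int = 3):
        super().__init__()
        # Base convolutional projection kernel: (dim, in_chans, p, p).
        self.weight = torch.nn.Parameter(
            torch.randn(dim, in_chans, base_patch, base_patch) * 0.02)
        self.bias = torch.nn.Parameter(torch.zeros(dim))

    def forward(self, images: torch.Tensor, patch_size: int) -> torch.Tensor:
        # Resize the kernel to the requested patch size (bilinear resize is an
        # assumption here, not the resizing used in the actual paper).
        w = F.interpolate(self.weight, size=(patch_size, patch_size),
                          mode="bilinear", align_corners=False)
        x = F.conv2d(images, w, self.bias, stride=patch_size)  # (B, dim, H/p, W/p)
        return x.flatten(2).transpose(1, 2)                    # (B, N, dim)

if __name__ == "__main__":
    embed = FlexiblePatchEmbed()
    images = torch.randn(2, 3, 224, 224)
    # During training, a patch size would be drawn at random per batch so that
    # one set of weights sees many different token counts.
    for patch_size in (8, 16, 32):
        print(patch_size, embed(images, patch_size).shape)
```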