Bridging The Gaps Between Token Pruning and Full Pre-training via Masked
Fine-tuning
- URL: http://arxiv.org/abs/2310.17177v1
- Date: Thu, 26 Oct 2023 06:03:18 GMT
- Title: Bridging The Gaps Between Token Pruning and Full Pre-training via Masked
Fine-tuning
- Authors: Fengyuan Shi, Limin Wang
- Abstract summary: Dynamic vision transformers accelerate inference by pruning redundant tokens.
Current base models usually adopt full image training, using full images as inputs and keeping the whole feature maps through the forward process.
Inspired by MAE, which performs a masking-and-reconstruction self-supervised task, we devise masked fine-tuning to bridge the gaps between pre-trained base models and token-pruning-based dynamic vision transformers.
- Score: 19.391064062033436
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the success of transformers on various computer vision tasks, they
suffer from excessive memory and computational cost. Some works present dynamic
vision transformers to accelerate inference by pruning redundant tokens. A key
to improving token pruning is using well-trained models as initialization for
faster convergence and better performance. However, current base models usually
adopt full image training, i.e., using full images as inputs and keeping the
whole feature maps through the forward process. This causes inconsistencies
with dynamic models that gradually reduce tokens, namely inconsistencies in
calculation pattern, information amount, and token selection strategy.
Inspired by MAE, which performs a masking-and-reconstruction self-supervised task,
we devise masked fine-tuning to bridge the gaps between the pre-trained base models
used for initialization and token-pruning-based dynamic vision transformers, by
masking image patches and predicting the image class label from the remaining
unmasked patches. Extensive experiments on ImageNet demonstrate that base
models trained via masked fine-tuning gain strong occlusion robustness and resilience
to information loss. With this better initialization, Dynamic ViT achieves
higher accuracies, especially under large token pruning ratios (e.g., 81.9% vs.
81.3%, and 62.3% vs. 58.9% for DeiT-based Dynamic ViT/0.8 and Dynamic ViT/0.3).
Moreover, we apply our method to different token-pruning-based dynamic vision
transformers, different pre-trained models, and randomly initialized models to
demonstrate the generalization ability.
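To make the idea concrete, below is a minimal PyTorch sketch of masked fine-tuning as the abstract describes it: randomly drop a fraction of patch tokens and predict the image class label from the tokens that remain. The module names, model dimensions, and keep ratio are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of masked fine-tuning as described in the abstract:
# randomly drop a fraction of patch tokens and predict the class label
# from the remaining (unmasked) tokens. Module names, sizes, and the
# keep_ratio value are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedFineTuneViT(nn.Module):
    def __init__(self, num_classes=1000, dim=384, depth=4, heads=6,
                 num_patches=196, keep_ratio=0.5):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):
        # Patchify and add positional embeddings.
        x = self.patch_embed(images).flatten(2).transpose(1, 2)  # (B, N, D)
        x = x + self.pos_embed[:, 1:, :]
        B, N, D = x.shape
        if self.training:
            # Randomly keep a subset of patch tokens (mask the rest out).
            num_keep = max(1, int(N * self.keep_ratio))
            idx = torch.rand(B, N, device=x.device).argsort(dim=1)[:, :num_keep]
            x = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, D))
        # Classify from the CLS token attending only to the kept tokens.
        cls = self.cls_token + self.pos_embed[:, :1, :]
        x = torch.cat([cls.expand(B, -1, -1), x], dim=1)
        x = self.encoder(x)
        return self.head(x[:, 0])


# Usage: a standard supervised fine-tuning step on masked inputs.
model = MaskedFineTuneViT()
images = torch.randn(2, 3, 224, 224)
labels = torch.randint(0, 1000, (2,))
loss = F.cross_entropy(model(images), labels)
loss.backward()
```

Because the classification loss only ever sees the surviving tokens, the fine-tuned weights become accustomed to incomplete token sets, which is the property the paper exploits when using such models to initialize token-pruning dynamic vision transformers.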
Related papers
- No Token Left Behind: Efficient Vision Transformer via Dynamic Token
Idling [55.203866875294516]
Vision Transformers (ViTs) have demonstrated outstanding performance in computer vision tasks.
Various token pruning techniques have been introduced to alleviate the high computational burden of ViTs.
We propose IdleViT, a dynamic token-idle-based method that achieves an excellent trade-off between performance and efficiency.
arXiv Detail & Related papers (2023-10-09T12:10:41Z)
- Centroid-centered Modeling for Efficient Vision Transformer Pre-training [44.24223088955106]
Masked Image Modeling (MIM) is a new self-supervised vision pre-training paradigm using a Vision Transformer (ViT).
Our proposed centroid-based approach, CCViT, leverages k-means clustering to obtain centroids for image modeling without supervised training of the tokenizer model.
Our approach achieves competitive results with recent baselines without external supervision and distillation training from other models.
arXiv Detail & Related papers (2023-03-08T15:34:57Z)
- DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation [56.514462874501675]
We propose a dynamic sparse attention based Transformer model to achieve fine-level matching with favorable efficiency.
The heart of our approach is a novel dynamic-attention unit that adapts to the variation in the optimal number of tokens each position should attend to.
Experiments on three applications, pose-guided person image generation, edge-based face synthesis, and undistorted image style transfer, demonstrate that DynaST achieves superior performance in local details.
arXiv Detail & Related papers (2022-07-13T11:12:03Z)
- Corrupted Image Modeling for Self-Supervised Visual Pre-Training [103.99311611776697]
We introduce Corrupted Image Modeling (CIM) for self-supervised visual pre-training.
CIM uses an auxiliary generator with a small trainable BEiT to corrupt the input image, instead of using artificial mask tokens, and trains an enhancer network on the corrupted image.
After pre-training, the enhancer can be used as a high-capacity visual encoder for downstream tasks.
arXiv Detail & Related papers (2022-02-07T17:59:04Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Coupling two core designs, an asymmetric encoder-decoder architecture and a high masking ratio, enables us to train large models efficiently and effectively.
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
- BEiT: BERT Pre-Training of Image Transformers [43.704968112586876]
We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation from Image Transformers.
Specifically, each image has two views in our pre-training, i.e., image patches and visual tokens.
We first "tokenize" the original image into visual tokens. Then we randomly mask some image patches and feed them into the backbone Transformer.
The pre-training objective is to recover the original visual tokens based on the corrupted image patches.
arXiv Detail & Related papers (2021-06-15T16:02:37Z)
- DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification [134.9393799043401]
We propose a dynamic token sparsification framework to prune redundant tokens based on the input.
By hierarchically pruning 66% of the input tokens, our method reduces FLOPs by 31%-37% and improves the throughput by over 40%.
DynamicViT models can achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet (a minimal sketch of this style of input-dependent token pruning follows this list).
arXiv Detail & Related papers (2021-06-03T17:57:41Z)
- When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations [111.44860506703307]
Vision Transformers (ViTs) and MLPs signal further efforts to replace hand-wired features or inductive biases with general-purpose neural architectures.
This paper investigates ViTs and MLP-Mixers through the lens of loss geometry, intending to improve the models' data efficiency at training and inference.
We show that the improved robustness is attributable to sparser active neurons in the first few layers.
The resultant ViTs outperform ResNets of similar size and throughput when trained from scratch on ImageNet without large-scale pretraining or strong data augmentations.
arXiv Detail & Related papers (2021-06-03T02:08:03Z)
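As noted in the DynamicViT entry above, the sketch below illustrates input-dependent token pruning in its simplest form: a small scoring head rates each patch token and only the top-scoring fraction is kept for the remaining layers. The scoring head, keep ratio, and hard top-k selection are illustrative simplifications; DynamicViT itself trains a lightweight prediction module end to end with differentiable sampling.

```python
# Minimal sketch of score-based token pruning: a linear head scores each
# patch token and only the top-k tokens survive. This is a simplification,
# not the DynamicViT authors' implementation.
import torch
import torch.nn as nn


class TokenPruner(nn.Module):
    """Keeps the highest-scoring fraction of patch tokens."""

    def __init__(self, dim=384, keep_ratio=0.7):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.score = nn.Linear(dim, 1)  # per-token keep score

    def forward(self, tokens):
        # tokens: (B, N, D) patch tokens (the CLS token is handled separately).
        B, N, D = tokens.shape
        num_keep = max(1, int(N * self.keep_ratio))
        scores = self.score(tokens).squeeze(-1)          # (B, N)
        keep_idx = scores.topk(num_keep, dim=1).indices  # (B, num_keep)
        keep_idx = keep_idx.unsqueeze(-1).expand(-1, -1, D)
        return torch.gather(tokens, 1, keep_idx)         # (B, num_keep, D)


# Usage: apply between transformer stages to prune tokens hierarchically.
tokens = torch.randn(2, 196, 384)
pruner = TokenPruner()
print(pruner(tokens).shape)  # torch.Size([2, 137, 384])
```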
This list is automatically generated from the titles and abstracts of the papers on this site.