Peeling the Onion: Hierarchical Reduction of Data Redundancy for
Efficient Vision Transformer Training
- URL: http://arxiv.org/abs/2211.10801v1
- Date: Sat, 19 Nov 2022 21:15:47 GMT
- Title: Peeling the Onion: Hierarchical Reduction of Data Redundancy for
Efficient Vision Transformer Training
- Authors: Zhenglun Kong, Haoyu Ma, Geng Yuan, Mengshu Sun, Yanyue Xie, Peiyan
Dong, Xin Meng, Xuan Shen, Hao Tang, Minghai Qin, Tianlong Chen, Xiaolong Ma,
Xiaohui Xie, Zhangyang Wang, Yanzhi Wang
- Abstract summary: Vision transformers (ViTs) have recently obtained success in many applications, but their intensive computation and heavy memory usage limit their generalization.
Previous compression algorithms usually start from the pre-trained dense models and only focus on efficient inference.
This paper proposes an end-to-end efficient training framework from three sparse perspectives, dubbed Tri-Level E-ViT.
- Score: 110.79400526706081
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision transformers (ViTs) have recently obtained success in many
applications, but their intensive computation and heavy memory usage at both
training and inference time limit their generalization. Previous compression
algorithms usually start from the pre-trained dense models and only focus on
efficient inference, while time-consuming training is still unavoidable. In
contrast, this paper points out that the million-scale training data is
redundant, which is the fundamental reason for the tedious training. To address
the issue, this paper aims to introduce sparsity into data and proposes an
end-to-end efficient training framework from three sparse perspectives, dubbed
Tri-Level E-ViT. Specifically, we leverage a hierarchical data redundancy
reduction scheme, by exploring the sparsity under three levels: number of
training examples in the dataset, number of patches (tokens) in each example,
and number of connections between tokens that lie in attention weights. With
extensive experiments, we demonstrate that our proposed technique can
noticeably accelerate training for various ViT architectures while maintaining
accuracy. Remarkably, under certain ratios, we are able to improve the ViT
accuracy rather than compromising it. For example, we can achieve 15.2% speedup
with 72.6% (+0.4) Top-1 accuracy on DeiT-T, and 15.7% speedup with 79.9% (+0.1)
Top-1 accuracy on DeiT-S. This proves the existence of data redundancy in ViT.
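
The three sparsity levels above can be pictured concretely: drop a fraction of training examples each epoch, prune low-importance patch tokens within each example, and keep only the strongest attention connections between the remaining tokens. The PyTorch sketch below is a minimal illustration of that idea under assumed keep ratios and simple importance heuristics (token norm, top-k attention scores); it is not the paper's exact algorithm.

```python
import torch

# Illustrative keep ratios for the three levels (assumed values, not the paper's).
EXAMPLE_KEEP, TOKEN_KEEP, ATTN_KEEP = 0.9, 0.8, 0.7

def sample_examples(num_examples: int, keep: float = EXAMPLE_KEEP) -> torch.Tensor:
    """Level 1: keep a random subset of training examples for this epoch."""
    kept = int(num_examples * keep)
    return torch.randperm(num_examples)[:kept]

def prune_tokens(tokens: torch.Tensor, keep: float = TOKEN_KEEP) -> torch.Tensor:
    """Level 2: drop low-magnitude patch tokens (a simple importance proxy).

    tokens: (batch, num_tokens, dim) patch embeddings, CLS token excluded.
    """
    scores = tokens.norm(dim=-1)                           # (batch, num_tokens)
    kept = int(tokens.shape[1] * keep)
    idx = scores.topk(kept, dim=1).indices                 # (batch, kept)
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
    return tokens.gather(1, idx)                           # (batch, kept, dim)

def sparse_attention(q, k, v, keep: float = ATTN_KEEP) -> torch.Tensor:
    """Level 3: keep only the strongest token-to-token connections per query."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5  # (batch, Nq, Nk)
    kept = max(1, int(scores.shape[-1] * keep))
    thresh = scores.topk(kept, dim=-1).values[..., -1:]    # kept-th largest score
    scores = scores.masked_fill(scores < thresh, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

if __name__ == "__main__":
    x = torch.randn(4, 196, 192)      # DeiT-T-sized input: 196 patches, dim 192
    x = prune_tokens(x)               # -> (4, 156, 192) with TOKEN_KEEP = 0.8
    out = sparse_attention(x, x, x)   # attention over the surviving tokens only
    print(sample_examples(1_000)[:5], x.shape, out.shape)
```

In the abstract, speedups such as 15.2% on DeiT-T come from choosing such ratios so that accuracy is preserved or even slightly improved.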
Related papers
- FasterDiT: Towards Faster Diffusion Transformers Training without Architecture Modification [35.105593013654]
Diffusion Transformers (DiT) suffer from a slow convergence rate.
We aim to accelerate DiT training without any architectural modification.
We propose FasterDiT, an exceedingly simple and practicable design strategy.
arXiv Detail & Related papers (2024-10-14T10:17:24Z)
Effective pruning of web-scale datasets based on complexity of concept
clusters [48.125618324485195]
We present a method for pruning large-scale multimodal datasets for training CLIP-style models on ImageNet.
We find that training on a smaller set of high-quality data can lead to higher performance with significantly lower training costs.
We achieve a new state-of-the-art ImageNet zero-shot accuracy and a competitive average zero-shot accuracy on 38 evaluation tasks.
arXiv Detail & Related papers (2024-01-09T14:32:24Z)
Efficient Grammatical Error Correction Via Multi-Task Training and
Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
Data-Efficient Augmentation for Training Neural Networks [15.870155099135538]
We propose a rigorous technique to select subsets of data points that when augmented, closely capture the training dynamics of full data augmentation.
Our method achieves 6.3x speedup on CIFAR10 and 2.2x speedup on SVHN, and outperforms the baselines by up to 10% across various subset sizes.
arXiv Detail & Related papers (2022-10-15T19:32:20Z)
Quantized Training of Gradient Boosting Decision Trees [84.97123593657584]
We propose to quantize all the high-precision gradients in a very simple yet effective way in the GBDT's training algorithm.
With low-precision gradients, most arithmetic operations in GBDT training can be replaced by integer operations of 8, 16, or 32 bits.
We observe up to 2× speedup from our simple quantization strategy compared with SOTA GBDT systems on extensive datasets; an illustrative quantization sketch appears after this list.
arXiv Detail & Related papers (2022-07-20T06:27:06Z)
Adversarial Unlearning: Reducing Confidence Along Adversarial Directions [88.46039795134993]
We propose a complementary regularization strategy that reduces confidence on self-generated examples.
The method, which we call RCAD, aims to reduce confidence on out-of-distribution examples lying along directions adversarially chosen to increase training loss.
Despite its simplicity, we find on many classification benchmarks that RCAD can be added to existing techniques to increase test accuracy by 1-3% in absolute value; a sketch of this regularizer appears after this list.
arXiv Detail & Related papers (2022-06-03T02:26:24Z)
Knowledge Distillation as Efficient Pre-training: Faster Convergence,
Higher Data-efficiency, and Better Transferability [53.27240222619834]
Knowledge Distillation as Efficient Pre-training aims to efficiently transfer the learned feature representation from pre-trained models to new student models for future downstream tasks.
Our method performs comparably with supervised pre-training counterparts in 3 downstream tasks and 9 downstream datasets requiring 10x less data and 5x less pre-training time.
arXiv Detail & Related papers (2022-03-10T06:23:41Z)
Chasing Sparsity in Vision Transformers: An End-to-End Exploration [127.10054032751714]
Vision transformers (ViTs) have recently received explosive popularity, but their enormous model sizes and training costs remain daunting.
This paper aims to trim down both the training memory overhead and the inference complexity, without sacrificing the achievable accuracy.
Specifically, instead of training full ViTs, we dynamically extract and train sparse subnetworks, while sticking to a fixed small parameter budget; a prune-and-grow sketch appears after this list.
arXiv Detail & Related papers (2021-06-08T17:18:00Z)
Compression-aware Continual Learning using Singular Value Decomposition [2.4283778735260686]
We propose a compression based continual task learning method that can dynamically grow a neural network.
Inspired by the recent model compression techniques, we employ compression-aware training and perform low-rank weight approximations.
Our method achieves compressed representations with minimal performance degradation without the need for costly fine-tuning; a low-rank SVD sketch appears after this list.
arXiv Detail & Related papers (2020-09-03T23:29:50Z)
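
For the "Quantized Training of Gradient Boosting Decision Trees" entry above, the core idea is replacing high-precision gradients with low-bit integers so that histogram construction can run on integer arithmetic. The NumPy sketch below shows unbiased stochastic rounding onto an 8-bit signed grid; the function name, bit width, and scaling rule are illustrative assumptions rather than that paper's exact scheme.

```python
import numpy as np

def quantize_gradients(grad: np.ndarray, bits: int = 8,
                       rng: np.random.Generator = np.random.default_rng(0)):
    """Stochastically round float gradients onto a low-bit signed integer grid.

    Returns integer codes plus the scale needed to dequantize; histogram
    accumulation during tree building can then use integer arithmetic.
    """
    qmax = 2 ** (bits - 1) - 1                  # e.g. 127 for 8-bit signed
    scale = np.abs(grad).max() / qmax + 1e-12   # map the gradient range onto the grid
    scaled = grad / scale
    floor = np.floor(scaled)
    # Round up with probability equal to the fractional part (unbiased rounding).
    codes = floor + (rng.random(grad.shape) < (scaled - floor))
    return codes.astype(np.int8 if bits <= 8 else np.int32), scale

g = np.random.default_rng(1).normal(size=100_000).astype(np.float32)
q, s = quantize_gradients(g)
print(q.dtype, float(np.abs(q * s - g).mean()))  # small mean dequantization error
```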
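
For the "Adversarial Unlearning" (RCAD) entry above, the regularizer lowers confidence on self-generated examples that lie along loss-increasing directions. The simplified PyTorch sketch below adds an entropy-maximizing penalty on a one-step sign perturbation; the step size, penalty weight, and single-step perturbation are assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def rcad_style_loss(model, x, y, step: float = 1.0, alpha: float = 0.1):
    """Cross-entropy plus a penalty that lowers confidence on inputs moved
    along a loss-increasing (adversarial) direction.

    step and alpha are illustrative hyperparameters, not the paper's values.
    """
    x = x.requires_grad_(True)
    ce = F.cross_entropy(model(x), y)
    (grad,) = torch.autograd.grad(ce, x, retain_graph=True)
    x_adv = (x + step * grad.sign()).detach()    # one step along the adversarial direction
    probs = F.softmax(model(x_adv), dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean()
    return ce - alpha * entropy                  # maximizing entropy lowers confidence

# Toy usage with a linear "model" on random data.
model = torch.nn.Linear(16, 4)
x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
loss = rcad_style_loss(model, x, y)
loss.backward()
print(float(loss))
```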
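
For the "Chasing Sparsity in Vision Transformers" entry above, training sparse subnetworks under a fixed parameter budget is commonly done with periodic prune-and-grow mask updates. The sketch below shows a generic magnitude-prune / random-grow update on a single weight matrix; it illustrates the dynamic-sparse-training pattern in general, not that paper's specific procedure.

```python
import torch

def update_mask(weight: torch.Tensor, mask: torch.Tensor, swap_frac: float = 0.1) -> torch.Tensor:
    """Prune-and-grow mask update that keeps the number of active weights fixed.

    Drops the smallest-magnitude active weights and regrows the same number of
    currently inactive ones at random (a simplification; some methods grow by
    gradient magnitude instead of randomly).
    """
    active = mask.bool()
    n_swap = max(1, int(swap_frac * active.sum().item()))

    # Prune: smallest-magnitude weights among the active set.
    scores = weight.abs().masked_fill(~active, float("inf")).flatten()
    drop = scores.topk(n_swap, largest=False).indices

    # Grow: random currently inactive positions.
    inactive = (~active).flatten().nonzero().squeeze(1)
    grow = inactive[torch.randperm(inactive.numel())[:n_swap]]

    new_mask = mask.flatten().clone()
    new_mask[drop] = 0.0
    new_mask[grow] = 1.0
    return new_mask.view_as(mask)

w = torch.randn(64, 64)
mask = (torch.rand_like(w) < 0.3).float()   # ~30% parameter budget
updated = update_mask(w, mask)
print(int(mask.sum()), int(updated.sum()))  # the active-weight count is unchanged
```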
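
For the "Compression-aware Continual Learning using Singular Value Decomposition" entry above, the low-rank weight approximation step can be pictured as a truncated SVD of each layer's weight matrix; the energy threshold used below to pick the rank is an assumed heuristic, not the paper's criterion.

```python
import torch

def low_rank_factors(weight: torch.Tensor, energy: float = 0.95):
    """Truncated SVD of a weight matrix, keeping enough singular values to
    retain a given fraction of the squared spectral energy (assumed threshold)."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    cum = torch.cumsum(S ** 2, dim=0) / (S ** 2).sum()
    rank = int((cum < energy).sum().item()) + 1
    A = U[:, :rank] * S[:rank]   # (out_features, rank)
    B = Vh[:rank]                # (rank, in_features)
    return A, B

W = torch.randn(256, 512)
A, B = low_rank_factors(W)
print(A.shape[1], float((W - A @ B).norm() / W.norm()))  # kept rank and relative error
```

Storing the two factors instead of the full matrix saves memory whenever rank × (out + in) is smaller than out × in.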
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.