A General and Efficient Training for Transformer via Token Expansion
- URL: http://arxiv.org/abs/2404.00672v1
- Date: Sun, 31 Mar 2024 12:44:24 GMT
- Title: A General and Efficient Training for Transformer via Token Expansion
- Authors: Wenxuan Huang, Yunhang Shen, Jiao Xie, Baochang Zhang, Gaoqi He, Ke Li, Xing Sun, Shaohui Lin
- Abstract summary: Vision Transformers (ViTs) typically require an extremely large training cost.
Existing methods have attempted to accelerate the training of ViTs, yet they typically disregard method universality and suffer accuracy drops.
We propose a novel token growth scheme Token Expansion (termed ToE) to achieve consistent training acceleration for ViTs.
- Score: 44.002355107931805
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The remarkable performance of Vision Transformers (ViTs) typically requires an extremely large training cost. Existing methods have attempted to accelerate the training of ViTs, yet typically disregard method universality and suffer accuracy drops. Meanwhile, they break the training consistency of the original transformers, including the consistency of hyper-parameters, architecture, and strategy, which prevents them from being widely applied to different Transformer networks. In this paper, we propose a novel token growth scheme, Token Expansion (termed ToE), to achieve consistent training acceleration for ViTs. We introduce an "initialization-expansion-merging" pipeline to maintain the integrity of the intermediate feature distribution of the original transformers, preventing the loss of crucial learnable information during training. ToE can not only be seamlessly integrated into the training and fine-tuning of transformers (e.g., DeiT and LV-ViT), but is also effective for efficient training frameworks (e.g., EfficientTrain), without altering the original training hyper-parameters or architecture, and without introducing additional training strategies. Extensive experiments demonstrate that ToE trains ViTs about 1.3x faster in a lossless manner, or even with performance gains over the full-token training baselines. Code is available at https://github.com/Osilly/TokenExpansion .
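The abstract only names the "initialization-expansion-merging" pipeline at a high level; the sketch below shows what a token-growth scheme of this general shape could look like in PyTorch. The function names, the linear growth schedule, the evenly spaced initial token selection, and the cosine-similarity merging rule are illustrative assumptions and not the authors' exact algorithm (see the linked repository for that).

```python
import torch
import torch.nn.functional as F


def keep_ratio_schedule(epoch: int, total_epochs: int, start: float = 0.25) -> float:
    """Token-growth schedule: begin with a fraction of the tokens and expand
    linearly back to the full token set over the first half of training
    (an assumed schedule, not the paper's exact one)."""
    grow_epochs = max(1, total_epochs // 2)
    return min(1.0, start + (1.0 - start) * epoch / grow_epochs)


def expand_and_merge(tokens: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Reduce a (B, N, D) batch of patch tokens in an
    "initialization-expansion-merging" style: keep an evenly spaced subset and
    fold every left-out token into its most similar kept token, so the reduced
    set still covers the original feature distribution. The uniform selection
    and cosine-similarity merge are illustrative assumptions."""
    bsz, n, dim = tokens.shape
    n_keep = max(1, int(n * keep_ratio))
    if n_keep >= n:
        return tokens

    keep_idx = torch.linspace(0, n - 1, steps=n_keep, device=tokens.device).long()
    kept = tokens[:, keep_idx, :]

    rest_mask = torch.ones(n, dtype=torch.bool, device=tokens.device)
    rest_mask[keep_idx] = False
    rest = tokens[:, rest_mask, :]

    # Assign each left-out token to its most similar kept token, then average it in.
    sim = F.normalize(rest, dim=-1) @ F.normalize(kept, dim=-1).transpose(1, 2)  # (B, N-K, K)
    nearest = sim.argmax(dim=-1)                                                  # (B, N-K)

    merged = kept.clone()
    merged.scatter_add_(1, nearest.unsqueeze(-1).expand(-1, -1, dim), rest)
    counts = torch.ones(bsz, n_keep, 1, device=tokens.device, dtype=tokens.dtype)
    counts.scatter_add_(1, nearest.unsqueeze(-1),
                        torch.ones_like(nearest, dtype=tokens.dtype).unsqueeze(-1))
    return merged / counts
```

In a sketch like this, the training loop would compute r = keep_ratio_schedule(epoch, total_epochs) and apply expand_and_merge to the patch embeddings before the transformer blocks; once r reaches 1.0 the model sees all tokens again and training proceeds exactly as the unmodified baseline, which is how a scheme of this kind can leave the original hyper-parameters and architecture untouched.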
Related papers
- Exploring the Benefit of Activation Sparsity in Pre-training [117.25661020250658]
We study how activation properties change during pre-training.
We propose Switchable Sparse-Dense Learning (SSD)
SSD achieves comparable performance with identical model size and reduces pre-training costs.
arXiv Detail & Related papers (2024-10-04T13:53:33Z)
- Efficient Stagewise Pretraining via Progressive Subnetworks [53.00045381931778]
The prevailing view suggests that stagewise dropping strategies, such as layer dropping, are ineffective when compared to stacking-based approaches.
This paper challenges this notion by demonstrating that, with proper design, dropping strategies can be competitive, if not better, than stacking methods.
We propose an instantiation of this framework - Random Part Training (RAPTR) - that selects and trains only a random subnetwork at each step, progressively increasing the size in stages.
arXiv Detail & Related papers (2024-02-08T18:49:09Z)
- ClipFormer: Key-Value Clipping of Transformers on Memristive Crossbars for Write Noise Mitigation [6.853523674099236]
In-memory computing (IMC) crossbars based on Non-volatile Memories (NVMs) have emerged as a promising solution for accelerating transformers.
We find pre-trained Vision Transformers (ViTs) to be vulnerable on crossbars due to the impact of dynamically generated write noise.
We propose a new memristive crossbar platform to boost the non-ideal accuracies of pre-trained ViT models.
arXiv Detail & Related papers (2024-02-04T19:04:37Z)
- Experts Weights Averaging: A New General Training Scheme for Vision Transformers [57.62386892571636]
We propose a training scheme for Vision Transformers (ViTs) that achieves performance improvement without increasing inference cost.
During training, we replace some Feed-Forward Networks (FFNs) of the ViT with specially designed, more efficient MoEs.
After training, we convert each MoE into an FFN by averaging the experts, transforming the model back into the original ViT for inference (see the sketch after this list).
arXiv Detail & Related papers (2023-08-11T12:05:12Z)
- Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers [107.3726071306935]
We propose a new plug-and-play training framework, SMoE-Dropout, to enable scaling transformers to better accuracy in their full capacity without collapse.
SMoE-Dropout consists of a randomly initialized and fixed router network that activates experts, and it gradually increases the number of activated experts as training progresses.
Our experiments demonstrate the superior performance and substantial computation savings of SMoE-Dropout, compared to dense training baselines with equivalent parameter counts.
arXiv Detail & Related papers (2023-03-02T22:12:51Z)
- Adaptive Attention Link-based Regularization for Vision Transformers [6.6798113365140015]
We present a regularization technique to improve the training efficiency of Vision Transformers (ViTs).
The trainable links are referred to as the attention augmentation module, which is trained simultaneously with ViT.
We can extract the relevant relationship between each CNN activation map and each ViT attention head, and based on this, we also propose an advanced attention augmentation module.
arXiv Detail & Related papers (2022-11-25T01:26:43Z)
- Automated Progressive Learning for Efficient Training of Vision Transformers [125.22744987949227]
Vision Transformers (ViTs) have come with a voracious appetite for computing power, highlighting the urgent need to develop efficient training methods for ViTs.
Progressive learning, a training scheme where the model capacity grows progressively during training, has started showing its ability in efficient training.
In this paper, we take a practical step towards efficient training of ViTs by customizing and automating progressive learning.
arXiv Detail & Related papers (2022-03-28T05:37:08Z)
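To make the expert-averaging conversion described in the "Experts Weights Averaging" entry above concrete, here is a minimal sketch: a set of identically shaped expert FFNs is collapsed into a single FFN whose weights are the element-wise mean of the experts' weights. The helper name, the toy FFN layout, and the unweighted mean are illustrative assumptions rather than that paper's exact procedure.

```python
import copy
from typing import List

import torch
import torch.nn as nn


def average_experts_into_ffn(experts: List[nn.Module]) -> nn.Module:
    """Collapse identically shaped expert FFNs into one FFN by averaging
    their parameters element-wise (hypothetical helper for illustration)."""
    assert len(experts) > 0, "need at least one expert"
    merged = copy.deepcopy(experts[0])
    state = merged.state_dict()
    for name in state:
        state[name] = torch.stack(
            [expert.state_dict()[name] for expert in experts], dim=0
        ).mean(dim=0)
    merged.load_state_dict(state)
    return merged


# Toy usage: four two-layer expert MLPs with a DeiT-small-like width.
experts = [
    nn.Sequential(nn.Linear(384, 1536), nn.GELU(), nn.Linear(1536, 384))
    for _ in range(4)
]
ffn = average_experts_into_ffn(experts)  # single FFN used at inference time
```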