Preparing Lessons for Progressive Training on Language Models
- URL: http://arxiv.org/abs/2401.09192v3
- Date: Sat, 10 Feb 2024 14:52:49 GMT
- Title: Preparing Lessons for Progressive Training on Language Models
- Authors: Yu Pan, Ye Yuan, Yichun Yin, Jiaxin Shi, Zenglin Xu, Ming Zhang,
Lifeng Shang, Xin Jiang, Qun Liu
- Abstract summary: The rapid progress of Transformers in artificial intelligence has come at the cost of increased resource consumption and greenhouse gas emissions.
We propose Apollo, which preptextbfares lessons for extextbfpanding textbfoperations by textbflayer functitextbfonality during training of low layers.
Experiments demonstrate that Apollo achieves state-of-the-art acceleration ratios, even rivaling methods using pretrained models.
- Score: 75.88952808979087
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid progress of Transformers in artificial intelligence has come at the
cost of increased resource consumption and greenhouse gas emissions due to
growing model sizes. Prior work suggests using pretrained small models to
improve training efficiency, but this approach may not be suitable for new
model structures. On the other hand, training from scratch can be slow, and
progressively stacking layers often fails to achieve significant acceleration.
To address these challenges, we propose a novel method called Apollo, which
prep\textbf{a}res lessons for ex\textbf{p}anding \textbf{o}perations by
\textbf{l}earning high-\textbf{l}ayer functi\textbf{o}nality during training of
low layers. Our approach involves low-value-prioritized sampling (LVPS) to
train different depths and weight sharing to facilitate efficient expansion. We
also introduce an interpolation method for stable model depth extension.
Experiments demonstrate that Apollo achieves state-of-the-art acceleration
ratios, even rivaling methods using pretrained models, making it a universal
and efficient solution for training deep models while reducing time, financial,
and environmental costs.
Related papers
- AdaRankGrad: Adaptive Gradient-Rank and Moments for Memory-Efficient LLMs Training and Fine-Tuning [9.51289606759621]
Training and fine-tuning large language models (LLMs) come with challenges related to memory and computational requirements.
Various techniques have been developed to tackle these challenges, such as low-rank adaptation (LoRA)
We introduce a new method inspired by a phenomenon we formally prove: as training progresses, the rank of the estimated gradient gradually decreases.
arXiv Detail & Related papers (2024-10-23T13:53:26Z) - SaRA: High-Efficient Diffusion Model Fine-tuning with Progressive Sparse Low-Rank Adaptation [52.6922833948127]
In this work, we investigate the importance of parameters in pre-trained diffusion models.
We propose a novel model fine-tuning method to make full use of these ineffective parameters.
Our method enhances the generative capabilities of pre-trained models in downstream applications.
arXiv Detail & Related papers (2024-09-10T16:44:47Z) - A Multi-Level Framework for Accelerating Training Transformer Models [5.268960238774481]
Training large-scale deep learning models poses an unprecedented demand for computing power.
We propose a multi-level framework for training acceleration based on Coalescing, De-coalescing and Interpolation.
We prove that the proposed framework reduces the computational cost by about 20% on training BERT/GPT-Base models and up to 51.6% on training the BERT-Large model.
arXiv Detail & Related papers (2024-04-07T03:04:34Z) - Progressive Gradient Flow for Robust N:M Sparsity Training in
Transformers [15.27677493050638]
N:M structured sparsity has garnered significant interest as a result of relatively modest overhead and improved efficiency.
There have been efforts to develop training recipes for N:M structured sparsity, they primarily focus on low-sparsity regions.
However, performance of models trained using these approaches tends to decline when confronted with high-sparsity regions.
arXiv Detail & Related papers (2024-02-07T10:55:59Z) - Fast Propagation is Better: Accelerating Single-Step Adversarial
Training via Sampling Subnetworks [69.54774045493227]
A drawback of adversarial training is the computational overhead introduced by the generation of adversarial examples.
We propose to exploit the interior building blocks of the model to improve efficiency.
Compared with previous methods, our method not only reduces the training cost but also achieves better model robustness.
arXiv Detail & Related papers (2023-10-24T01:36:20Z) - Fast-ELECTRA for Efficient Pre-training [83.29484808667532]
ELECTRA pre-trains language models by detecting tokens in a sequence that have been replaced by an auxiliary model.
We propose Fast-ELECTRA, which leverages an existing language model as the auxiliary model.
Our approach rivals the performance of state-of-the-art ELECTRA-style pre-training methods, while significantly eliminating the computation and memory cost brought by the joint training of the auxiliary model.
arXiv Detail & Related papers (2023-10-11T09:55:46Z) - GrowLength: Accelerating LLMs Pretraining by Progressively Growing
Training Length [65.24730341801468]
This paper introduces a novel, simple, and effective method named growlength'' to accelerate the pretraining process of Large Language Models.
Our method progressively increases the training length throughout the pretraining phase, thereby mitigating computational costs and enhancing efficiency.
arXiv Detail & Related papers (2023-10-01T05:25:24Z) - COST-EFF: Collaborative Optimization of Spatial and Temporal Efficiency
with Slenderized Multi-exit Language Models [16.586312156966635]
Transformer-based pre-trained language models (PLMs) mostly suffer from excessive overhead despite their advanced capacity.
Existing statically compressed models are unaware of the diverse complexities between input instances.
We propose a collaborative optimization for PLMs that integrates static model compression and dynamic inference acceleration.
arXiv Detail & Related papers (2022-10-27T15:06:40Z) - Online Convolutional Re-parameterization [51.97831675242173]
We present online convolutional re- parameterization (OREPA), a two-stage pipeline, aiming to reduce the huge training overhead by squeezing the complex training-time block into a single convolution.
Compared with the state-of-the-art re-param models, OREPA is able to save the training-time memory cost by about 70% and accelerate the training speed by around 2x.
We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
arXiv Detail & Related papers (2022-04-02T09:50:19Z) - Accelerating Training of Transformer-Based Language Models with
Progressive Layer Dropping [24.547833264405355]
The proposed method achieves a 24% time reduction on average per sample and allows the pre-training to be 2.5 times faster than the baseline.
While being faster, our pre-trained models are equipped with strong knowledge transferability, achieving comparable and sometimes higher GLUE score than the baseline.
arXiv Detail & Related papers (2020-10-26T06:50:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.