Accelerating Training of Transformer-Based Language Models with
Progressive Layer Dropping
- URL: http://arxiv.org/abs/2010.13369v1
- Date: Mon, 26 Oct 2020 06:50:07 GMT
- Title: Accelerating Training of Transformer-Based Language Models with
Progressive Layer Dropping
- Authors: Minjia Zhang and Yuxiong He
- Abstract summary: The proposed method achieves a 24% time reduction on average per sample and allows the pre-training to be 2.5 times faster than the baseline.
While being faster, our pre-trained models retain strong knowledge transferability, achieving comparable and sometimes higher GLUE scores than the baseline.
- Score: 24.547833264405355
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Transformer-based language models have demonstrated remarkable
performance across many NLP domains. However, the unsupervised pre-training
step of these models suffers from unbearable overall computational expenses.
Current methods for accelerating the pre-training either rely on massive
parallelism with advanced hardware or are not applicable to language modeling.
In this work, we propose a method based on progressive layer dropping that
speeds up the training of Transformer-based language models, not through
excessive hardware resources but through efficiency gained from changes to the
model architecture and the training technique. Extensive experiments on BERT show that the
proposed method achieves a 24% time reduction on average per sample and allows
the pre-training to be 2.5 times faster than the baseline to get a similar
accuracy on downstream tasks. While being faster, our pre-trained models are
equipped with strong knowledge transferability, achieving comparable and
sometimes higher GLUE scores than the baseline when pre-trained with the same
number of samples.
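To make the idea concrete, progressive layer dropping stochastically skips Transformer layers during pre-training, with a keep probability that starts near 1 and decays both over training steps and with layer depth. The snippet below is a minimal PyTorch-style sketch of that schedule, not the authors' released implementation; the class name, the constants `theta_bar` and `gamma`, and the assumption that each block computes only the residual branch of a layer are illustrative choices.

```python
import math
import torch
from torch import nn

class ProgressiveLayerDrop(nn.Module):
    """Minimal sketch of progressive layer dropping (hypothetical class).

    Assumes each entry of `blocks` computes only the residual branch of a
    Transformer layer (attention + feed-forward without the skip), so a
    dropped layer degenerates to the identity mapping.
    """

    def __init__(self, blocks, theta_bar=0.5, gamma=1e-4):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.theta_bar = theta_bar   # asymptotic average keep probability
        self.gamma = gamma           # decay rate of the temporal schedule
        self.register_buffer("step", torch.zeros((), dtype=torch.long))

    def keep_prob(self, layer_idx):
        # Temporal schedule: keep probability decays from 1.0 toward theta_bar.
        theta_t = (1.0 - self.theta_bar) * math.exp(-self.gamma * float(self.step)) + self.theta_bar
        # Depth schedule: deeper layers are dropped more aggressively.
        return 1.0 - (layer_idx + 1) / len(self.blocks) * (1.0 - theta_t)

    def forward(self, x):
        if self.training:
            self.step += 1
        for i, block in enumerate(self.blocks):
            p = self.keep_prob(i)
            if self.training:
                if torch.rand(()).item() < p:
                    # Inverted scaling keeps the expected output consistent with inference.
                    x = x + block(x) / p
                # Otherwise the layer is skipped for this step (identity shortcut).
            else:
                x = x + block(x)
        return x
```

At inference every layer runs; the training-time skips are where the per-sample time savings come from.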
Related papers
- DiJiang: Efficient Large Language Models through Compact Kernelization [30.24187657746638]
We present a novel Frequency Domain Kernelization approach that enables the transformation of a pre-trained vanilla Transformer into a linear-complexity model with little training cost.
Experiments demonstrate that the proposed method achieves comparable performance to the original Transformer, but with significantly reduced training costs and much faster inference speeds.
arXiv Detail & Related papers (2024-03-29T02:32:15Z) - Preparing Lessons for Progressive Training on Language Models [75.88952808979087]
The rapid progress of Transformers in artificial intelligence has come at the cost of increased resource consumption and greenhouse gas emissions.
We propose Apollo, which prepares lessons for expanding operations by layer functionality during training of low layers.
Experiments demonstrate that Apollo achieves state-of-the-art acceleration ratios, even rivaling methods using pretrained models.
arXiv Detail & Related papers (2024-01-17T13:04:14Z) - Fast Propagation is Better: Accelerating Single-Step Adversarial
Training via Sampling Subnetworks [69.54774045493227]
A drawback of adversarial training is the computational overhead introduced by the generation of adversarial examples.
We propose to exploit the interior building blocks of the model to improve efficiency.
Compared with previous methods, our method not only reduces the training cost but also achieves better model robustness.
arXiv Detail & Related papers (2023-10-24T01:36:20Z) - An Emulator for Fine-Tuning Large Language Models using Small Language
Models [91.02498576056057]
We introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates the result of pre-training and fine-tuning at different scales.
We show that EFT enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training.
Finally, a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling them with small fine-tuned models.
arXiv Detail & Related papers (2023-10-19T17:57:16Z) - Efficient Training of Neural Transducer for Speech Recognition [44.99337868233026]
We propose an efficient 3-stage progressive training pipeline to build high-performing neural transducer models from scratch.
The proposed pipeline is able to train transducer models approaching state-of-the-art performance with a single GPU in just 2-3 weeks.
arXiv Detail & Related papers (2022-04-22T09:16:51Z) - MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided
Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z) - ScaLA: Accelerating Adaptation of Pre-Trained Transformer-Based Language
Models via Efficient Large-Batch Adversarial Noise [20.779167087445995]
Large pretrained Transformer-based language models have led to dramatic improvements in many natural language understanding tasks.
ScaLA is a novel and efficient method to accelerate the adaptation of pre-trained Transformer networks.
Experiment results show that ScaLA attains 2.7-9.8x adaptation speedups over the baseline on GLUE for BERT-base and RoBERTa-large.
arXiv Detail & Related papers (2022-01-29T01:47:01Z) - Improving Non-autoregressive Generation with Mixup Training [51.61038444990301]
We present a non-autoregressive generation model based on pre-trained transformer models.
We propose a simple and effective iterative training method called MIx Source and pseudo Target.
Our experiments on three generation benchmarks including question generation, summarization and paraphrase generation, show that the proposed framework achieves the new state-of-the-art results.
arXiv Detail & Related papers (2021-10-21T13:04:21Z) - bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing models of almost half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z) - Progressively Stacking 2.0: A Multi-stage Layerwise Training Method for
BERT Training Speedup [13.50984315473865]
We propose an efficient multi-stage layerwise training (MSLT) approach to reduce the training time of BERT.
In the proposed training strategy, only the top few layers participate in backward computation, while most layers take part only in the forward computation (a minimal sketch of this pattern follows the list below).
Experimental results show that the proposed method can achieve more than 110% training speedup without significant performance degradation.
arXiv Detail & Related papers (2020-11-27T10:00:22Z)
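The MSLT entry above notes that only the top layers take part in backward computation while the lower layers run forward-only. Below is a minimal, hypothetical PyTorch sketch of that freezing pattern; the helper name `configure_mslt_stage` and its arguments are assumptions for illustration, not the authors' code.

```python
import torch
from torch import nn

def configure_mslt_stage(layers: nn.ModuleList, num_trainable_top_layers: int):
    """Freeze all but the top `num_trainable_top_layers` Transformer blocks.

    Frozen blocks still run in the forward pass, but their parameters
    receive no gradients and hold no optimizer state, which is where the
    backward-pass savings described in the MSLT summary come from.
    """
    cutoff = len(layers) - num_trainable_top_layers
    for idx, layer in enumerate(layers):
        trainable = idx >= cutoff
        for param in layer.parameters():
            param.requires_grad = trainable
    # Only the still-trainable parameters are handed to the optimizer.
    trainable_params = [p for layer in layers for p in layer.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable_params, lr=1e-4)
```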