EarlyBERT: Efficient BERT Training via Early-bird Lottery Tickets
- URL: http://arxiv.org/abs/2101.00063v1
- Date: Thu, 31 Dec 2020 20:38:20 GMT
- Title: EarlyBERT: Efficient BERT Training via Early-bird Lottery Tickets
- Authors: Xiaohan Chen, Yu Cheng, Shuohang Wang, Zhe Gan, Zhangyang Wang,
Jingjing Liu
- Abstract summary: EarlyBERT is a general computationally-efficient training algorithm applicable to both pre-training and fine-tuning of large-scale language models.
We are the first to identify structured winning tickets in the early stage of BERT training, and use them for efficient training.
EarlyBERT easily achieves comparable performance to standard BERT with 35~45% less training time.
- Score: 106.79387235014379
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep, heavily overparameterized language models such as BERT, XLNet and T5
have achieved impressive success in many NLP tasks. However, their high model
complexity requires enormous computation resources and extremely long training
time for both pre-training and fine-tuning. Many works have studied model
compression for large NLP models, but they focus only on reducing inference
cost/time and still require an expensive training process. Other works use
extremely large batch sizes to shorten the pre-training time at the expense of
high demand for computation resources. In this paper, inspired by the
Early-Bird Lottery Tickets studied for computer vision tasks, we propose
EarlyBERT, a general computationally-efficient training algorithm applicable to
both pre-training and fine-tuning of large-scale language models. We are the
first to identify structured winning tickets in the early stage of BERT
training, and use them for efficient training. Comprehensive pre-training and
fine-tuning experiments on GLUE and SQuAD downstream tasks show that EarlyBERT
easily achieves comparable performance to standard BERT with 35~45% less
training time.
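The abstract does not detail how the structured tickets are found. As a rough illustration, the sketch below assumes learnable per-head gate coefficients trained briefly under an L1 penalty and then thresholded into a structured head mask; the module, function names, and hyperparameters are illustrative assumptions, not EarlyBERT's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedMultiHeadAttention(nn.Module):
    """Toy self-attention with one learnable gate per head.

    The gates stand in for structured "ticket" scores: after a short
    training phase with an L1 penalty on the gates, heads with the
    smallest gate magnitudes are pruned. Hypothetical sketch only.
    """

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.head_gates = nn.Parameter(torch.ones(num_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape each of q, k, v to (batch, heads, seq, head_dim)
        q, k, v = (z.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
                   for z in (q, k, v))
        attn = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        heads = attn @ v                                    # (batch, heads, seq, head_dim)
        heads = heads * self.head_gates.view(1, -1, 1, 1)   # gate each head's output
        return self.proj(heads.transpose(1, 2).reshape(b, t, d))


def l1_gate_penalty(model: nn.Module, coeff: float = 1e-4) -> torch.Tensor:
    """Sparsity-inducing L1 penalty over all head gates in the model."""
    gates = [m.head_gates for m in model.modules()
             if isinstance(m, GatedMultiHeadAttention)]
    return coeff * torch.cat(gates).abs().sum()


def early_bird_head_mask(model: nn.Module, keep_ratio: float = 0.5) -> torch.Tensor:
    """Structured mask over heads: keep the globally largest-magnitude gates."""
    gates = torch.cat([m.head_gates.detach().abs() for m in model.modules()
                       if isinstance(m, GatedMultiHeadAttention)])
    k = max(1, int(keep_ratio * gates.numel()))
    threshold = gates.topk(k).values.min()
    return gates >= threshold
```

In this reading, training runs only briefly with `l1_gate_penalty` added to the task loss, after which `early_bird_head_mask` fixes which heads survive, and the pruned model is trained to completion.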
Related papers
- Knowledge Distillation as Efficient Pre-training: Faster Convergence,
Higher Data-efficiency, and Better Transferability [53.27240222619834]
Knowledge Distillation as Efficient Pre-training aims to efficiently transfer the learned feature representation from pre-trained models to new student models for future downstream tasks.
Our method performs comparably with supervised pre-training counterparts on 3 downstream tasks and 9 downstream datasets, while requiring 10x less data and 5x less pre-training time.
arXiv Detail & Related papers (2022-03-10T06:23:41Z)
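The KDEP summary above centers on transferring the teacher's learned feature representation to a student. A minimal, hypothetical feature-matching loss in PyTorch is sketched below; the projection layer, feature dimensions, and MSE objective are assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def feature_distillation_loss(student_feats: torch.Tensor,
                              teacher_feats: torch.Tensor,
                              proj: nn.Module) -> torch.Tensor:
    """Match projected student features to frozen (detached) teacher features."""
    return F.mse_loss(proj(student_feats), teacher_feats.detach())


# Illustrative usage with random tensors standing in for backbone outputs.
student_feats = torch.randn(8, 384, requires_grad=True)  # hypothetical student dim
teacher_feats = torch.randn(8, 768)                      # hypothetical teacher dim
proj = nn.Linear(384, 768)                               # bridges the dimension gap
loss = feature_distillation_loss(student_feats, teacher_feats, proj)
loss.backward()
```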
- bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% computational cost of pre-training BERT_BASE and GPT_BASE by reusing the models of almost their half sizes.
arXiv Detail & Related papers (2021-10-14T04:05:25Z)
- Curriculum Learning: A Regularization Method for Efficient and Stable Billion-Scale GPT Model Pre-Training [18.640076155697415]
We present a study of a curriculum learning based approach, which helps improve the pre-training convergence speed of autoregressive models.
Our evaluations demonstrate that curriculum learning enables training GPT-2 models with 8x larger batch size and 4x larger learning rate.
arXiv Detail & Related papers (2021-08-13T06:32:53Z)
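The curriculum-learning summary above does not say which difficulty axis the schedule uses. The sketch below assumes a sequence-length warmup, one common choice for autoregressive pre-training; the schedule shape and constants are illustrative assumptions.

```python
def curriculum_seq_len(step: int,
                       warmup_steps: int = 10_000,
                       min_len: int = 64,
                       max_len: int = 2048) -> int:
    """Linearly grow the training sequence length over a warmup phase.

    Shorter sequences early in training act as "easy" examples, which is
    one way a curriculum can stabilize large-batch / large-LR training.
    """
    if step >= warmup_steps:
        return max_len
    frac = step / warmup_steps
    return int(min_len + frac * (max_len - min_len))


# Example: truncate each batch to the current curriculum length.
# tokens: LongTensor of shape (batch, max_len) from the data loader
# tokens = tokens[:, :curriculum_seq_len(global_step)]
```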
- Self-Supervised Pretraining Improves Self-Supervised Pretraining [83.1423204498361]
Self-supervised pretraining requires expensive and lengthy computation, large amounts of data, and is sensitive to data augmentation.
This paper explores Hierarchical PreTraining (HPT), which decreases convergence time and improves accuracy by initializing the pretraining process with an existing pretrained model.
We show HPT converges up to 80x faster, improves accuracy across tasks, and improves the robustness of the self-supervised pretraining process to changes in the image augmentation policy or amount of pretraining data.
arXiv Detail & Related papers (2021-03-23T17:37:51Z)
- Progressively Stacking 2.0: A Multi-stage Layerwise Training Method for BERT Training Speedup [13.50984315473865]
We propose an efficient multi-stage layerwise training (MSLT) approach to reduce the training time of BERT.
In the proposed training strategy, only the top few layers participate in backward computation, while most layers participate only in forward computation.
Experimental results show that the proposed method can achieve more than 110% training speedup without significant performance degradation.
arXiv Detail & Related papers (2020-11-27T10:00:22Z)
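The MSLT summary above describes most layers running forward-only while the top layers also run backward. A minimal PyTorch sketch of that freezing pattern follows; the toy encoder and layer counts are assumptions, and the paper's multi-stage schedule is omitted.

```python
import torch.nn as nn


def freeze_lower_layers(encoder_layers: nn.ModuleList, num_trainable_top: int) -> None:
    """Keep gradients only for the top `num_trainable_top` layers.

    Frozen lower layers still run in the forward pass, but autograd skips
    gradient computation for their parameters, matching the pattern in the
    summary (most layers forward-only, top layers also backward).
    """
    cutoff = len(encoder_layers) - num_trainable_top
    for i, layer in enumerate(encoder_layers):
        trainable = i >= cutoff
        for p in layer.parameters():
            p.requires_grad = trainable


# Illustrative usage with a toy 12-layer Transformer encoder stack.
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True)
    for _ in range(12)
)
freeze_lower_layers(layers, num_trainable_top=3)  # only the top 3 layers get gradients
```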
- CoRe: An Efficient Coarse-refined Training Framework for BERT [17.977099111813644]
We propose a novel coarse-refined training framework named CoRe to speed up the training of BERT.
In the first phase, we construct a relaxed BERT model which has far fewer parameters and much lower model complexity than the original BERT.
In the second phase, we transform the trained relaxed BERT model into the original BERT and further retrain the model.
arXiv Detail & Related papers (2020-11-27T09:49:37Z)
- Improving NER's Performance with Massive financial corpus [6.935911489364734]
Training large deep neural networks needs massive high-quality annotation data, but the time and labor costs are too high for small businesses.
We start a company-name recognition task with small-scale, low-quality training data, then use techniques to improve model training speed and prediction performance with minimal labor cost.
arXiv Detail & Related papers (2020-07-31T07:00:34Z)
- The Lottery Ticket Hypothesis for Pre-trained BERT Networks [137.99328302234338]
In natural language processing (NLP), enormous pre-trained models like BERT have become the standard starting point for training.
In parallel, work on the lottery ticket hypothesis has shown that models for NLP and computer vision contain smaller matching subnetworks capable of training in isolation to full accuracy.
We combine these observations to assess whether such trainable, transferrable subnetworks exist in pre-trained BERT models.
arXiv Detail & Related papers (2020-07-23T19:35:39Z)
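The lottery-ticket summary above concerns sparse matching subnetworks. The sketch below shows the global magnitude-pruning step that commonly produces such subnetworks; the rewind-and-retrain loop and the paper's specific sparsity levels are omitted, so treat this as an illustration rather than their exact procedure.

```python
import torch
import torch.nn as nn


def magnitude_prune_masks(model: nn.Module, sparsity: float) -> dict:
    """Global unstructured magnitude pruning over all weight matrices.

    Returns a {parameter name: 0/1 mask} dictionary that zeroes the
    globally smallest-magnitude weights.
    """
    weights = {name: p for name, p in model.named_parameters() if p.dim() >= 2}
    all_vals = torch.cat([p.detach().abs().flatten() for p in weights.values()])
    k = int(sparsity * all_vals.numel())
    threshold = all_vals.kthvalue(k).values if k > 0 else all_vals.min() - 1
    return {name: (p.detach().abs() > threshold).float()
            for name, p in weights.items()}


def apply_masks(model: nn.Module, masks: dict) -> None:
    """Zero the pruned weights in place, leaving a sparse subnetwork."""
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in masks:
                p.mul_(masks[name])
```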
- The Right Tool for the Job: Matching Model and Instance Complexities [62.95183777679024]
As NLP models become larger, executing a trained model requires significant computational resources, incurring monetary and environmental costs.
We propose a modification to contextual representation fine-tuning which, during inference, allows for an early (and fast) "exit" from neural network calculations for simple instances.
We test our proposed modification on five different datasets in two tasks: three text classification datasets and two natural language inference benchmarks.
arXiv Detail & Related papers (2020-04-16T04:28:08Z)
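The early-exit summary above describes letting simple instances leave the network early at inference time. Below is a hedged sketch of one confidence-threshold exit scheme; the mean-pooling, threshold, and per-layer classifiers are illustrative assumptions, not the paper's calibrated exit criterion.

```python
import torch
import torch.nn as nn


class EarlyExitEncoder(nn.Module):
    """Toy Transformer encoder with a classifier after every layer.

    At inference time we stop at the first layer whose classifier is
    confident enough, so easy inputs use fewer layers than hard ones.
    Hypothetical sketch; the paper's exit criterion may differ.
    """

    def __init__(self, dim: int = 256, num_layers: int = 6, num_classes: int = 3):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            for _ in range(num_layers)
        )
        self.exits = nn.ModuleList(nn.Linear(dim, num_classes)
                                   for _ in range(num_layers))

    @torch.no_grad()
    def predict(self, x: torch.Tensor, confidence_threshold: float = 0.9) -> torch.Tensor:
        for layer, exit_head in zip(self.layers, self.exits):
            x = layer(x)                                        # x: (batch, seq, dim)
            probs = exit_head(x.mean(dim=1)).softmax(dim=-1)    # pool over tokens
            conf, pred = probs.max(dim=-1)
            if conf.min() >= confidence_threshold:              # whole batch is confident
                return pred
        return pred                                             # fall back to the last layer
```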
This list is automatically generated from the titles and abstracts of the papers on this site.