Progressively Stacking 2.0: A Multi-stage Layerwise Training Method for
BERT Training Speedup
- URL: http://arxiv.org/abs/2011.13635v1
- Date: Fri, 27 Nov 2020 10:00:22 GMT
- Title: Progressively Stacking 2.0: A Multi-stage Layerwise Training Method for
BERT Training Speedup
- Authors: Cheng Yang, Shengnan Wang, Chao Yang, Yuechuan Li, Ru He, Jingqiao
Zhang
- Abstract summary: We propose an efficient multi-stage layerwise training (MSLT) approach to reduce the training time of BERT.
In the proposed training strategy, only the top few layers participate in backward computation, while most layers participate only in forward computation.
Experimental results show that the proposed method can achieve more than 110% training speedup without significant performance degradation.
- Score: 13.50984315473865
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Pre-trained language models, such as BERT, have achieved significant
accuracy gains in many natural language processing tasks. Despite their
effectiveness, the huge number of parameters makes training a BERT model
computationally very challenging. In this paper, we propose an efficient
multi-stage layerwise training (MSLT) approach to reduce the training time of
BERT. We decompose the whole training process into several stages. Training
starts from a small model with only a few encoder layers, and we gradually
increase the depth of the model by adding new encoder layers. At each stage, we
train only the newly added top few encoder layers (those near the output
layer); the parameters of the layers trained in previous stages are not updated
in the current stage. In BERT training, the backward computation is much more
time-consuming than the forward computation, especially in the distributed
training setting, where the backward computation time also includes the
communication time for gradient synchronization. In the proposed training
strategy, only the top few layers participate in backward computation, while
most layers participate only in forward computation. Hence both the computation
and communication efficiency are greatly improved. Experimental results show
that the proposed method achieves more than 110% training speedup without
significant performance degradation.
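To make the staged schedule concrete, below is a minimal PyTorch sketch of the MSLT idea, using nn.TransformerEncoderLayer as a stand-in for BERT encoder layers. The stage schedule, model sizes, toy loss, and the choice to keep the output head trainable in every stage are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal MSLT-style sketch (illustrative; not the authors' code).
import torch
import torch.nn as nn

d_model, n_heads, vocab = 256, 4, 30522
stages = [3, 3, 3, 3]                  # assumed schedule: add 3 encoder layers per stage
embed = nn.Embedding(vocab, d_model)
layers = nn.ModuleList()               # encoder stack, grown stage by stage
head = nn.Linear(d_model, vocab)       # output head; kept trainable here (an assumption)

for stage, n_new in enumerate(stages):
    if stage > 0:
        # Freeze everything trained in earlier stages. With requires_grad=False,
        # autograd builds no graph through these modules: forward cost only,
        # and no gradients to compute or synchronize across workers.
        embed.requires_grad_(False)
        layers.requires_grad_(False)

    # Grow the model: new encoder layers are appended near the output.
    layers.extend(nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
                  for _ in range(n_new))
    trainable = [p for m in (embed, layers, head)
                 for p in m.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=1e-4)

    for step in range(10):             # tiny stand-in for one pre-training stage
        tokens = torch.randint(0, vocab, (8, 32))
        h = embed(tokens)
        for layer in layers:           # frozen lower layers run forward-only
            h = layer(h)
        # Toy reconstruction loss standing in for the masked-LM objective.
        loss = nn.functional.cross_entropy(head(h).flatten(0, 1), tokens.flatten())
        optimizer.zero_grad()
        loss.backward()                # backward touches only the new top layers + head
        optimizer.step()
```

Because the frozen embedding and lower layers carry requires_grad=False, backward never traverses them; in a distributed run they also produce no gradients to all-reduce, which is where the claimed computation and communication savings come from.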
Related papers
- Patch-Level Training for Large Language Models [69.67438563485887]
This paper introduces patch-level training for Large Language Models (LLMs).
During patch-level training, we feed the language model shorter sequences of patches and train it to predict the next patch.
Following this, the model continues token-level training on the remaining training data to align with the inference mode.
arXiv Detail & Related papers (2024-07-17T15:48:39Z)
- Breaking MLPerf Training: A Case Study on Optimizing BERT [9.486916730173661]
We present novel approaches for fast large-scale training of the BERT model.
Load balancing is imperative in distributed BERT training, since the training data are characterized by samples of varying lengths.
We propose two new ideas: (1) local presorting based on dataset stratification for load balancing, and (2) bucket-wise gradient clipping before allreduce, loosely illustrated in the sketch at the end of this list.
arXiv Detail & Related papers (2024-02-04T11:12:17Z)
- Towards Memory- and Time-Efficient Backpropagation for Training Spiking Neural Networks [70.75043144299168]
Spiking Neural Networks (SNNs) are promising energy-efficient models for neuromorphic computing.
We propose the Spatial Learning Through Time (SLTT) method that can achieve high performance while greatly improving training efficiency.
Our method achieves state-of-the-art accuracy on ImageNet, while the memory cost and training time are reduced by more than 70% and 50%, respectively, compared with BPTT.
arXiv Detail & Related papers (2023-02-28T05:01:01Z)
- Staged Training for Transformer Language Models [47.99321376123886]
We consider a staged training setup that begins with a small model and incrementally increases the amount of compute used for training.
By initializing each stage with the output of the previous one, the training process effectively re-uses the compute.
We empirically validate our growth operators and staged training for autoregressive language models, showing up to 22% compute savings.
arXiv Detail & Related papers (2022-03-11T19:05:42Z)
- Training ELECTRA Augmented with Multi-word Selection [53.77046731238381]
We present a new text encoder pre-training method that improves ELECTRA based on multi-task learning.
Specifically, we train the discriminator to simultaneously detect replaced tokens and select original tokens from candidate sets.
arXiv Detail & Related papers (2021-05-31T23:19:00Z)
- Large-Scale Training System for 100-Million Classification at Alibaba [43.58719630882661]
Extreme classification has become an essential topic in deep learning.
It is very challenging to train a deep model with millions of classes due to the memory and computation explosion in the last output layer.
First, we build a hybrid parallel training framework to make the training process feasible.
Second, we propose a novel softmax variation named KNN softmax, which reduces both the GPU memory consumption and computation costs.
arXiv Detail & Related papers (2021-02-09T06:53:31Z)
- EarlyBERT: Efficient BERT Training via Early-bird Lottery Tickets [106.79387235014379]
EarlyBERT is a general computationally-efficient training algorithm applicable to both pre-training and fine-tuning of large-scale language models.
We are the first to identify structured winning tickets in the early stage of BERT training, and use them for efficient training.
EarlyBERT easily achieves comparable performance to standard BERT with 35-45% less training time.
arXiv Detail & Related papers (2020-12-31T20:38:20Z)
- FracTrain: Fractionally Squeezing Bit Savings Both Temporally and Spatially for Efficient DNN Training [81.85361544720885]
We propose FracTrain that integrates progressive fractional quantization which gradually increases the precision of activations, weights, and gradients.
FracTrain reduces computational cost and hardware-quantified energy/latency of DNN training while achieving a comparable or better (-0.12% to +1.87%) accuracy.
arXiv Detail & Related papers (2020-12-24T05:24:10Z)
- Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping [24.547833264405355]
The proposed method achieves a 24% time reduction on average per sample and allows the pre-training to be 2.5 times faster than the baseline.
While being faster, our pre-trained models are equipped with strong knowledge transferability, achieving comparable and sometimes higher GLUE score than the baseline.
arXiv Detail & Related papers (2020-10-26T06:50:07Z)
- Accelerated Large Batch Optimization of BERT Pretraining in 54 minutes [9.213729275749452]
We propose an accelerated gradient method called LANS to improve the efficiency of using large mini-batches for training.
It takes 54 minutes on 192 AWS EC2 P3dn.24xlarge instances to achieve a target F1 score of 90.5 or higher on SQuAD v1.1, achieving the fastest BERT training time in the cloud.
arXiv Detail & Related papers (2020-06-24T05:00:41Z)
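As a loose, standalone illustration of the bucket-wise gradient clipping idea named in the "Breaking MLPerf Training" entry above, the sketch below clips each bucket of parameter gradients by its own norm locally and only then averages that bucket across workers. The bucket size, the clipping threshold, and the torch.distributed usage are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of bucket-wise gradient clipping before allreduce
# (one plausible reading of the idea; not the paper's code).
import torch
import torch.distributed as dist

def bucketwise_clip_then_allreduce(params, bucket_size=25, max_norm=1.0):
    """Clip each bucket of parameter gradients locally, then allreduce that bucket."""
    params = [p for p in params if p.grad is not None]
    for start in range(0, len(params), bucket_size):
        bucket = params[start:start + bucket_size]
        # Local, per-bucket clipping: needs no cross-worker communication,
        # unlike a single global-norm clip computed after synchronization.
        torch.nn.utils.clip_grad_norm_(bucket, max_norm)
        # Average the already-clipped gradients of this bucket across workers.
        # Skipped when no process group is initialized, so the sketch runs standalone.
        if dist.is_available() and dist.is_initialized():
            for p in bucket:
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
                p.grad /= dist.get_world_size()
```

In a real distributed run this would be called after backward in place of the default gradient averaging (for example, by accumulating gradients under DistributedDataParallel.no_sync() and then invoking it), followed by the optimizer step.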