Accelerated Large Batch Optimization of BERT Pretraining in 54 minutes
- URL: http://arxiv.org/abs/2006.13484v2
- Date: Fri, 18 Sep 2020 08:46:52 GMT
- Title: Accelerated Large Batch Optimization of BERT Pretraining in 54 minutes
- Authors: Shuai Zheng and Haibin Lin and Sheng Zha and Mu Li
- Abstract summary: We propose an accelerated gradient method called LANS to improve the efficiency of using large mini-batches for training.
It takes 54 minutes on 192 AWS EC2 P3dn.24xlarge instances to achieve a target F1 score of 90.5 or higher on SQuAD v1.1, achieving the fastest BERT training time in the cloud.
- Score: 9.213729275749452
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: BERT has recently attracted a lot of attention in natural language
understanding (NLU) and achieved state-of-the-art results in various NLU tasks.
However, its success requires large deep neural networks and huge amounts of
data, which result in long training times and impede development progress.
Using stochastic gradient methods with large mini-batches has been advocated as
an efficient tool to reduce training time. Along this line of research, LAMB is
a prominent example that reduces the training time of BERT from 3 days to 76
minutes on a TPUv3 Pod. In this paper, we propose an accelerated gradient
method called LANS to improve the efficiency of using large mini-batches for
training. Since the learning rate is theoretically upper bounded by the inverse
of the Lipschitz constant of the objective function, one cannot always reduce
the number of optimization iterations simply by selecting a larger learning
rate. To use larger mini-batch sizes without accuracy loss, we develop a new
learning rate scheduler that overcomes the difficulty of using a large learning
rate. Using the proposed LANS method and the learning rate scheme, we scale the
mini-batch sizes up to 96K and 33K in phases 1 and 2 of BERT pretraining,
respectively. It takes 54 minutes on 192 AWS EC2 P3dn.24xlarge instances to
reach a target F1 score of 90.5 or higher on SQuAD v1.1, achieving the fastest
BERT training time in the cloud.
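The abstract names LAMB as the starting point and LANS as an accelerated variant, but it does not spell out either update rule. Purely as a hedged illustration of the recipe this line of work builds on, the sketch below shows a LAMB-style step: Adam-style moments per parameter block, followed by a layer-wise trust ratio that rescales the step by the ratio of the parameter norm to the update norm, which is what lets very large batches use large learning rates. The function name, hyperparameter values, and single-tensor interface are assumptions for illustration; the actual LANS update and the new learning rate scheduler are defined in the paper itself.

```python
# Hedged sketch of a LAMB-style layer-wise adaptive step (illustration only;
# not the paper's LANS implementation). Requires PyTorch.
import torch


def lamb_style_step(param, grad, state, lr=1e-3, betas=(0.9, 0.999),
                    eps=1e-6, weight_decay=0.01):
    """Apply one update to a single parameter tensor (one 'layer' block)."""
    state["t"] += 1
    t = state["t"]
    m, v = state["m"], state["v"]

    # Adam-style first/second moments with bias correction.
    m.mul_(betas[0]).add_(grad, alpha=1 - betas[0])
    v.mul_(betas[1]).addcmul_(grad, grad, value=1 - betas[1])
    m_hat = m / (1 - betas[0] ** t)
    v_hat = v / (1 - betas[1] ** t)

    # Adam direction plus decoupled weight decay.
    update = m_hat / (v_hat.sqrt() + eps) + weight_decay * param

    # Layer-wise trust ratio: the step size tracks the parameter norm instead
    # of raw gradient statistics, keeping large-batch/large-LR updates stable.
    w_norm, u_norm = param.norm(), update.norm()
    if w_norm > 0 and u_norm > 0:
        trust_ratio = (w_norm / u_norm).item()
    else:
        trust_ratio = 1.0

    param.add_(update, alpha=-lr * trust_ratio)


# Toy usage on a single random weight matrix.
w = torch.randn(768, 768)
g = torch.randn_like(w)
state = {"m": torch.zeros_like(w), "v": torch.zeros_like(w), "t": 0}
lamb_style_step(w, g, state)
```

In a full optimizer this step runs once per parameter block (e.g. per weight matrix), which is what makes the rate scaling layer-wise.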
Related papers
- Breaking MLPerf Training: A Case Study on Optimizing BERT [9.486916730173661]
We present novel approaches for fast large-scale training of the BERT model.
Load balancing is imperative in distributed BERT training since its training is characterized by samples of widely varying lengths.
We propose two new ideas: (1) local presorting based on dataset stratification for load balancing, and (2) bucket-wise gradient clipping before allreduce (a hedged sketch of the clipping idea appears after this list).
arXiv Detail & Related papers (2024-02-04T11:12:17Z)
- Towards Memory- and Time-Efficient Backpropagation for Training Spiking Neural Networks [70.75043144299168]
Spiking Neural Networks (SNNs) are promising energy-efficient models for neuromorphic computing.
We propose the Spatial Learning Through Time (SLTT) method that can achieve high performance while greatly improving training efficiency.
Our method achieves state-of-the-art accuracy on ImageNet, while the memory cost and training time are reduced by more than 70% and 50%, respectively, compared with BPTT.
arXiv Detail & Related papers (2023-02-28T05:01:01Z)
- Influence-Based Mini-Batching for Graph Neural Networks [0.0]
We propose influence-based mini-batching (IBMB) for graph neural networks.
IBMB accelerates inference by up to 130x compared to previous methods.
This results in up to 18x faster training per epoch and up to 17x faster convergence per runtime compared to previous methods.
arXiv Detail & Related papers (2022-12-18T13:27:01Z)
- Q-Ensemble for Offline RL: Don't Scale the Ensemble, Scale the Batch Size [58.762959061522736]
We show that scaling mini-batch sizes with appropriate learning rate adjustments can speed up the training process by orders of magnitude.
Scaling the mini-batch size and naively adjusting the learning rate allows for (1) a reduced size of the Q-ensemble, (2) stronger penalization of out-of-distribution actions, and (3) improved convergence time (a hedged sketch of the usual linear learning-rate scaling rule appears after this list).
arXiv Detail & Related papers (2022-11-20T21:48:25Z)
- Automated Learning Rate Scheduler for Large-batch Training [24.20872850681828]
Large-batch training has been essential in leveraging large-scale datasets and models in deep learning.
It often requires a specially designed learning rate (LR) schedule to achieve a comparable level of performance as in smaller batch training.
We propose an automated LR scheduling algorithm which is effective for neural network training with a large batch size under the given epoch budget.
arXiv Detail & Related papers (2021-07-13T05:23:13Z)
- Concurrent Adversarial Learning for Large-Batch Training [83.55868483681748]
Adversarial learning is a natural choice for smoothing the decision surface and biasing towards a flat region.
We propose a novel Concurrent Adversarial Learning (ConAdv) method that decouples the sequential gradient computations in adversarial learning by utilizing stale parameters.
This is the first work that successfully scales the ResNet-50 training batch size to 96K.
arXiv Detail & Related papers (2021-06-01T04:26:02Z)
- EarlyBERT: Efficient BERT Training via Early-bird Lottery Tickets [106.79387235014379]
EarlyBERT is a general computationally-efficient training algorithm applicable to both pre-training and fine-tuning of large-scale language models.
We are the first to identify structured winning tickets in the early stage of BERT training, and use them for efficient training.
EarlyBERT easily achieves comparable performance to standard BERT with 35-45% less training time.
arXiv Detail & Related papers (2020-12-31T20:38:20Z)
- FracTrain: Fractionally Squeezing Bit Savings Both Temporally and Spatially for Efficient DNN Training [81.85361544720885]
We propose FracTrain, which integrates progressive fractional quantization that gradually increases the precision of activations, weights, and gradients (a toy precision schedule illustrating this idea appears after this list).
FracTrain reduces the computational cost and hardware-quantified energy/latency of DNN training while achieving comparable or better (-0.12% to +1.87%) accuracy.
arXiv Detail & Related papers (2020-12-24T05:24:10Z)
- Progressively Stacking 2.0: A Multi-stage Layerwise Training Method for BERT Training Speedup [13.50984315473865]
We propose an efficient multi-stage layerwise training (MSLT) approach to reduce the training time of BERT.
In the proposed training strategy, only the top few layers participate in the backward computation, while most layers participate only in the forward computation (a minimal layer-freezing sketch appears after this list).
Experimental results show that the proposed method can achieve more than 110% training speedup without significant performance degradation.
arXiv Detail & Related papers (2020-11-27T10:00:22Z)
- Large Batch Training Does Not Need Warmup [111.07680619360528]
Training deep neural networks using a large batch size has shown promising results and benefits many real-world applications.
In this paper, we propose a novel Complete Layer-wise Adaptive Rate Scaling (CLARS) algorithm for large-batch training.
Based on our analysis, we bridge the gap and illustrate the theoretical insights for three popular large-batch training techniques.
arXiv Detail & Related papers (2020-02-04T23:03:12Z)
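For the "Breaking MLPerf Training" entry above, bucket-wise gradient clipping before allreduce can be pictured as clipping each bucket of gradients locally before communication, rather than applying one global clip after it. The bucketing (caller-supplied parameter groups), the per-bucket threshold, and the PyTorch-distributed setting below are illustrative assumptions; the paper's exact rule is not given in its abstract.

```python
# Hedged sketch of bucket-wise gradient clipping before allreduce.
# Assumes torch.distributed has already been initialised
# (dist.init_process_group) and that .grad fields are populated.
import torch
import torch.distributed as dist


def clip_buckets_then_allreduce(buckets, max_norm=1.0):
    """buckets: list of lists of parameters forming one gradient bucket each."""
    for bucket in buckets:
        grads = [p.grad for p in bucket if p.grad is not None]
        if not grads:
            continue
        # Clip this bucket's local gradient norm *before* communication,
        # instead of a single global clip after the allreduce.
        total_norm = torch.norm(torch.stack([g.norm() for g in grads]))
        scale = max_norm / (total_norm + 1e-6)
        if scale < 1.0:
            for g in grads:
                g.mul_(scale)
        # Average the clipped bucket across workers.
        for g in grads:
            dist.all_reduce(g, op=dist.ReduceOp.SUM)
            g.div_(dist.get_world_size())
```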
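For the Q-Ensemble entry above, "naively adjusting the learning rate" when the mini-batch grows is commonly done with the linear scaling rule. Whether that paper uses exactly this rule is not stated in the summary, so take this as a generic illustration.

```python
# Hedged illustration of the linear learning-rate scaling rule for larger
# mini-batches; the exact adjustment in the Q-ensemble paper may differ.
def scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    return base_lr * new_batch / base_batch


# e.g. a rate tuned at batch 256, scaled up to batch 4096:
print(scaled_lr(3e-4, 256, 4096))  # 0.0048
```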
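For the FracTrain entry above, the core scheduling idea of gradually raising precision as training progresses can be illustrated with a toy bit-width schedule. The bounds and the linear shape are assumptions; FracTrain's actual fractional quantization is more involved than this.

```python
# Toy progressive precision schedule: bit width grows with training progress.
def precision_bits(progress: float, min_bits: int = 4, max_bits: int = 8) -> int:
    """Map training progress in [0, 1] to an integer bit width."""
    progress = min(max(progress, 0.0), 1.0)
    return min_bits + round(progress * (max_bits - min_bits))


# Early, middle, and late in training:
print([precision_bits(p) for p in (0.0, 0.5, 1.0)])  # [4, 6, 8]
```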
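For the Progressively Stacking 2.0 (MSLT) entry above, "only the top few layers participate in backward computation" corresponds, in PyTorch terms, to freezing the lower layers so they run forward only. The module layout and top_k value are illustrative, not the paper's code.

```python
# Hedged sketch: freeze all but the top-k layers so lower layers run
# forward-only and receive no gradients.
import torch.nn as nn


def freeze_all_but_top(encoder_layers: nn.ModuleList, top_k: int = 3):
    for i, layer in enumerate(encoder_layers):
        requires_grad = i >= len(encoder_layers) - top_k
        for p in layer.parameters():
            p.requires_grad = requires_grad


# Toy 12-layer "encoder": only the top 3 layers get gradients.
layers = nn.ModuleList([nn.Linear(768, 768) for _ in range(12)])
freeze_all_but_top(layers, top_k=3)
```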
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.