Breaking MLPerf Training: A Case Study on Optimizing BERT
- URL: http://arxiv.org/abs/2402.02447v1
- Date: Sun, 4 Feb 2024 11:12:17 GMT
- Title: Breaking MLPerf Training: A Case Study on Optimizing BERT
- Authors: Yongdeok Kim, Jaehyung Ahn, Myeongwoo Kim, Changin Choi, Heejae Kim,
Narankhuu Tuvshinjargal, Seungwon Lee, Yanzi Zhang, Yuan Pei, Xiongzhan
Linghu, Jingkun Ma, Lin Chen, Yuehua Dai, Sungjoo Yoo
- Abstract summary: We present novel approaches for fast large-scale training of the BERT model.
Load balancing is imperative in distributed BERT training since its training datasets are characterized by samples with widely varying lengths.
We propose two new ideas: (1) local presorting based on dataset stratification for load balancing and (2) bucket-wise gradient clipping before allreduce.
- Score: 9.486916730173661
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Speeding up large-scale distributed training is challenging because it
requires improving several components of training, including load balancing,
communication, and the optimizer. We present novel approaches for fast
large-scale training of the BERT model that individually improve each of these
components, leading to a new level of BERT training performance. Load
balancing is imperative in distributed BERT training since its training
datasets are characterized by samples with widely varying lengths. Communication
cost, which grows with the scale of distributed training, needs to be hidden
behind useful computation. In addition, optimizers such as ADAM and LAMB need
to be carefully re-evaluated in the context of large-scale distributed
training. We propose two new ideas: (1) local presorting based on dataset
stratification for load balancing, and (2) bucket-wise gradient clipping before
allreduce, which lets us benefit both from overlapping gradient computation
with gradient synchronization and from the speedup of applying gradient
clipping ahead of the allreduce. We also re-evaluate existing optimizers via
hyperparameter optimization and adopt ADAM, which further accelerates training
by enabling larger batches than existing methods. Combined, our proposed
methods give the fastest MLPerf BERT training time of 25.1 seconds (22.3
seconds) on 1,024 NVIDIA A100 GPUs, which is 1.33x and 1.57x faster than the
other top two submissions to MLPerf v1.1 (1.13x faster than the other top
submission to MLPerf v2.0). Our implementation and evaluation results are
available in MLPerf v1.1 through v2.1.
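The two proposed ideas lend themselves to a short illustration. The NumPy sketch below is only a minimal simulation under assumed settings (the worker count, bucket size, clipping bound, and the `allreduce_mean` stand-in are illustrative, not the authors' MLPerf implementation): each worker presorts its local shard by sequence length within strata so that concurrent batches across workers contain similarly sized samples, and each gradient bucket is clipped locally before it is averaged across workers.

```python
# Minimal NumPy simulation of the two ideas described above; the worker count,
# bucket size, clipping bound, and allreduce_mean stand-in are illustrative
# assumptions, not the authors' MLPerf implementation.
import numpy as np

rng = np.random.default_rng(0)

# (1) Local presorting based on dataset stratification: each worker orders its
# local shard by sample length so that the k-th batch on every worker holds
# samples of similar length, balancing per-step work across workers.
def presort_local_shard(lengths, num_strata=4):
    order = np.argsort(lengths)                  # indices sorted by length
    strata = np.array_split(order, num_strata)   # length-homogeneous strata
    # Shuffle inside each stratum to retain randomness, keep strata order.
    return np.concatenate([rng.permutation(s) for s in strata])

shard_lengths = rng.integers(16, 512, size=32)   # token counts on one worker
print("presorted sample order:", presort_local_shard(shard_lengths))

# (2) Bucket-wise gradient clipping before allreduce: clip each gradient
# bucket locally as soon as it is ready, then average it across workers, so
# clipping can overlap with the remaining backward computation instead of
# waiting for a global gradient norm.
def clip_bucket(bucket, max_norm=1.0):
    norm = np.linalg.norm(bucket)
    return bucket * min(1.0, max_norm / (norm + 1e-6))

def allreduce_mean(per_worker_buckets):
    # Stand-in for a real allreduce: average one bucket across all workers.
    return np.mean(per_worker_buckets, axis=0)

num_workers, bucket_size = 4, 8
local_buckets = [rng.normal(size=bucket_size) for _ in range(num_workers)]
clipped = [clip_bucket(b) for b in local_buckets]  # local, before allreduce
print("synchronized bucket:", allreduce_mean(clipped))
```

Clipping per bucket presumably removes the dependence on the full-model gradient norm, which is what allows each bucket's allreduce to start as soon as that bucket's gradients are ready; the abstract does not specify the actual bucket layout or norm bound, so those values above are placeholders.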
Related papers
- Efficient Neural Network Training via Subset Pretraining [5.352839075466439]
In training neural networks, it is common practice to use partial gradients computed over batches.
The loss minimum of the training set can be expected to be well-approximated by the minima of its subsets.
Experiments confirm that results equivalent to conventional training can be reached.
arXiv Detail & Related papers (2024-10-21T21:31:12Z)
- CoMERA: Computing- and Memory-Efficient Training via Rank-Adaptive Tensor Optimization [10.319009303849109]
Training large AI models such as deep learning recommendation systems and foundation language (or multi-modal) models costs massive GPU and computing time.
CoMERA achieves end-to-end rank-adaptive tensor-compressed training via a multi-objective optimization formulation.
CoMERA is $2\times$ faster per training epoch and $9\times$ more memory-efficient than GaLore on a tested six-encoder transformer with single-batch training.
arXiv Detail & Related papers (2024-05-23T09:52:15Z)
- Distributed Adversarial Training to Robustify Deep Neural Networks at Scale [100.19539096465101]
Current deep neural networks (DNNs) are vulnerable to adversarial attacks, where adversarial perturbations to the inputs can change or manipulate classification.
To defend against such attacks, an effective approach known as adversarial training (AT) has been shown to improve model robustness.
We propose a large-batch adversarial training framework implemented over multiple machines.
arXiv Detail & Related papers (2022-06-13T15:39:43Z)
- Online Convolutional Re-parameterization [51.97831675242173]
We present online convolutional re-parameterization (OREPA), a two-stage pipeline, aiming to reduce the huge training overhead by squeezing the complex training-time block into a single convolution.
Compared with the state-of-the-art re-param models, OREPA is able to save the training-time memory cost by about 70% and accelerate the training speed by around 2x.
We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
arXiv Detail & Related papers (2022-04-02T09:50:19Z)
- Federated Dynamic Sparse Training: Computing Less, Communicating Less, Yet Learning Better [88.28293442298015]
Federated learning (FL) enables distribution of machine learning workloads from the cloud to resource-limited edge devices.
We develop, implement, and experimentally validate a novel FL framework termed Federated Dynamic Sparse Training (FedDST).
FedDST is a dynamic process that extracts and trains sparse sub-networks from the target full network.
arXiv Detail & Related papers (2021-12-18T02:26:38Z)
- MEST: Accurate and Fast Memory-Economic Sparse Training Framework on the Edge [72.16021611888165]
This paper proposes a novel Memory-Economic Sparse Training (MEST) framework targeting accurate and fast execution on edge devices.
The proposed MEST framework consists of enhancements by Elastic Mutation (EM) and Soft Memory Bound (&S).
Our results suggest that unforgettable examples can be identified in-situ even during the dynamic exploration of sparsity masks.
arXiv Detail & Related papers (2021-10-26T21:15:17Z)
- Jigsaw Clustering for Unsupervised Visual Representation Learning [68.09280490213399]
We propose a new jigsaw clustering pretext task in this paper.
Our method makes use of information from both intra- and inter-images.
It is even comparable to contrastive learning methods when only half of the training batches are used.
arXiv Detail & Related papers (2021-04-01T08:09:26Z)
- Large-Scale Training System for 100-Million Classification at Alibaba [43.58719630882661]
Extreme classification has become an essential topic for deep learning.
It is very challenging to train a deep model with millions of classes due to the memory and computation explosion in the last output layer.
We build a hybrid parallel training framework to make the training process feasible.
We also propose a novel softmax variation named KNN softmax, which reduces both GPU memory consumption and computation cost.
arXiv Detail & Related papers (2021-02-09T06:53:31Z)
- Progressively Stacking 2.0: A Multi-stage Layerwise Training Method for BERT Training Speedup [13.50984315473865]
We propose an efficient multi-stage layerwise training (MSLT) approach to reduce the training time of BERT.
In the proposed training strategy, only the top few layers participate in backward computation, while most layers participate only in forward computation (a minimal layer-freezing sketch in Python follows this list).
Experimental results show that the proposed method can achieve more than 110% training speedup without significant performance degradation.
arXiv Detail & Related papers (2020-11-27T10:00:22Z)
- Accelerated Large Batch Optimization of BERT Pretraining in 54 minutes [9.213729275749452]
We propose an accelerated gradient method called LANS to improve the efficiency of using large mini-batches for training.
It takes 54 minutes on 192 AWS EC2 P3dn.24xlarge instances to achieve a target F1 score of 90.5 or higher on SQuAD v1.1, achieving the fastest BERT training time in the cloud.
arXiv Detail & Related papers (2020-06-24T05:00:41Z)
- Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)
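As referenced in the Progressively Stacking 2.0 entry above, the multi-stage layerwise idea (backward computation only for the top few layers) can be approximated with plain parameter freezing in PyTorch. The sketch below is a toy setup under assumed choices (6 encoder layers, 2 trainable top layers, random data), not the MSLT implementation from that paper.

```python
# Toy PyTorch sketch of layerwise training: lower encoder layers run forward
# only, and gradients are computed just for the top layers and the head.
# Layer counts, dimensions, and data are illustrative assumptions.
import torch
import torch.nn as nn

encoder = nn.Sequential(*[
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
    for _ in range(6)
])
head = nn.Linear(64, 2)

trainable_top = 2                                # top layers kept trainable
for i, layer in enumerate(encoder):
    keep = i >= len(encoder) - trainable_top
    for p in layer.parameters():
        p.requires_grad = keep                   # freeze the lower layers

params = [p for p in list(encoder.parameters()) + list(head.parameters())
          if p.requires_grad]
optimizer = torch.optim.Adam(params, lr=1e-4)

x = torch.randn(8, 16, 64)                       # (batch, seq_len, hidden)
labels = torch.randint(0, 2, (8,))
logits = head(encoder(x)[:, 0])                  # classify from first token
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()                                  # grads stop at frozen layers
optimizer.step()
print("loss:", float(loss))
```

Because the frozen layers sit below the trainable ones and the input does not require gradients, autograd stops at the lowest trainable layer, which is what saves most of the backward computation in this kind of scheme.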