Breaking MLPerf Training: A Case Study on Optimizing BERT
- URL: http://arxiv.org/abs/2402.02447v1
- Date: Sun, 4 Feb 2024 11:12:17 GMT
- Title: Breaking MLPerf Training: A Case Study on Optimizing BERT
- Authors: Yongdeok Kim, Jaehyung Ahn, Myeongwoo Kim, Changin Choi, Heejae Kim,
Narankhuu Tuvshinjargal, Seungwon Lee, Yanzi Zhang, Yuan Pei, Xiongzhan
Linghu, Jingkun Ma, Lin Chen, Yuehua Dai, Sungjoo Yoo
- Abstract summary: We present novel approaches for fast large-scale training of the BERT model.
Load balancing is imperative in distributed BERT training since its training datasets are characterized by samples with widely varying lengths.
We propose two new ideas: (1) local presorting based on dataset stratification for load balancing and (2) bucket-wise gradient clipping before allreduce.
- Score: 9.486916730173661
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Speeding up large-scale distributed training is challenging because it
requires improving several components of training, including load balancing,
communication, and the optimizer. We present novel approaches for fast
large-scale training of the BERT model that individually improve each of these
components, leading to a new level of BERT training performance. Load
balancing is imperative in distributed BERT training since its training
datasets are characterized by samples with widely varying lengths. Communication
cost, which grows with the scale of distributed training, needs to be hidden
behind useful computation. In addition, optimizers such as ADAM and LAMB need
to be carefully re-evaluated in the context of large-scale distributed
training. We propose two new ideas: (1) local presorting based on dataset
stratification for load balancing, and (2) bucket-wise gradient clipping before
allreduce, which lets us benefit both from overlapping gradient computation
with gradient synchronization and from the speedup of applying gradient
clipping ahead of the allreduce. We also re-evaluate existing optimizers via
hyperparameter optimization and adopt ADAM, which further accelerates training
by enabling larger batches than existing methods. Combined, our proposed
methods give the fastest MLPerf BERT training time of 25.1 seconds (22.3
seconds) on 1,024 NVIDIA A100 GPUs, which is 1.33x and 1.57x faster than the
other top two submissions to MLPerf v1.1 (1.13x faster than the other top
submission to MLPerf v2.0). Our implementation and evaluation results are
available in MLPerf v1.1 through v2.1.
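The two proposed ideas lend themselves to a short illustration. The NumPy sketch below is only a minimal simulation under assumed settings (the worker count, bucket size, clipping bound, and the `allreduce_mean` stand-in are illustrative, not the authors' MLPerf implementation): each worker presorts its local shard by sequence length within strata so that concurrent batches across workers contain similarly sized samples, and each gradient bucket is clipped locally before it is averaged across workers.

```python
# Minimal NumPy simulation of the two ideas described above; the worker count,
# bucket size, clipping bound, and allreduce_mean stand-in are illustrative
# assumptions, not the authors' MLPerf implementation.
import numpy as np

rng = np.random.default_rng(0)

# (1) Local presorting based on dataset stratification: each worker orders its
# local shard by sample length so that the k-th batch on every worker holds
# samples of similar length, balancing per-step work across workers.
def presort_local_shard(lengths, num_strata=4):
    order = np.argsort(lengths)                  # indices sorted by length
    strata = np.array_split(order, num_strata)   # length-homogeneous strata
    # Shuffle inside each stratum to retain randomness, keep strata order.
    return np.concatenate([rng.permutation(s) for s in strata])

shard_lengths = rng.integers(16, 512, size=32)   # token counts on one worker
print("presorted sample order:", presort_local_shard(shard_lengths))

# (2) Bucket-wise gradient clipping before allreduce: clip each gradient
# bucket locally as soon as it is ready, then average it across workers, so
# clipping can overlap with the remaining backward computation instead of
# waiting for a global gradient norm.
def clip_bucket(bucket, max_norm=1.0):
    norm = np.linalg.norm(bucket)
    return bucket * min(1.0, max_norm / (norm + 1e-6))

def allreduce_mean(per_worker_buckets):
    # Stand-in for a real allreduce: average one bucket across all workers.
    return np.mean(per_worker_buckets, axis=0)

num_workers, bucket_size = 4, 8
local_buckets = [rng.normal(size=bucket_size) for _ in range(num_workers)]
clipped = [clip_bucket(b) for b in local_buckets]  # local, before allreduce
print("synchronized bucket:", allreduce_mean(clipped))
```

Clipping per bucket presumably removes the dependence on the full-model gradient norm, which is what allows each bucket's allreduce to start as soon as that bucket's gradients are ready; the abstract does not specify the actual bucket layout or norm bound, so those values above are placeholders.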
Related papers
- Efficient Neural Network Training via Subset Pretraining [5.352839075466439]
In training neural networks, it is common practice to use partial gradients computed over batches.
The loss minimum of the training set can be expected to be well-approximated by the minima of its subsets.
Experiments confirm that results equivalent to conventional training can be reached.
arXiv Detail & Related papers (2024-10-21T21:31:12Z)
- CoMERA: Computing- and Memory-Efficient Training via Rank-Adaptive Tensor Optimization [10.319009303849109]
Training large AI models such as deep learning recommendation systems and foundation language (or multi-modal) models costs massive GPU and computing time.
CoMERA achieves end-to-end rank-adaptive tensor-compressed training via a multi-objective optimization formulation.
CoMERA is $2\times$ faster per training epoch and $9\times$ more memory-efficient than GaLore on a tested six-encoder transformer with single-batch training.
arXiv Detail & Related papers (2024-05-23T09:52:15Z)
- Distributed Adversarial Training to Robustify Deep Neural Networks at Scale [100.19539096465101]
Current deep neural networks (DNNs) are vulnerable to adversarial attacks, where adversarial perturbations to the inputs can change or manipulate classification.
To defend against such attacks, an effective approach known as adversarial training (AT) has been shown to improve model robustness.
We propose a large-batch adversarial training framework implemented over multiple machines.
arXiv Detail & Related papers (2022-06-13T15:39:43Z)
- Online Convolutional Re-parameterization [51.97831675242173]
We present online convolutional re-parameterization (OREPA), a two-stage pipeline, aiming to reduce the huge training overhead by squeezing the complex training-time block into a single convolution.
Compared with the state-of-the-art re-param models, OREPA is able to save the training-time memory cost by about 70% and accelerate the training speed by around 2x.
We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
arXiv Detail & Related papers (2022-04-02T09:50:19Z)
- Federated Dynamic Sparse Training: Computing Less, Communicating Less, Yet Learning Better [88.28293442298015]
Federated learning (FL) enables distribution of machine learning workloads from the cloud to resource-limited edge devices.
We develop, implement, and experimentally validate a novel FL framework termed Federated Dynamic Sparse Training (FedDST).
FedDST is a dynamic process that extracts and trains sparse sub-networks from the target full network.
arXiv Detail & Related papers (2021-12-18T02:26:38Z)
- MEST: Accurate and Fast Memory-Economic Sparse Training Framework on the Edge [72.16021611888165]
This paper proposes a novel Memory-Economic Sparse Training (MEST) framework targeting accurate and fast execution on edge devices.
The proposed MEST framework consists of enhancements by Elastic Mutation (EM) and Soft Memory Bound (&S).
Our results suggest that unforgettable examples can be identified in-situ even during the dynamic exploration of sparsity masks.
arXiv Detail & Related papers (2021-10-26T21:15:17Z)
- Jigsaw Clustering for Unsupervised Visual Representation Learning [68.09280490213399]
We propose a new jigsaw clustering pretext task in this paper.
Our method makes use of information from both intra- and inter-images.
It is even comparable to contrastive learning methods when only half of the training batches are used.
arXiv Detail & Related papers (2021-04-01T08:09:26Z)
- Large-Scale Training System for 100-Million Classification at Alibaba [43.58719630882661]
Extreme classification has become an essential topic for deep learning.
It is very challenging to train a deep model with millions of classes due to the memory and computation explosion in the last output layer.
We build a hybrid parallel training framework to make the training process feasible.
We also propose a novel softmax variation named KNN softmax, which reduces both GPU memory consumption and computation cost.
arXiv Detail & Related papers (2021-02-09T06:53:31Z)
- Progressively Stacking 2.0: A Multi-stage Layerwise Training Method for BERT Training Speedup [13.50984315473865]
We propose an efficient multi-stage layerwise training (MSLT) approach to reduce the training time of BERT.
In the proposed training strategy, only the top few layers participate in backward computation, while most layers participate only in forward computation (a minimal layer-freezing sketch in Python follows this list).
Experimental results show that the proposed method can achieve more than 110% training speedup without significant performance degradation.
arXiv Detail & Related papers (2020-11-27T10:00:22Z)
- Accelerated Large Batch Optimization of BERT Pretraining in 54 minutes [9.213729275749452]
We propose an accelerated gradient method called LANS to improve the efficiency of using large mini-batches for training.
It takes 54 minutes on 192 AWS EC2 P3dn.24xlarge instances to achieve a target F1 score of 90.5 or higher on SQuAD v1.1, achieving the fastest BERT training time in the cloud.
arXiv Detail & Related papers (2020-06-24T05:00:41Z)
- Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)
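As referenced in the Progressively Stacking 2.0 entry above, the multi-stage layerwise idea (backward computation only for the top few layers) can be approximated with plain parameter freezing in PyTorch. The sketch below is a toy setup under assumed choices (6 encoder layers, 2 trainable top layers, random data), not the MSLT implementation from that paper.

```python
# Toy PyTorch sketch of layerwise training: lower encoder layers run forward
# only, and gradients are computed just for the top layers and the head.
# Layer counts, dimensions, and data are illustrative assumptions.
import torch
import torch.nn as nn

encoder = nn.Sequential(*[
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
    for _ in range(6)
])
head = nn.Linear(64, 2)

trainable_top = 2                                # top layers kept trainable
for i, layer in enumerate(encoder):
    keep = i >= len(encoder) - trainable_top
    for p in layer.parameters():
        p.requires_grad = keep                   # freeze the lower layers

params = [p for p in list(encoder.parameters()) + list(head.parameters())
          if p.requires_grad]
optimizer = torch.optim.Adam(params, lr=1e-4)

x = torch.randn(8, 16, 64)                       # (batch, seq_len, hidden)
labels = torch.randint(0, 2, (8,))
logits = head(encoder(x)[:, 0])                  # classify from first token
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()                                  # grads stop at frozen layers
optimizer.step()
print("loss:", float(loss))
```

Because the frozen layers sit below the trainable ones and the input does not require gradients, autograd stops at the lowest trainable layer, which is what saves most of the backward computation in this kind of scheme.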