Efficient Large-Scale Language Model Training on GPU Clusters
- URL: http://arxiv.org/abs/2104.04473v1
- Date: Fri, 9 Apr 2021 16:43:11 GMT
- Title: Efficient Large-Scale Language Model Training on GPU Clusters
- Authors: Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley,
Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti,
Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, Matei Zaharia
- Abstract summary: Large language models have led to state-of-the-art accuracies across a range of tasks.
Memory capacity is limited, making it impossible to fit large models on a single GPU.
The number of compute operations required to train these models can result in unrealistically long training times.
- Score: 19.00915720435389
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models have led to state-of-the-art accuracies across a range
of tasks. However, training these large models efficiently is challenging for
two reasons: a) GPU memory capacity is limited, making it impossible to fit
large models on a single GPU or even on a multi-GPU server; and b) the number
of compute operations required to train these models can result in
unrealistically long training times. New methods of model parallelism such as
tensor and pipeline parallelism have been proposed to address these challenges;
unfortunately, naive usage leads to fundamental scaling issues at thousands of
GPUs due to various reasons, e.g., expensive cross-node communication or idle
periods waiting on other devices.
In this work, we show how to compose different types of parallelism methods
(tensor, pipeline, and data parallelism) to scale to thousands of GPUs,
achieving a two-order-of-magnitude increase in the sizes of models we can
efficiently train compared to existing systems. We discuss various
implementations of pipeline parallelism and propose a novel schedule that can
improve throughput by more than 10% with comparable memory footprint compared
to previously-proposed approaches. We quantitatively study the trade-offs
between tensor, pipeline, and data parallelism, and provide intuition as to how
to configure distributed training of a large model. The composition of these
techniques allows us to perform training iterations on a model with 1 trillion
parameters at 502 petaFLOP/s on 3072 GPUs with achieved per-GPU throughput of
52% of peak; previous efforts to train similar-sized models achieve much lower
throughput (36% of theoretical peak). Our code has been open-sourced at
https://github.com/nvidia/megatron-lm.
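For intuition about how these parallelism degrees compose, the sketch below (Python, with hypothetical helper names; not Megatron-LM's actual API) factors a GPU budget into tensor-, pipeline-, and data-parallel degrees and estimates the pipeline "bubble" (idle) fraction for a 1F1B-style schedule and its interleaved variant. The specific split and microbatch count are illustrative assumptions, not figures taken from the paper.

```python
# Minimal sketch (hypothetical helpers, not Megatron-LM's API): a fixed GPU
# budget is factored into tensor (t), pipeline (p), and data (d) parallel
# degrees, and the pipeline bubble is estimated for 1F1B-style schedules.

def data_parallel_degree(num_gpus: int, tensor_parallel: int, pipeline_parallel: int) -> int:
    """Return the implied data-parallel degree; t * p * d must equal num_gpus."""
    model_parallel = tensor_parallel * pipeline_parallel
    assert num_gpus % model_parallel == 0, "t * p must divide the GPU count"
    return num_gpus // model_parallel


def pipeline_bubble_fraction(pipeline_parallel: int, num_microbatches: int,
                             interleaved_chunks: int = 1) -> float:
    """Idle ('bubble') time as a fraction of ideal compute time for a
    1F1B-style schedule, (p - 1) / m, reduced by the number of interleaved
    model chunks per device when an interleaved schedule is used."""
    return (pipeline_parallel - 1) / (interleaved_chunks * num_microbatches)


if __name__ == "__main__":
    # Illustrative split of 3072 GPUs: t=8 within a node, p=64 across nodes,
    # leaving d data-parallel replicas.
    t, p, m = 8, 64, 128
    d = data_parallel_degree(3072, t, p)
    print(f"data-parallel degree: {d}")
    print(f"bubble, default schedule:        {pipeline_bubble_fraction(p, m):.3f}")
    print(f"bubble, interleaved (2 chunks):  {pipeline_bubble_fraction(p, m, 2):.3f}")
```

Roughly, the trade-off the paper studies is that a larger tensor-parallel degree adds frequent (and across nodes, expensive) all-reduce communication inside each layer, a larger pipeline-parallel degree adds bubble overhead that shrinks as the microbatch count grows, and the remaining data-parallel factor amortizes gradient synchronization.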
Related papers
- SWARM Parallelism: Training Large Models Can Be Surprisingly
Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z) - Does compressing activations help model parallel training? [64.59298055364336]
We present the first empirical study on the effectiveness of compression methods for model parallelism.
We implement and evaluate three common classes of compression algorithms.
We evaluate these methods across more than 160 settings and 8 popular datasets.
arXiv Detail & Related papers (2023-01-06T18:58:09Z) - Cramming: Training a Language Model on a Single GPU in One Day [64.18297923419627]
Recent trends in language modeling have focused on increasing performance through scaling.
We investigate the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU.
We provide evidence that even in this constrained setting, performance closely follows scaling laws observed in large-compute settings.
arXiv Detail & Related papers (2022-12-28T18:59:28Z) - Merak: An Efficient Distributed DNN Training Framework with Automated 3D
Parallelism for Giant Foundation Models [14.903847751841221]
We propose Merak, an automated 3D parallelism deep learning training framework with high resource utilization.
Merak automatically deploys with an automatic model partitioner, which uses a graph sharding algorithm on a proxy representation of the model.
Merak can speedup the training performance over the state-of-the-art 3D parallelism frameworks of models with 1.5, 2.5, 8.3, and 20 billion parameters by up to 1.42X, 1.39X, 1.43X, and 1.61X, respectively.
arXiv Detail & Related papers (2022-06-10T09:15:48Z) - Hydra: A System for Large Multi-Model Deep Learning [3.571623412954477]
We present 'model spilling', a technique aimed at models such as Transformers and CNNs to move groups of layers between DRAM and GPU memory (a rough sketch of the idea appears after this list).
We then present a set of novel techniques leveraging spilling to raise efficiency for multi-model training workloads.
Experiments with real benchmark workloads show that HYDRA is over 7x faster than regular model parallelism and over 50% faster than state-of-the-art industrial tools for pipeline parallelism.
arXiv Detail & Related papers (2021-10-16T18:13:57Z) - Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous
Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z) - M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion
Parameter Pretraining [55.16088793437898]
Training extreme-scale models requires enormous amounts of compute and memory.
We propose a simple training strategy called "Pseudo-to-Real" for large models with high memory footprint requirements.
arXiv Detail & Related papers (2021-10-08T04:24:51Z) - Maximizing Parallelism in Distributed Training for Huge Neural Networks [7.471658821614902]
We introduce 3-dimensional model parallelism for expediting the training of huge language models.
Our approach incurs lower memory and communication costs than existing state-of-the-art 1-D and 2-D model parallelism.
arXiv Detail & Related papers (2021-05-30T07:41:08Z) - ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep
Learning [9.322987670900778]
ZeRO-Infinity can fit models with tens and even hundreds of trillions of parameters for training on current generation GPU clusters.
It can be used to fine-tune trillion parameter models on a single NVIDIA DGX-2 node, making large models more accessible.
arXiv Detail & Related papers (2021-04-16T02:22:12Z) - TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale
Language Models [60.23234205219347]
TeraPipe is a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models.
We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster.
arXiv Detail & Related papers (2021-02-16T07:34:32Z) - ZeRO-Offload: Democratizing Billion-Scale Model Training [16.43347399073034]
ZeRO-Offload enables large model training by offloading data and compute to CPU.
It can train models with over 13 billion parameters on a single GPU, a 10x increase in model size compared to popular frameworks such as PyTorch.
arXiv Detail & Related papers (2021-01-18T02:11:25Z)
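As a rough illustration of the "model spilling" idea from the Hydra entry above (moving groups of layers between host DRAM and GPU memory), the following PyTorch-style sketch shows only the forward-pass data movement. The class and method names are hypothetical and this is not Hydra's actual API; a real system would also coordinate spilling with the backward pass and with scheduling across multiple models.

```python
# Hypothetical sketch of "model spilling": layer groups live in host DRAM and
# only the group currently executing is resident in GPU memory.
# Names are illustrative, not Hydra's API.
import torch
import torch.nn as nn


class SpilledSequential(nn.Module):
    """Runs layer groups in order, copying each group to the GPU just before
    it executes and back to CPU DRAM right after (forward pass only)."""

    def __init__(self, groups, device=None):
        super().__init__()
        self.groups = nn.ModuleList([g.to("cpu") for g in groups])
        self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")

    def forward(self, x):
        x = x.to(self.device)
        for group in self.groups:
            group.to(self.device)   # spill in: DRAM -> GPU memory
            x = group(x)
            group.to("cpu")         # spill out: GPU memory -> DRAM
        return x


if __name__ == "__main__":
    blocks = [nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(8)]
    model = SpilledSequential(blocks)
    print(model(torch.randn(4, 1024)).shape)  # torch.Size([4, 1024])
```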