Optimizing Distributed Training on Frontier for Large Language Models
- URL: http://arxiv.org/abs/2312.12705v2
- Date: Thu, 21 Dec 2023 22:06:04 GMT
- Title: Optimizing Distributed Training on Frontier for Large Language Models
- Authors: Sajal Dash, Isaac Lyngaas, Junqi Yin, Xiao Wang, Romain Egele, Guojing
Cong, Feiyi Wang, Prasanna Balaprakash
- Abstract summary: Training large language models (LLMs) with billions of parameters poses significant challenges and requires considerable computational resources.
This research explores efficient distributed training strategies to extract this computation from Frontier, the world's first exascale supercomputer.
- Score: 7.251642875697334
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have demonstrated remarkable success as
foundational models, benefiting various downstream applications through
fine-tuning. Recent studies on loss scaling have demonstrated the superior
performance of larger LLMs compared to their smaller counterparts.
Nevertheless, training LLMs with billions of parameters poses significant
challenges and requires considerable computational resources. For example,
training a one trillion parameter GPT-style model on 20 trillion tokens
requires a staggering 120 million exaflops of computation. This research
explores efficient distributed training strategies to extract this computation
from Frontier, the world's first exascale supercomputer dedicated to open
science. We enable and investigate various model and data parallel training
techniques, such as tensor parallelism, pipeline parallelism, and sharded data
parallelism, to facilitate training a trillion-parameter model on Frontier. We
empirically assess these techniques and their associated parameters to
determine their impact on memory footprint, communication latency, and GPU
computational efficiency. We analyze the complex interplay among these
techniques and find a strategy to combine them to achieve high throughput
through hyperparameter tuning. We have identified efficient strategies for
training large LLMs of varying sizes through empirical analysis and
hyperparameter tuning. For 22 Billion, 175 Billion, and 1 Trillion parameters,
we achieved GPU throughputs of $38.38\%$, $36.14\%$, and $31.96\%$,
respectively. For the training of the 175 Billion parameter model and the 1
Trillion parameter model, we achieved $100\%$ weak scaling efficiency on 1024
and 3072 MI250X GPUs, respectively. We also achieved strong scaling
efficiencies of $89\%$ and $87\%$ for these two models.
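A quick way to sanity-check the abstract's headline compute figure is the common 6ND FLOP estimate for dense transformer training; the sketch below uses that approximation (an assumption of this note, not a formula stated in the paper) to recover the 120 million exaflops number.

```python
# Back-of-the-envelope check of the abstract's compute figure using the
# common 6*N*D FLOP estimate for dense transformer training (an assumption
# of this sketch, not a formula from the paper).

params = 1.0e12          # one trillion parameters
tokens = 20.0e12         # 20 trillion training tokens

total_flops = 6 * params * tokens     # ~1.2e26 FLOPs
exaflops = total_flops / 1.0e18       # 1 exaFLOP = 1e18 FLOPs

print(f"Total training compute: {total_flops:.2e} FLOPs "
      f"(~{exaflops / 1e6:.0f} million exaFLOPs)")
# -> Total training compute: 1.20e+26 FLOPs (~120 million exaFLOPs)
```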
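The tensor, pipeline, and sharded data parallel degrees investigated in the paper have to multiply out to the total GPU count; the minimal sketch below illustrates that bookkeeping with hypothetical degrees (TP=8, PP=48, and the batch settings are illustrative placeholders, not the paper's tuned configuration).

```python
# Illustrative bookkeeping for combining tensor (TP), pipeline (PP), and
# sharded data parallelism (DP). All degrees below are hypothetical examples,
# not the tuned configuration reported in the paper.

def data_parallel_degree(world_size: int, tp: int, pp: int) -> int:
    """Remaining data-parallel factor once TP and PP degrees are fixed."""
    assert world_size % (tp * pp) == 0, "TP*PP must divide the GPU count"
    return world_size // (tp * pp)

world_size = 3072          # GPUs used for the 1T-parameter run
tp, pp = 8, 48             # hypothetical tensor / pipeline degrees
dp = data_parallel_degree(world_size, tp, pp)

# Global batch size = micro-batch * gradient-accumulation steps * DP degree
micro_batch, grad_accum = 1, 64            # illustrative values
global_batch = micro_batch * grad_accum * dp

print(f"TP={tp}, PP={pp}, DP={dp}, global batch={global_batch} sequences")
```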
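For reference, weak- and strong-scaling efficiency are conventionally computed from per-GPU throughput and iteration time; the sketch below shows those standard definitions with placeholder numbers, not measurements from the paper.

```python
# Conventional definitions of weak- and strong-scaling efficiency.
# The inputs are placeholders, not measurements from the paper.

def weak_scaling_efficiency(thpt_base, thpt_scaled, gpus_base, gpus_scaled):
    """Per-GPU throughput retained when the problem grows with GPU count."""
    return (thpt_scaled / gpus_scaled) / (thpt_base / gpus_base)

def strong_scaling_efficiency(time_base, time_scaled, gpus_base, gpus_scaled):
    """Measured speedup relative to ideal linear speedup at fixed problem size."""
    ideal_speedup = gpus_scaled / gpus_base
    return (time_base / time_scaled) / ideal_speedup

# Placeholder example: doubling GPUs doubles throughput (weak scaling) and
# cuts iteration time from 10 s to 5.6 s (strong scaling).
print(f"weak: {weak_scaling_efficiency(1000.0, 2000.0, 1024, 2048):.0%}, "
      f"strong: {strong_scaling_efficiency(10.0, 5.6, 1024, 2048):.0%}")
# -> weak: 100%, strong: 89%
```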
Related papers
- Training Compute-Optimal Protein Language Models [48.79416103951816] (arXiv, 2024-11-04)
Most protein language models are trained with extensive compute resources until performance gains plateau.
Our investigation is grounded in a massive dataset consisting of 939 million protein sequences.
We trained over 300 models ranging from 3.5 million to 10.7 billion parameters on 5 to 200 billion unique tokens.
- Are Protein Language Models Compute Optimal? [0.0] (arXiv, 2024-06-11)
We investigate the optimal ratio between model parameters and training tokens within a fixed compute budget.
Our study reveals that pLM sizes scale sublinearly with compute budget, showing diminishing returns in performance as model size increases.
This work paves the way towards more compute-efficient pLMs, democratizing their training and practical application in computational biology.
- Time-, Memory- and Parameter-Efficient Visual Adaptation [75.28557015773217] (arXiv, 2024-02-05)
We propose an adaptation method which does not backpropagate gradients through the backbone.
We achieve this by designing a lightweight network that operates in parallel on features from the frozen, pretrained backbone.
- Scaling Laws for Sparsely-Connected Foundation Models [70.41266138010657] (arXiv, 2023-09-15)
We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets.
We identify the first scaling law describing the relationship between weight sparsity, the number of non-zero parameters, and the amount of training data.
- A 4D Hybrid Algorithm to Scale Parallel Training to Thousands of GPUs [1.7481226034111275] (arXiv, 2023-05-22)
This paper introduces a four-dimensional (4D) approach to optimize communication in parallel training.
AxoNN surpasses Megatron-LM, a state-of-the-art framework, by a significant 26%.
It achieves 57% of the theoretical peak FLOP/s, or 182 PFLOP/s in total.
- Persia: A Hybrid System Scaling Deep Learning Based Recommenders up to 100 Trillion Parameters [36.1028179125367] (arXiv, 2021-11-10)
Deep learning models have dominated the current landscape of production recommender systems.
Recent years have witnessed exponential growth in model scale, from Google's 2016 model with 1 billion parameters to the latest Facebook model with 12 trillion parameters.
However, training such models is challenging even within industrial-scale data centers.
- M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining [55.16088793437898] (arXiv, 2021-10-08)
Training extreme-scale models requires an enormous amount of compute and memory.
We propose a simple training strategy called "Pseudo-to-Real" for large models with a high memory footprint.
- CPM-2: Large-scale Cost-effective Pre-trained Language Models [71.59893315671997] (arXiv, 2021-06-20)
We present a suite of cost-effective techniques for the use of PLMs to deal with the efficiency issues of pre-training, fine-tuning, and inference.
We introduce knowledge inheritance to accelerate the pre-training process by exploiting existing PLMs instead of training models from scratch.
We implement a new inference toolkit, namely InfMoE, for using large-scale PLMs with limited computational resources.
- Exploring Sparse Expert Models and Beyond [51.90860155810848] (arXiv, 2021-05-31)
Mixture-of-Experts (MoE) models can achieve promising results with an outrageously large number of parameters but constant computation cost.
We propose a simple method called expert prototyping that splits experts into different prototypes and applies $k$ top-$1$ routing.
This strategy improves model quality while maintaining constant computational cost, and our further exploration of extremely large-scale models suggests that it is even more effective for training larger models.