Optimizing Distributed Training on Frontier for Large Language Models
- URL: http://arxiv.org/abs/2312.12705v2
- Date: Thu, 21 Dec 2023 22:06:04 GMT
- Title: Optimizing Distributed Training on Frontier for Large Language Models
- Authors: Sajal Dash, Isaac Lyngaas, Junqi Yin, Xiao Wang, Romain Egele, Guojing
Cong, Feiyi Wang, Prasanna Balaprakash
- Abstract summary: Training large language models (LLMs) with billions of parameters poses significant challenges and requires considerable computational resources.
This research explores efficient distributed training strategies to extract this computation from Frontier, the world's first exascale supercomputer.
- Score: 7.251642875697334
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have demonstrated remarkable success as
foundational models, benefiting various downstream applications through
fine-tuning. Recent studies on loss scaling have demonstrated the superior
performance of larger LLMs compared to their smaller counterparts.
Nevertheless, training LLMs with billions of parameters poses significant
challenges and requires considerable computational resources. For example,
training a one trillion parameter GPT-style model on 20 trillion tokens
requires a staggering 120 million exaflops of computation. This research
explores efficient distributed training strategies to extract this computation
from Frontier, the world's first exascale supercomputer dedicated to open
science. We enable and investigate various model and data parallel training
techniques, such as tensor parallelism, pipeline parallelism, and sharded data
parallelism, to facilitate training a trillion-parameter model on Frontier. We
empirically assess these techniques and their associated parameters to
determine their impact on memory footprint, communication latency, and GPU's
computational efficiency. We analyze the complex interplay among these
techniques and find a strategy to combine them to achieve high throughput
through hyperparameter tuning. We have identified efficient strategies for
training large LLMs of varying sizes through empirical analysis and
hyperparameter tuning. For 22 Billion, 175 Billion, and 1 Trillion parameters,
we achieved GPU throughputs of $38.38\%$, $36.14\%$, and $31.96\%$,
respectively. For the training of the 175 Billion parameter model and the 1
Trillion parameter model, we achieved $100\%$ weak scaling efficiency on 1024
and 3072 MI250X GPUs, respectively. We also achieved strong scaling
efficiencies of $89\%$ and $87\%$ for these two models.
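The 120-million-exaflop figure can be sanity-checked with the commonly used approximation of roughly 6 FLOPs per parameter per training token (this derivation is our assumption; the abstract states only the final number):

```python
# Back-of-the-envelope check of the abstract's compute estimate using the
# standard ~6 * N * D FLOPs approximation for training a dense transformer.
params = 1e12    # 1 trillion parameters (GPT-style model from the abstract)
tokens = 20e12   # 20 trillion training tokens

flops = 6 * params * tokens       # total training FLOPs: 1.2e26
exaflops = flops / 1e18           # 1 exaflop = 1e18 FLOPs
print(f"{exaflops / 1e6:.0f} million exaflops")  # → 120 million exaflops
```

The result matches the abstract's figure, which suggests the authors used the same rule of thumb.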
Related papers
- LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive.
Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones.
We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z) - CoLA: Compute-Efficient Pre-Training of LLMs via Low-Rank Activation [17.807249890437767]
We introduce CoLA and its memory-efficient implementation, CoLA-M.
We leverage the low-rank structure observed widely in model activations to reduce model size, boost model capacity and training efficiency.
Experiments on LLaMA models with 60 million to 7 billion parameters show that CoLA reduces the computing cost by $2\times$ and improves training throughput by $1.86\times$ while maintaining full-rank level performance.
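As a rough illustration of the general low-rank idea behind CoLA (this sketch is our assumption and does not reproduce CoLA's actual architecture, which is described in the paper), a dense $d \times d$ projection can be replaced by two rank-$r$ factors, cutting parameters and FLOPs from $d^2$ to $2dr$:

```python
import numpy as np

# Hypothetical sketch of low-rank factorization: replace a dense d x d
# projection with a rank-r down-projection followed by an up-projection.
d, r = 4096, 512
rng = np.random.default_rng(0)

A = rng.standard_normal((d, r))  # down-projection: d*r params
B = rng.standard_normal((r, d))  # up-projection:   r*d params

x = rng.standard_normal(d)
y = (x @ A) @ B                  # low-rank forward pass, output shape (d,)

full_params = d * d              # dense layer
lowrank_params = 2 * d * r       # factorized layer
print(f"parameter reduction: {full_params / lowrank_params:.1f}x")  # → 4.0x
```

With $r = d/8$ the factorization is 4x smaller; the actual rank and placement of such factors in CoLA are design choices made in the paper, not shown here.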
arXiv Detail & Related papers (2025-02-16T01:05:16Z) - Democratizing AI: Open-source Scalable LLM Training on GPU-based Supercomputers [65.35142508909892]
We present a novel four-dimensional hybrid parallel algorithm implemented in a highly scalable, portable, open-source framework called AxoNN.
We demonstrate fine-tuning of a 405-billion parameter LLM using AxoNN on Frontier.
arXiv Detail & Related papers (2025-02-12T06:05:52Z) - Are Protein Language Models Compute Optimal? [0.0]
We investigate the optimal ratio between model parameters and training tokens within a fixed compute budget.
Our study reveals that pLM sizes scale sublinearly with compute budget, showing diminishing returns in performance as model size increases.
This work paves the way towards more compute-efficient pLMs, democratizing their training and practical application in computational biology.
arXiv Detail & Related papers (2024-06-11T13:32:11Z) - Scaling Laws for Sparsely-Connected Foundation Models [70.41266138010657]
We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets.
We identify the first scaling law describing the relationship between weight sparsity, number of non-zero parameters, and amount of training data.
arXiv Detail & Related papers (2023-09-15T16:29:27Z) - M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion
Parameter Pretraining [55.16088793437898]
Training extreme-scale models requires enormous amounts of compute and memory.
We propose a simple training strategy called "Pseudo-to-Real" for large models with high memory footprints.
arXiv Detail & Related papers (2021-10-08T04:24:51Z) - CPM-2: Large-scale Cost-effective Pre-trained Language Models [71.59893315671997]
We present a suite of cost-effective techniques for the use of PLMs to deal with the efficiency issues of pre-training, fine-tuning, and inference.
We introduce knowledge inheritance to accelerate the pre-training process by exploiting existing PLMs instead of training models from scratch.
We implement a new inference toolkit, namely InfMoE, for using large-scale PLMs with limited computational resources.
arXiv Detail & Related papers (2021-06-20T15:43:54Z) - Exploring Sparse Expert Models and Beyond [51.90860155810848]
Mixture-of-Experts (MoE) models can achieve promising results with an outrageously large number of parameters but constant computation cost.
We propose a simple method called expert prototyping that splits experts into different prototypes and applies $k$ top-$1$ routing.
This strategy improves the model quality but maintains constant computational costs, and our further exploration on extremely large-scale models reflects that it is more effective in training larger models.
arXiv Detail & Related papers (2021-05-31T16:12:44Z)
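The "$k$ top-$1$ routing" idea from the expert-prototyping paper above can be sketched as follows. The grouping, shapes, and summation of group outputs here are our assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

# Hedged sketch of expert prototyping with k top-1 routing: experts are
# split into k prototype groups, and a top-1 expert is selected
# independently within each group, so k experts fire per token while the
# per-token compute stays constant.
rng = np.random.default_rng(0)
d, n_experts, k = 8, 4, 2                    # 2 prototype groups of 2 experts
experts = rng.standard_normal((n_experts, d, d))  # one weight matrix per expert
router = rng.standard_normal((d, n_experts))      # shared routing projection
groups = np.split(np.arange(n_experts), k)        # [[0, 1], [2, 3]]

def forward(x):
    logits = x @ router                  # one routing score per expert
    out = np.zeros_like(x)
    for g in groups:
        best = g[np.argmax(logits[g])]   # top-1 expert within each group
        out += x @ experts[best]         # combine the k selected experts
    return out

y = forward(rng.standard_normal(d))
print(y.shape)  # → (8,)
```

Exactly $k$ of the $n$ experts are evaluated per token regardless of $n$, which is what keeps computation constant as the expert count grows.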