Hardware Scaling Trends and Diminishing Returns in Large-Scale Distributed Training
- URL: http://arxiv.org/abs/2411.13055v1
- Date: Wed, 20 Nov 2024 06:05:11 GMT
- Title: Hardware Scaling Trends and Diminishing Returns in Large-Scale Distributed Training
- Authors: Jared Fernandez, Luca Wehrstedt, Leonid Shamis, Mostafa Elhoushi, Kalyan Saladi, Yonatan Bisk, Emma Strubell, Jacob Kahn,
- Abstract summary: We show that careful consideration of hardware configuration and parallelization strategy is critical for effective scaling of model size, training data, and total computation.
We conduct an extensive empirical study of the performance of large-scale LLM training workloads across model size, hardware configurations, and distributed parallelization strategies.
- Score: 29.44470664154098
- License:
- Abstract: Dramatic increases in the capabilities of neural network models in recent years are driven by scaling model size, training data, and corresponding computational resources. To develop the exceedingly large networks required in modern applications, such as large language models (LLMs), model training is distributed across tens of thousands of hardware accelerators (e.g. GPUs), requiring orchestration of computation and communication across large computing clusters. In this work, we demonstrate that careful consideration of hardware configuration and parallelization strategy is critical for effective (i.e. compute- and cost-efficient) scaling of model size, training data, and total computation. We conduct an extensive empirical study of the performance of large-scale LLM training workloads across model size, hardware configurations, and distributed parallelization strategies. We demonstrate that: (1) beyond certain scales, overhead incurred from certain distributed communication strategies leads parallelization strategies previously thought to be sub-optimal in fact become preferable; and (2) scaling the total number of accelerators for large model training quickly yields diminishing returns even when hardware and parallelization strategies are properly optimized, implying poor marginal performance per additional unit of power or GPU-hour.
Related papers
- Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference [2.2231908139555734]
We propose a general performance modeling methodology and workload analysis of distributed LLM training and inference.
We validate our performance predictions with published data from literature and relevant industry vendors (e.g., NVIDIA)
arXiv Detail & Related papers (2024-07-19T19:49:05Z) - Partitioned Neural Network Training via Synthetic Intermediate Labels [0.0]
GPU memory constraints have become a notable bottleneck in training such sizable models.
This study advocates partitioning the model across GPU and generating synthetic intermediate labels to train individual segments.
This approach results in a more efficient training process that minimizes data communication while maintaining model accuracy.
arXiv Detail & Related papers (2024-03-17T13:06:29Z) - Efficient Parallelization Layouts for Large-Scale Distributed Model Training [17.16249954009967]
We conduct a comprehensive study of possible training configurations for large language models.
We find that using a micro-batch size of 1 usually enables the most efficient training layouts.
Our most efficient configurations enable us to achieve state-of-the-art training efficiency results over a range of model sizes.
arXiv Detail & Related papers (2023-11-09T18:59:38Z) - MAD Max Beyond Single-Node: Enabling Large Machine Learning Model Acceleration on Distributed Systems [6.8519529064678375]
Training and deploying large-scale machine learning models is time-consuming, requires significant distributed computing infrastructures, and incurs high operational costs.
To minimize this outstanding communication latency and other inherent at-scale inefficiencies, we introduce an agile performance modeling framework, MAD-Max.
This framework is designed to optimize parallelization strategies and facilitate hardware-software co-design opportunities.
arXiv Detail & Related papers (2023-10-04T13:00:53Z) - Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST)
IST is a recently proposed and highly effective technique for solving the aforementioned problems.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z) - On Efficient Training of Large-Scale Deep Learning Models: A Literature
Review [90.87691246153612]
The field of deep learning has witnessed significant progress, particularly in computer vision (CV), natural language processing (NLP), and speech.
The use of large-scale models trained on vast amounts of data holds immense promise for practical applications.
With the increasing demands on computational capacity, a comprehensive summarization on acceleration techniques of training deep learning models is still much anticipated.
arXiv Detail & Related papers (2023-04-07T11:13:23Z) - Partitioning Distributed Compute Jobs with Reinforcement Learning and
Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z) - DANCE: DAta-Network Co-optimization for Efficient Segmentation Model
Training and Inference [85.02494022662505]
DANCE is an automated simultaneous data-network co-optimization for efficient segmentation model training and inference.
It integrates automated data slimming which adaptively downsamples/drops input images and controls their corresponding contribution to the training loss guided by the images' spatial complexity.
Experiments and ablating studies demonstrate that DANCE can achieve "all-win" towards efficient segmentation.
arXiv Detail & Related papers (2021-07-16T04:58:58Z) - Parallel Training of Deep Networks with Local Updates [84.30918922367442]
Local parallelism is a framework which parallelizes training of individual layers in deep networks by replacing global backpropagation with truncated layer-wise backpropagation.
We show results in both vision and language domains across a diverse set of architectures, and find that local parallelism is particularly effective in the high-compute regime.
arXiv Detail & Related papers (2020-12-07T16:38:45Z) - Understanding the Effects of Data Parallelism and Sparsity on Neural
Network Training [126.49572353148262]
We study two factors in neural network training: data parallelism and sparsity.
Despite their promising benefits, understanding of their effects on neural network training remains elusive.
arXiv Detail & Related papers (2020-03-25T10:49:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.