Training Large Language Models Efficiently with Sparsity and Dataflow
- URL: http://arxiv.org/abs/2304.05511v1
- Date: Tue, 11 Apr 2023 21:37:13 GMT
- Title: Training Large Language Models Efficiently with Sparsity and Dataflow
- Authors: Venkat Srinivasan, Darshan Gandhi, Urmish Thakker and Raghu Prabhakar
- Abstract summary: This paper demonstrates an end-to-end training flow on a large language model - a 13-billion-parameter GPT - using sparsity and dataflow.
We show that we can successfully train GPT 13B to the same quality as the dense GPT 13B model, while achieving an end-to-end speedup of 4.5x over a dense A100 baseline.
- Score: 3.1780195670658378
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large foundation language models have shown their versatility, being adaptable to a wide variety of downstream tasks such as text generation, sentiment analysis, and semantic search. However, training such large foundation models is a non-trivial exercise that requires a significant amount of compute power and expertise from machine learning and systems experts. As models get larger, these demands only increase. Sparsity is a promising technique for relieving the compute requirements of training. However, sparsity introduces new challenges in training the sparse model to the same quality as its dense counterpart. Furthermore, sparsity lowers the operation intensity and introduces irregular memory access patterns that make it challenging to utilize compute resources efficiently. This paper demonstrates an end-to-end training flow for a large language model - a 13-billion-parameter GPT - using sparsity and dataflow. The dataflow execution model and architecture enable efficient on-chip irregular memory accesses as well as native kernel fusion and pipelined parallelism, which help recover device utilization. We show that we can successfully train GPT 13B to the same quality as the dense GPT 13B model, while achieving an end-to-end speedup of 4.5x over a dense A100 baseline.
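To make the sparse-training idea above concrete, here is a minimal PyTorch sketch that emulates unstructured weight sparsity with a fixed binary mask. It is an illustration under assumed details (random mask selection, a 75% sparsity level, a single linear layer), not the paper's method: on a dense GPU the mask only simulates sparsity, whereas the speedup reported above comes from a dataflow architecture that can exploit the zeros and irregular access patterns directly.

```python
# Minimal sketch of unstructured weight sparsity during training (assumed
# details: random mask, 75% sparsity). Masking on a dense device only
# simulates sparsity; it does not recover the compute savings by itself.
import torch
import torch.nn as nn

class MaskedLinear(nn.Linear):
    """Linear layer whose weights are pruned by a fixed unstructured mask."""

    def __init__(self, in_features, out_features, sparsity=0.75):
        super().__init__(in_features, out_features)
        # Zero out a `sparsity` fraction of weights at random (illustrative;
        # real systems pick masks by magnitude or other criteria).
        mask = (torch.rand_like(self.weight) > sparsity).float()
        self.register_buffer("mask", mask)

    def forward(self, x):
        # Re-apply the mask on every step so pruned weights stay at zero
        # even after optimizer updates.
        return nn.functional.linear(x, self.weight * self.mask, self.bias)

# Usage: swap a dense projection in a transformer block for a masked one.
layer = MaskedLinear(4096, 4096, sparsity=0.75)
x = torch.randn(8, 4096)
y = layer(x)            # forward pass uses only the unmasked weights
loss = y.pow(2).mean()
loss.backward()         # masked weights receive zero gradient
```

In practice the mask would be chosen by a criterion such as weight magnitude and applied across the transformer's linear projections; the point here is only that pruned weights receive zero gradient and stay pruned throughout training.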
Related papers
- Pretraining Billion-scale Geospatial Foundational Models on Frontier [0.16492989697868893]
Foundation Models (FMs) are trained with internet-scale unlabeled data via self-supervised learning.
We investigate billion-scale FMs and HPC training profiles for geospatial applications by pretraining on publicly available data.
Our larger 3B-parameter model achieves up to a 30% improvement in top-1 scene classification accuracy.
arXiv Detail & Related papers (2024-04-17T19:16:32Z)
- Diffusion-Based Neural Network Weights Generation [80.89706112736353]
D2NWG is a diffusion-based neural network weights generation technique that efficiently produces high-performing weights for transfer learning.
Our method extends generative hyper-representation learning to recast the latent diffusion paradigm for neural network weights generation.
Our approach is scalable to large architectures such as large language models (LLMs), overcoming the limitations of current parameter generation techniques.
arXiv Detail & Related papers (2024-02-28T08:34:23Z)
- Efficient Parallelization Layouts for Large-Scale Distributed Model Training [17.16249954009967]
We conduct a comprehensive study of possible training configurations for large language models.
We find that using a micro-batch size of 1 usually enables the most efficient training layouts.
Our most efficient configurations enable us to achieve state-of-the-art training efficiency results over a range of model sizes.
arXiv Detail & Related papers (2023-11-09T18:59:38Z)
- Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST).
IST is a recently proposed and highly effective technique for distributed training of large models.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z)
- SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models [4.114555639014612]
We show the benefits of using unstructured weight sparsity to train only a subset of weights during pre-training.
We demonstrate that we can induce up to 75% sparsity into a 1.3B parameter GPT-3 XL model resulting in a 2.5x reduction in pre-training FLOPs.
arXiv Detail & Related papers (2023-03-18T17:56:01Z)
- SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z)
- Dive into Big Model Training [6.809653573125388]
Training objectives describe how to leverage web-scale data to develop extremely capable and incredibly large models.
Training methodologies based on distributed training describe how to make big-model training a reality.
arXiv Detail & Related papers (2022-07-25T05:38:39Z)
- GLaM: Efficient Scaling of Language Models with Mixture-of-Experts [84.33607245023049]
We propose and develop a family of language models named GLaM (Generalist Language Model).
GLaM uses a sparsely activated mixture-of-experts architecture to scale model capacity while incurring substantially less training cost than dense variants.
It consumes only 1/3 of the energy used to train GPT-3 and requires half the FLOPs for inference, while still achieving better overall zero-shot and one-shot performance across 29 NLP tasks (a toy sketch of this style of sparse expert routing appears after this list).
arXiv Detail & Related papers (2021-12-13T18:58:19Z)
- Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning [18.932100477957462]
Recent work like GPT-3 has demonstrated excellent zero-shot and few-shot performance on many natural language processing (NLP) tasks.
We propose a method that incorporates large-scale distributed training performance into model architecture design.
arXiv Detail & Related papers (2021-10-10T07:40:22Z)
- M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining [55.16088793437898]
Training extreme-scale models requires enormous amounts of compute and memory.
We propose a simple training strategy called "Pseudo-to-Real" for large models with high memory footprints.
arXiv Detail & Related papers (2021-10-08T04:24:51Z)
- Scalable and Efficient MoE Training for Multitask Multilingual Models [55.987536562357086]
We develop a system capable of scaling MoE models efficiently to trillions of parameters.
We also present new training methods to improve MoE sample efficiency and leverage an expert pruning strategy to improve time efficiency.
A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks.
arXiv Detail & Related papers (2021-09-22T00:57:46Z)
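Several of the related papers above (GLaM, Scalable and Efficient MoE Training) rely on sparsely activated mixture-of-experts layers, a different flavor of sparsity from the unstructured weight sparsity used in the main paper. The sketch below is a single-device, top-2 routing toy in PyTorch under assumed dimensions and expert counts; it omits the expert parallelism, load-balancing losses, capacity limits, and expert pruning that these systems actually use.

```python
# Toy sketch of a sparsely activated mixture-of-experts (MoE) layer with
# top-2 routing (assumed sizes; single device; no load balancing).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopTwoMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                      # x: (tokens, d_model)
        gate_logits = self.router(x)           # (tokens, num_experts)
        weights, experts = gate_logits.topk(2, dim=-1)
        weights = F.softmax(weights, dim=-1)   # mix only the two chosen experts
        out = torch.zeros_like(x)
        for slot in range(2):                  # dispatch tokens to their chosen experts
            for e in range(len(self.experts)):
                idx = (experts[:, slot] == e).nonzero(as_tuple=True)[0]
                if idx.numel():
                    out[idx] += weights[idx, slot, None] * self.experts[e](x[idx])
        return out                             # each token runs only 2 of the 8 expert FFNs

moe = TopTwoMoE()
tokens = torch.randn(16, 512)
print(moe(tokens).shape)                       # torch.Size([16, 512])
```

The design point these papers exploit is that total parameter count grows with the number of experts while per-token compute stays roughly constant, since each token activates only a small, fixed number of experts.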