Elixir: Train a Large Language Model on a Small GPU Cluster
- URL: http://arxiv.org/abs/2212.05339v3
- Date: Wed, 31 May 2023 13:56:53 GMT
- Title: Elixir: Train a Large Language Model on a Small GPU Cluster
- Authors: Haichen Huang and Jiarui Fang and Hongxin Liu and Shenggui Li and Yang You
- Abstract summary: Large language models have achieved great success due to their unprecedented size.
Elixir automates efficient large-model training based on pre-runtime model profiling.
Elixir significantly outperforms the current state-of-the-art baseline.
- Score: 6.578131399847817
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, large language models have achieved great success due to
their unprecedented size. However, training these models poses a challenge for
most researchers as it requires a substantial number of GPUs. To reduce GPU
memory usage, memory partitioning and memory offloading have been proposed.
These approaches eliminate memory redundancies and offload memory usage to the
CPU and NVMe memory, respectively, enabling training on small GPU clusters.
However, directly deploying these solutions often leads to suboptimal
efficiency. Only experienced experts can unleash the full potential of hardware
by carefully tuning the distributed configuration. Thus, we present a novel
solution, Elixir, which automates efficient large-model training based on
pre-runtime model profiling. Elixir aims to identify the optimal combination of
partitioning and offloading techniques to maximize training throughput. In our
experiments, Elixir significantly outperforms the current state-of-the-art
baseline. Our optimal configuration achieves up to a 3.4$\times$ speedup on
GPT-2 models compared with SOTA solutions. We hope that our work will benefit
individuals who lack computing resources and expertise, granting them access to
large models. The beta version of Elixir is now available at
https://github.com/hpcaitech/ColossalAI/tree/feature/elixir.
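As a rough illustration of the kind of decision Elixir automates, the sketch below profiles a hypothetical GPT-2-scale model's memory footprint and searches a tiny space of partitioning/offloading settings for the feasible configuration with the best estimated throughput. The parameter count, memory budget, penalties, and the `Config` fields are all invented for illustration; this is not Elixir's actual API or cost model.

```python
from dataclasses import dataclass
from itertools import product

# All numbers below are invented for illustration (roughly GPT-2 scale);
# none of this is Elixir's real interface.
PARAMS = 1.5e9          # parameter count
GPU_MEM = 16e9          # usable bytes of GPU memory per device
N_GPUS = 4              # devices in the small cluster
OFFLOAD_PENALTY = 0.35  # assumed throughput loss from CPU offloading
SHARD_PENALTY = 0.10    # assumed throughput loss from extra collectives

@dataclass
class Config:
    shard_states: bool    # partition optimizer states across GPUs (ZeRO-style)
    offload_states: bool  # move optimizer states to CPU memory

def peak_gpu_bytes(cfg: Config) -> float:
    """Rough mixed-precision footprint: 4 B/param for fp16 weights + grads,
    12 B/param for fp32 master weights and Adam moments (activations ignored)."""
    weights_and_grads = 4 * PARAMS
    optim_states = 12 * PARAMS
    if cfg.offload_states:
        optim_states = 0              # held in CPU memory instead
    elif cfg.shard_states:
        optim_states /= N_GPUS        # each GPU keeps only 1/N of the states
    return weights_and_grads + optim_states

def estimated_throughput(cfg: Config) -> float:
    """Crude relative-throughput model of the offload/shard trade-off."""
    t = 1.0
    if cfg.offload_states:
        t *= 1.0 - OFFLOAD_PENALTY
    if cfg.shard_states:
        t *= 1.0 - SHARD_PENALTY
    return t

candidates = [Config(shard, offload) for shard, offload in product([False, True], repeat=2)]
feasible = [c for c in candidates if peak_gpu_bytes(c) <= GPU_MEM]
best = max(feasible, key=estimated_throughput)
print(best, f"-> {peak_gpu_bytes(best) / 1e9:.1f} GB per GPU")
```

With these made-up numbers the search prefers partitioning alone, since everything fits without paying the CPU-transfer penalty; shrink GPU_MEM and it falls back to offloading, which is exactly the kind of trade-off the abstract says only experienced experts tune well by hand.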
Related papers
- Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment [56.44025052765861]
Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks.
We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs.
We show a total speedup on CPUs for sparse-quantized LLaMA models of up to 8.6x.
arXiv Detail & Related papers (2024-05-06T16:03:32Z)
- AI and Memory Wall [81.06494558184049]
We show how memory bandwidth can become the dominant bottleneck for decoder models; a back-of-envelope arithmetic-intensity sketch follows this list.
We argue for a redesign of model architectures, training, and deployment strategies to overcome this memory limitation.
arXiv Detail & Related papers (2024-03-21T04:31:59Z)
- FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU [89.2451963569343]
FlexGen is a generation engine for running large language model (LLM) inference on a single commodity GPU.
When running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems.
On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours.
arXiv Detail & Related papers (2023-03-13T05:19:28Z)
- Cramming: Training a Language Model on a Single GPU in One Day [64.18297923419627]
Recent trends in language modeling have focused on increasing performance through scaling.
We investigate the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU.
We provide evidence that even in this constrained setting, performance closely follows scaling laws observed in large-compute settings.
arXiv Detail & Related papers (2022-12-28T18:59:28Z)
- An Analysis of Collocation on GPUs for Deep Learning Training [0.0]
Multi-Instance GPU (MIG) is a new technology introduced by NVIDIA that can partition a GPU to better fit workloads.
In this paper, we examine the performance of a MIG-enabled A100 GPU under deep learning workloads containing various sizes and combinations of models.
arXiv Detail & Related papers (2022-09-13T14:13:06Z)
- On-Device Training Under 256KB Memory [62.95579393237751]
We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory.
Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB SRAM and 1MB Flash.
arXiv Detail & Related papers (2022-06-30T17:59:08Z)
- M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining [55.16088793437898]
Training extreme-scale models requires enormous amounts of compute and memory.
We propose a simple training strategy called "Pseudo-to-Real" for large models with high memory footprints.
arXiv Detail & Related papers (2021-10-08T04:24:51Z)
- PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management [19.341284825473558]
Pre-trained models (PTMs) are revolutionizing artificial intelligence (AI) technology.
A PTM learns general language features from vast amounts of text and is then fine-tuned on a task-specific dataset.
PatrickStar reduces memory requirements of computing platforms by using heterogeneous memory space.
arXiv Detail & Related papers (2021-08-12T15:58:12Z)
- ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning [9.322987670900778]
ZeRO-Infinity can fit models with tens and even hundreds of trillions of parameters for training on current generation GPU clusters.
It can be used to fine-tune trillion parameter models on a single NVIDIA DGX-2 node, making large models more accessible.
arXiv Detail & Related papers (2021-04-16T02:22:12Z)
- Efficient Large-Scale Language Model Training on GPU Clusters [19.00915720435389]
Large language models have led to state-of-the-art accuracies across a range of tasks.
Memory capacity is limited, making it impossible to fit large models on a single GPU.
The number of compute operations required to train these models can result in unrealistically long training times.
arXiv Detail & Related papers (2021-04-09T16:43:11Z)
- ZeRO-Offload: Democratizing Billion-Scale Model Training [16.43347399073034]
ZeRO-Offload enables large model training by offloading data and compute to the CPU.
It can train models with over 13 billion parameters on a single GPU, a 10x increase in size compared to popular frameworks such as PyTorch; a back-of-envelope memory estimate follows this list.
arXiv Detail & Related papers (2021-01-18T02:11:25Z)
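For the "AI and Memory Wall" entry above, the following estimate (an illustration, not a result from the paper) compares the arithmetic intensity of batch-1 autoregressive decoding with an approximate GPU compute-to-bandwidth ratio; the parameter count and hardware figures are assumptions, roughly matching published A100 specs, chosen only to make the gap concrete.

```python
# Back-of-envelope arithmetic-intensity estimate for batch-1 autoregressive
# decoding. Parameter count and GPU figures are illustrative assumptions
# (approximate A100 specs), not numbers from the paper.
params = 13e9                  # example decoder parameter count
flops_per_token = 2 * params   # ~one multiply-add per parameter per token
bytes_per_token = 2 * params   # every fp16 weight is read once per token

kernel_intensity = flops_per_token / bytes_per_token  # ~1 FLOP/byte
peak_flops = 312e12            # approx. fp16 tensor-core throughput, FLOP/s
peak_bandwidth = 2e12          # approx. HBM bandwidth, bytes/s
machine_balance = peak_flops / peak_bandwidth          # ~156 FLOP/byte

print(f"decode intensity: {kernel_intensity:.0f} FLOP/byte")
print(f"machine balance:  {machine_balance:.0f} FLOP/byte")
# 1 FLOP/byte is far below the ~156 FLOP/byte the GPU needs to stay busy,
# so small-batch decoding is bandwidth-bound -- the "memory wall".
```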
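For the ZeRO-Offload entry, a similar back-of-envelope sketch (again an illustration, not taken from the paper) applies the commonly cited 16-bytes-per-parameter accounting for mixed-precision Adam to a 13-billion-parameter model, which shows why most of the training state must be partitioned or offloaded.

```python
# Back-of-envelope memory accounting for mixed-precision Adam training,
# using the commonly cited 16 bytes per parameter (activations excluded).
params = 13e9  # the 13-billion-parameter model size reported by ZeRO-Offload

state_bytes_per_param = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam momentum": 4,
    "fp32 Adam variance": 4,
}

total_gb = params * sum(state_bytes_per_param.values()) / 1e9
fp16_weights_gb = params * state_bytes_per_param["fp16 weights"] / 1e9

print(f"full training state:   {total_gb:.0f} GB")        # ~208 GB
print(f"fp16 weights alone:    {fp16_weights_gb:.0f} GB")  # ~26 GB
# No single GPU holds 208 GB, so the fp32 optimizer state (and often the
# gradients) must be partitioned across devices or offloaded to CPU/NVMe.
```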