ZeRO-Offload: Democratizing Billion-Scale Model Training
- URL: http://arxiv.org/abs/2101.06840v1
- Date: Mon, 18 Jan 2021 02:11:25 GMT
- Title: ZeRO-Offload: Democratizing Billion-Scale Model Training
- Authors: Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase,
Shuangyan Yang, Minjia Zhang, Dong Li, Yuxiong He
- Abstract summary: ZeRO-Offload enables large model training by offloading data and compute to CPU.
It can train models with over 13 billion parameters on a single GPU, a 10x increase in size compared to popular framework such as PyTorch.
- Score: 16.43347399073034
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale model training has been a playing ground for a limited few
requiring complex model refactoring and access to prohibitively expensive GPU
clusters. ZeRO-Offload changes the large model training landscape by making
large model training accessible to nearly everyone. It can train models with
over 13 billion parameters on a single GPU, a 10x increase in size compared to
popular framework such as PyTorch, and it does so without requiring any model
change from the data scientists or sacrificing computational efficiency.
ZeRO-Offload enables large model training by offloading data and compute to
CPU. To preserve compute efficiency, it is designed to minimize the data
movement to/from GPU, and reduce CPU compute time while maximizing memory
savings on GPU. As a result, ZeRO-Offload can achieve 40 TFlops/GPU on a single
NVIDIA V100 GPU for 10B parameter model compared to 30TF using PyTorch alone
for a 1.4B parameter model, the largest that can be trained without running out
of memory. ZeRO-Offload is also designed to scale on multiple-GPUs when
available, offering near linear speedup on up to 128 GPUs. Additionally, it can
work together with model parallelism to train models with over 70 billion
parameters on a single DGX-2 box, a 4.5x increase in model size compared to
using model parallelism alone. By combining compute and memory efficiency with
ease-of-use, ZeRO-Offload democratizes large-scale model training making it
accessible to even data scientists with access to just a single GPU.
Related papers
- MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs [55.95879347182669]
MoE architecture is renowned for its ability to increase model capacity without a proportional increase in inference cost.
MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization.
MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB)
arXiv Detail & Related papers (2024-11-18T01:06:12Z) - TRAK: Attributing Model Behavior at Scale [79.56020040993947]
We present TRAK (Tracing with Randomly-trained After Kernel), a data attribution method that is both effective and computationally tractable for large-scale, differenti models.
arXiv Detail & Related papers (2023-03-24T17:56:22Z) - FlexGen: High-Throughput Generative Inference of Large Language Models
with a Single GPU [89.2451963569343]
FlexGen is a generation engine for running large language model (LLM) inference on a single commodity GPU.
When running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems.
On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours.
arXiv Detail & Related papers (2023-03-13T05:19:28Z) - An Analysis of Collocation on GPUs for Deep Learning Training [0.0]
Multi-Instance GPU (MIG) is a new technology introduced by NVIDIA that can partition a GPU to better-fit workloads.
In this paper, we examine the performance of a MIG-enabled A100 GPU under deep learning workloads containing various sizes and combinations of models.
arXiv Detail & Related papers (2022-09-13T14:13:06Z) - Petals: Collaborative Inference and Fine-tuning of Large Models [78.37798144357977]
Many NLP tasks benefit from using large language models (LLMs) that often have more than 100 billion parameters.
With the release of BLOOM-176B and OPT-175B, everyone can download pretrained models of this scale.
We propose Petals $-$ a system for inference and fine-tuning of large models collaboratively by joining the resources of multiple parties.
arXiv Detail & Related papers (2022-09-02T17:38:03Z) - Harmony: Overcoming the hurdles of GPU memory capacity to train massive
DNN models on commodity servers [13.620650014358413]
Deep neural networks (DNNs) have grown exponentially in complexity and size over the past decade.
One of the main challenges for researchers who might have access to only limited resources is limited memory capacity compared to model size.
arXiv Detail & Related papers (2022-02-02T22:16:27Z) - Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous
Multi-GPU Servers [65.60007071024629]
We show that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z) - M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion
Parameter Pretraining [55.16088793437898]
Training extreme-scale models requires enormous amounts of computes and memory footprint.
We propose a simple training strategy called "Pseudo-to-Real" for high-memory-footprint-required large models.
arXiv Detail & Related papers (2021-10-08T04:24:51Z) - ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep
Learning [9.322987670900778]
ZeRO-Infinity can fit models with tens and even hundreds of trillions of parameters for training on current generation GPU clusters.
It can be used to fine-tune trillion parameter models on a single NVIDIA DGX-2 node, making large models more accessible.
arXiv Detail & Related papers (2021-04-16T02:22:12Z) - Efficient Large-Scale Language Model Training on GPU Clusters [19.00915720435389]
Large language models have led to state-of-the-art accuracies across a range of tasks.
Memory capacity is limited, making it impossible to fit large models on a single GPU.
The number of compute operations required to train these models can result in unrealistically long training times.
arXiv Detail & Related papers (2021-04-09T16:43:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.