Harmony: Overcoming the hurdles of GPU memory capacity to train massive
DNN models on commodity servers
- URL: http://arxiv.org/abs/2202.01306v1
- Date: Wed, 2 Feb 2022 22:16:27 GMT
- Title: Harmony: Overcoming the hurdles of GPU memory capacity to train massive
DNN models on commodity servers
- Authors: Youjie Li, Amar Phanishayee, Derek Murray, Jakub Tarnawski, Nam Sung
Kim
- Abstract summary: Deep neural networks (DNNs) have grown exponentially in complexity and size over the past decade.
One of the main challenges for researchers who might have access to only limited resources is limited memory capacity compared to model size.
- Score: 13.620650014358413
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep neural networks (DNNs) have grown exponentially in complexity and size
over the past decade, leaving only those who have access to massive
datacenter-based resources with the ability to develop and train such models.
One of the main challenges for the long tail of researchers who might have
access to only limited resources (e.g., a single multi-GPU server) is limited
GPU memory capacity compared to model size. The problem is so acute that the
memory requirement of training large DNN models can often exceed the aggregate
capacity of all available GPUs on commodity servers; this problem only gets
worse with the trend of ever-growing model sizes. Current solutions that rely
on virtualizing GPU memory (by swapping to/from CPU memory) incur excessive
swapping overhead. In this paper, we present a new training framework, Harmony,
and advocate rethinking how DNN frameworks schedule computation and move data
to push the boundaries of training large models efficiently on modest multi-GPU
deployments. Across many large DNN models, Harmony is able to reduce swap load
by up to two orders of magnitude and obtain a training throughput speedup of up
to 7.6x over highly optimized baselines with virtualized memory.
Related papers
- MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs [55.95879347182669]
MoE architecture is renowned for its ability to increase model capacity without a proportional increase in inference cost.
MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization.
MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB)
arXiv Detail & Related papers (2024-11-18T01:06:12Z) - Superpipeline: A Universal Approach for Reducing GPU Memory Usage in Large Models [40.41898661688188]
This paper introduces Superpipeline, a framework designed to optimize the execution of large AI models on constrained hardware.
Superpipeline reduces GPU memory usage by up to 60% in our experiments while maintaining model accuracy and acceptable processing speeds.
arXiv Detail & Related papers (2024-10-11T13:17:05Z) - Less Memory Means smaller GPUs: Backpropagation with Compressed Activations [1.7065506903618906]
The ever-growing scale of deep neural networks (DNNs) has lead to an equally rapid growth in computational resource requirements.
Many recent architectures, most prominently Large Language Models, have to be trained using supercomputers with thousands of accelerators.
With this approach we are able to reduce the peak memory consumption by 29% at the cost of a longer training schedule.
arXiv Detail & Related papers (2024-09-18T11:57:05Z) - AI and Memory Wall [81.06494558184049]
We show how memory bandwidth can become the dominant bottleneck for decoder models.
We argue for a redesign in model architecture, training, and deployment strategies to overcome this memory limitation.
arXiv Detail & Related papers (2024-03-21T04:31:59Z) - FusionAI: Decentralized Training and Deploying LLMs with Massive
Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential vast untapped consumer-level GPU.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, the variability of peer and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z) - FlexGen: High-Throughput Generative Inference of Large Language Models
with a Single GPU [89.2451963569343]
FlexGen is a generation engine for running large language model (LLM) inference on a single commodity GPU.
When running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems.
On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours.
arXiv Detail & Related papers (2023-03-13T05:19:28Z) - M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion
Parameter Pretraining [55.16088793437898]
Training extreme-scale models requires enormous amounts of computes and memory footprint.
We propose a simple training strategy called "Pseudo-to-Real" for high-memory-footprint-required large models.
arXiv Detail & Related papers (2021-10-08T04:24:51Z) - ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep
Learning [9.322987670900778]
ZeRO-Infinity can fit models with tens and even hundreds of trillions of parameters for training on current generation GPU clusters.
It can be used to fine-tune trillion parameter models on a single NVIDIA DGX-2 node, making large models more accessible.
arXiv Detail & Related papers (2021-04-16T02:22:12Z) - DistGNN: Scalable Distributed Training for Large-Scale Graph Neural
Networks [58.48833325238537]
Full-batch training on Graph Neural Networks (GNN) to learn the structure of large graphs is a critical problem that needs to scale to hundreds of compute nodes to be feasible.
In this paper, we presentGNN that optimize the well-known Deep Graph Library (DGL) for full-batch training on CPU clusters.
Our results on four common GNN benchmark datasets show up to 3.7x speed-up using a single CPU socket and up to 97x speed-up using 128 CPU sockets.
arXiv Detail & Related papers (2021-04-14T08:46:35Z) - ZeRO-Offload: Democratizing Billion-Scale Model Training [16.43347399073034]
ZeRO-Offload enables large model training by offloading data and compute to CPU.
It can train models with over 13 billion parameters on a single GPU, a 10x increase in size compared to popular framework such as PyTorch.
arXiv Detail & Related papers (2021-01-18T02:11:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.