Harmony: Overcoming the hurdles of GPU memory capacity to train massive
DNN models on commodity servers
- URL: http://arxiv.org/abs/2202.01306v1
- Date: Wed, 2 Feb 2022 22:16:27 GMT
- Authors: Youjie Li, Amar Phanishayee, Derek Murray, Jakub Tarnawski, Nam Sung
Kim
- Abstract summary: Deep neural networks (DNNs) have grown exponentially in complexity and size over the past decade.
One of the main challenges for researchers with access to only limited resources is that GPU memory capacity is small relative to model size.
- Score: 13.620650014358413
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep neural networks (DNNs) have grown exponentially in complexity and size
over the past decade, leaving only those who have access to massive
datacenter-based resources with the ability to develop and train such models.
One of the main challenges for the long tail of researchers who might have
access to only limited resources (e.g., a single multi-GPU server) is limited
GPU memory capacity compared to model size. The problem is so acute that the
memory requirement of training large DNN models can often exceed the aggregate
capacity of all available GPUs on commodity servers; this problem only gets
worse with the trend of ever-growing model sizes. Current solutions that rely
on virtualizing GPU memory (by swapping to/from CPU memory) incur excessive
swapping overhead. In this paper, we present a new training framework, Harmony,
and advocate rethinking how DNN frameworks schedule computation and move data
to push the boundaries of training large models efficiently on modest multi-GPU
deployments. Across many large DNN models, Harmony is able to reduce swap load
by up to two orders of magnitude and obtain a training throughput speedup of up
to 7.6x over highly optimized baselines with virtualized memory.
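The abstract's core claim, that rescheduling computation and data movement can slash swap traffic, can be illustrated with a toy cache simulation. This is only a sketch: the trace, unit-size tensors, and eviction policies below are illustrative assumptions, not Harmony's actual scheduler. It compares naive FIFO eviction against Belady's rule (evict the tensor whose next use is farthest in the future), which reuse-aware schedulers approximate by planning data movement ahead of time.

```python
# Toy simulation of swapping tensors between GPU and CPU memory under a
# fixed GPU capacity. NOT Harmony's actual algorithm -- only an
# illustration of why eviction/scheduling order matters for swap load.

def swap_load(access_trace, capacity, policy):
    """Count swap-ins for a trace of unit-size tensor accesses."""
    resident = []      # tensors currently resident in GPU memory
    swapped_in = 0
    for i, t in enumerate(access_trace):
        if t not in resident:
            swapped_in += 1
            if len(resident) == capacity:
                if policy == "fifo":
                    resident.pop(0)
                else:  # "belady": evict the tensor reused farthest away
                    def next_use(x):
                        future = access_trace[i + 1:]
                        return future.index(x) if x in future else float("inf")
                    resident.remove(max(resident, key=next_use))
            resident.append(t)
    return swapped_in

# Forward/backward-style trace that revisits activations A, B, C.
trace = list("ABCABC")
print(swap_load(trace, capacity=2, policy="fifo"))    # → 6 swap-ins
print(swap_load(trace, capacity=2, policy="belady"))  # → 4 swap-ins
```

Harmony's reported two-orders-of-magnitude swap reduction comes from much more than eviction order (it also reschedules computation itself), but the underlying principle the toy shows is the same: plan data movement against future reuse instead of reacting to memory pressure.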
Related papers
- AI and Memory Wall [81.06494558184049]
We show how memory bandwidth can become the dominant bottleneck for decoder models.
We argue for a redesign in model architecture, training, and deployment strategies to overcome this memory limitation.
arXiv Detail & Related papers (2024-03-21T04:31:59Z)
- GraNNDis: Efficient Unified Distributed Training Framework for Deep GNNs on Large Clusters [8.137466511979586]
Graph neural networks (GNNs) are one of the most rapidly growing fields within deep learning.
GraNNDis is an efficient distributed GNN training framework for training GNNs on large graphs and deep layers.
GraNNDis provides superior speedup over the state-of-the-art distributed GNN training frameworks.
arXiv Detail & Related papers (2023-11-12T13:30:31Z)
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system that unlocks the vast untapped potential of consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and peer and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU [89.2451963569343]
FlexGen is a generation engine for running large language model (LLM) inference on a single commodity GPU.
When running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems.
On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours.
arXiv Detail & Related papers (2023-03-13T05:19:28Z)
- On-Device Training Under 256KB Memory [62.95579393237751]
We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory.
Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB and 1MB Flash.
arXiv Detail & Related papers (2022-06-30T17:59:08Z)
- M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining [55.16088793437898]
Training extreme-scale models requires enormous amounts of compute and memory.
We propose a simple training strategy called "Pseudo-to-Real" for large models with high memory footprint requirements.
arXiv Detail & Related papers (2021-10-08T04:24:51Z)
- ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning [9.322987670900778]
ZeRO-Infinity can fit models with tens and even hundreds of trillions of parameters for training on current generation GPU clusters.
It can be used to fine-tune trillion parameter models on a single NVIDIA DGX-2 node, making large models more accessible.
arXiv Detail & Related papers (2021-04-16T02:22:12Z)
- DistGNN: Scalable Distributed Training for Large-Scale Graph Neural Networks [58.48833325238537]
Full-batch training on Graph Neural Networks (GNN) to learn the structure of large graphs is a critical problem that needs to scale to hundreds of compute nodes to be feasible.
In this paper, we present DistGNN, which optimizes the well-known Deep Graph Library (DGL) for full-batch training on CPU clusters.
Our results on four common GNN benchmark datasets show up to 3.7x speed-up using a single CPU socket and up to 97x speed-up using 128 CPU sockets.
arXiv Detail & Related papers (2021-04-14T08:46:35Z)
- ZeRO-Offload: Democratizing Billion-Scale Model Training [16.43347399073034]
ZeRO-Offload enables large model training by offloading data and compute to CPU.
It can train models with over 13 billion parameters on a single GPU, a 10x increase in size compared to popular frameworks such as PyTorch.
arXiv Detail & Related papers (2021-01-18T02:11:25Z)
- Accelerating Multi-Model Inference by Merging DNNs of Different Weights [3.4123736336071864]
We propose NetFuse, a technique of merging multiple DNN models that share the same architecture but have different weights and different inputs.
Experiments on ResNet-50, ResNeXt-50, BERT, and XLNet show that NetFuse can speed up DNN inference time by up to 3.6x on an NVIDIA V100 GPU.
arXiv Detail & Related papers (2020-09-28T04:33:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.