An Analysis of Collocation on GPUs for Deep Learning Training
- URL: http://arxiv.org/abs/2209.06018v3
- Date: Mon, 24 Apr 2023 08:46:16 GMT
- Title: An Analysis of Collocation on GPUs for Deep Learning Training
- Authors: Ties Robroek, Ehsan Yousefzadeh-Asl-Miandoab, Pınar Tözün
- Abstract summary: Multi-Instance GPU (MIG) is a new technology introduced by NVIDIA that can partition a GPU to better-fit workloads.
In this paper, we examine the performance of a MIG-enabled A100 GPU under deep learning workloads containing various sizes and combinations of models.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep learning training is an expensive process that extensively uses GPUs,
but not all model training saturates modern powerful GPUs. Multi-Instance GPU
(MIG) is a new technology introduced by NVIDIA that can partition a GPU to
better-fit workloads that do not require all the memory and compute resources
of a full GPU. In this paper, we examine the performance of a MIG-enabled A100
GPU under deep learning workloads containing various sizes and combinations of
models. We contrast the benefits of MIG to older workload collocation methods
on GPUs: naïvely submitting multiple processes on the same GPU and utilizing
Multi-Process Service (MPS). Our results demonstrate that collocating multiple
model training runs may yield significant benefits. In certain cases, it can
increase training throughput by up to four times despite increased epoch time. On the
other hand, the aggregate memory footprint and compute needs of the models
trained in parallel must fit the available memory and compute resources of the
GPU. MIG can be beneficial thanks to its interference-free partitioning,
especially when the sizes of the models align with the MIG partitioning
options. MIG's rigid partitioning, however, may create sub-optimal GPU
utilization for more dynamic mixed workloads. In general, we recommend MPS as
the best performing and most flexible form of collocation for model training
for a single user submitting training jobs.
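As an illustration of the three collocation modes compared in the abstract, the following Python sketch launches two training processes on a single GPU naively, under MPS, and pinned to MIG instances. This is a minimal sketch, not the paper's experimental harness: the training script `train.py`, the model names, and the MIG instance UUIDs are placeholder assumptions, while the `CUDA_VISIBLE_DEVICES` variable and the `nvidia-cuda-mps-control` daemon commands follow NVIDIA's documented interfaces.

```python
"""Minimal sketch of the three collocation modes compared in the paper:
naive process sharing, MPS, and MIG. The training script `train.py`, the
model names, and the MIG UUIDs are placeholders, not from the paper."""
import os
import subprocess


def launch(model: str, extra_env: dict) -> subprocess.Popen:
    """Start one training process with the given environment overrides."""
    env = {**os.environ, **extra_env}
    return subprocess.Popen(["python", "train.py", "--model", model], env=env)


MODELS = ["resnet26", "resnet50"]  # placeholder workload mix

# 1) Naive collocation: both processes target GPU 0 and time-share it,
#    each creating its own CUDA context.
for p in [launch(m, {"CUDA_VISIBLE_DEVICES": "0"}) for m in MODELS]:
    p.wait()

# 2) MPS collocation: start the MPS control daemon, then launch clients;
#    their kernels are funneled through a shared server process.
subprocess.run(["nvidia-cuda-mps-control", "-d"], check=False)
for p in [launch(m, {"CUDA_VISIBLE_DEVICES": "0"}) for m in MODELS]:
    p.wait()
subprocess.run("echo quit | nvidia-cuda-mps-control", shell=True, check=False)

# 3) MIG collocation: each process is pinned to its own GPU instance by
#    passing the instance UUID (as listed by `nvidia-smi -L`) instead of
#    a device index.
MIG_UUIDS = ["MIG-<uuid-0>", "MIG-<uuid-1>"]  # placeholders
for p in [launch(m, {"CUDA_VISIBLE_DEVICES": u}) for m, u in zip(MODELS, MIG_UUIDS)]:
    p.wait()
```

For the MIG mode, the GPU instances must be created beforehand on a MIG-enabled GPU (for example with `nvidia-smi mig`) and their UUIDs read from `nvidia-smi -L`; the partition sizes chosen at that step determine whether the models' memory and compute needs align with the available MIG options, as discussed in the abstract.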
Related papers
- MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs [55.95879347182669]
The MoE architecture is renowned for its ability to increase model capacity without a proportional increase in inference cost.
MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization.
MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB)
arXiv Detail & Related papers (2024-11-18T01:06:12Z) - Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading [2.8231000588510757]
Transformers and large language models (LLMs) have seen rapid adoption in all domains.
Training transformers is very expensive and often hits a "memory wall".
We propose a novel technique to split the LLM into subgroups, whose update phase is scheduled on either the CPU or the GPU.
arXiv Detail & Related papers (2024-10-26T00:43:59Z) - Hierarchical Resource Partitioning on Modern GPUs: A Reinforcement Learning Approach [1.076745840431781]
We propose a method for comprehensively co-optimizing the setup of hierarchical partitioning and the selection of co-scheduling groups from a given set of jobs.
This results in a maximum throughput improvement of 1.87x compared to time-sharing scheduling.
arXiv Detail & Related papers (2024-05-14T16:40:06Z) - Benchmarking GPUs on SVBRDF Extractor Model [0.0]
In this work, we try to differentiate the performance of different GPUs on neural network models that operate on bigger input images (256x256).
arXiv Detail & Related papers (2023-10-19T17:09:06Z) - FusionAI: Decentralized Training and Deploying LLMs with Massive
Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential of vast untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and the variability of peer and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z) - FlexGen: High-Throughput Generative Inference of Large Language Models
with a Single GPU [89.2451963569343]
FlexGen is a generation engine for running large language model (LLM) inference on a single commodity GPU.
When running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems.
On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours.
arXiv Detail & Related papers (2023-03-13T05:19:28Z) - EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens [57.354304637367555]
We present EVEREST, a surprisingly efficient MVA approach for video representation learning.
It finds tokens containing rich motion features and discards uninformative ones during both pre-training and fine-tuning.
Our method significantly reduces the computation and memory requirements of MVA.
arXiv Detail & Related papers (2022-11-19T09:57:01Z) - A Frequency-aware Software Cache for Large Recommendation System
Embeddings [11.873521953539361]
Deep learning recommendation models (DLRMs) have been widely applied in Internet companies.
We propose a GPU-based software cache approach to dynamically manage the embedding table in the CPU and GPU memory space.
Our proposed software cache is efficient in training entire DLRMs on GPU in a synchronized update manner.
arXiv Detail & Related papers (2022-08-08T12:08:05Z) - PARIS and ELSA: An Elastic Scheduling Algorithm for Reconfigurable
Multi-GPU Inference Servers [0.9854614058492648]
NVIDIA's Ampere GPU architecture provides features to "reconfigure" one large, monolithic GPU into multiple smaller "GPU partitions"
In this paper, we study this emerging GPU architecture with reconfigurability to develop a high-performance multi-GPU ML inference server.
arXiv Detail & Related papers (2022-02-27T23:30:55Z) - PLSSVM: A (multi-)GPGPU-accelerated Least Squares Support Vector Machine [68.8204255655161]
Support Vector Machines (SVMs) are widely used in machine learning.
However, even modern and optimized implementations do not scale well for large non-trivial dense data sets on cutting-edge hardware.
PLSSVM can be used as a drop-in replacement for LIBSVM.
arXiv Detail & Related papers (2022-02-25T13:24:23Z) - Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous
Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z)