Spatial Sharing of GPU for Autotuning DNN models
- URL: http://arxiv.org/abs/2008.03602v1
- Date: Sat, 8 Aug 2020 21:27:38 GMT
- Title: Spatial Sharing of GPU for Autotuning DNN models
- Authors: Aditya Dhakal, Junguk Cho, Sameer G. Kulkarni, K. K. Ramakrishnan,
Puneet Sharma
- Abstract summary: Deep Neural Networks (DNNs) vary widely in their ability to
exploit the full power of high-performance GPUs.
We present several techniques to maximize resource utilization and improve tuning performance.
- Score: 4.63732827131233
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: GPUs are used for training, inference, and tuning the machine learning
models. However, Deep Neural Networks (DNNs) vary widely in their ability to
exploit the full power of high-performance GPUs. Spatial sharing of a GPU
enables multiplexing several DNNs on the GPU, improving GPU utilization and
thereby improving throughput and lowering latency. A DNN model given just the
right amount of GPU resources can still provide inference latency as low as
when the entire GPU is dedicated to its inference task. One approach to
improving DNN inference is to tune the DNN model. Autotuning frameworks find the
optimal low-level implementation for a certain target device based on the
trained machine learning model, thus reducing the DNN's inference latency and
increasing inference throughput. We observe an interdependency between the
tuned model and its inference latency. A DNN model tuned with a specific amount
of GPU resources provides the best inference latency when inference is run with
close to the same amount of GPU resources. A model tuned with the maximum
amount of the GPU's resources, however, has poorer inference latency once GPU
resources are limited at inference time. On the other hand, a model tuned with
an appropriate amount of GPU resources still achieves good inference latency
across a wide range of GPU resource availability. We explore the causes that
impact the tuning of a model at different amounts of GPU resources. We present
several techniques to maximize resource utilization and improve tuning
performance. We enable controlled spatial sharing of the GPU to multiplex
several tuning applications on the GPU. We scale the tuning server instances
and shard the
tuning model across multiple client instances for concurrent tuning of
different operators of a model, achieving better GPU multiplexing. With our
improvements, we decrease DNN autotuning time by up to 75 percent and increase
throughput by a factor of 5.
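To make the multiplexing concrete, the sketch below shows one way controlled spatial sharing and operator-level sharding of the tuning workload could be wired up. It assumes a TVM-style autotvm flow and NVIDIA MPS (via its CUDA_MPS_ACTIVE_THREAD_PERCENTAGE cap) as the sharing mechanism; the framework, the sharing mechanism, the example model, and the client count and per-client share are all assumptions not confirmed by the abstract, so treat this as an illustrative sketch rather than the authors' implementation.

```python
# Hedged sketch (not the authors' code): shard autotuning tasks across worker
# processes, each capped to a fraction of the GPU's SMs via NVIDIA MPS.
# TVM's autotvm API and the MPS environment variable are assumptions here.
import os
import multiprocessing as mp

import tvm
from tvm import autotvm
from tvm.relay import testing


def tune_shard(task_shard, gpu_share, log_file):
    """Tune a subset of operators using a capped share of the GPU's SMs."""
    # With the NVIDIA MPS daemon running, this caps the percentage of SMs
    # (active thread percentage) that this client process may occupy.
    os.environ["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(gpu_share)

    measure_option = autotvm.measure_option(
        builder=autotvm.LocalBuilder(timeout=10),
        runner=autotvm.LocalRunner(number=10, repeat=1, min_repeat_ms=100),
    )
    for task in task_shard:
        tuner = autotvm.tuner.XGBTuner(task)
        tuner.tune(
            n_trial=min(1000, len(task.config_space)),
            measure_option=measure_option,
            callbacks=[autotvm.callback.log_to_file(log_file)],
        )


if __name__ == "__main__":
    target = tvm.target.cuda()
    # Example workload; the paper's models are not named in the abstract.
    mod, params = testing.resnet.get_workload(num_layers=18, batch_size=1)
    # One tuning task per tunable operator (e.g. each conv2d) in the model.
    tasks = autotvm.task.extract_from_program(
        mod["main"], target=target, params=params
    )

    num_clients = 4   # hypothetical number of concurrent tuning clients
    gpu_share = 25    # hypothetical per-client SM percentage (100 / num_clients)
    shards = [tasks[i::num_clients] for i in range(num_clients)]

    procs = [
        mp.Process(target=tune_shard, args=(shard, gpu_share, f"tune_shard_{i}.json"))
        for i, shard in enumerate(shards)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

Each client process tunes a disjoint subset of the model's operators while MPS limits the fraction of SMs it can occupy; this is the kind of controlled multiplexing of tuning applications that the abstract describes.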
Related papers
- Data-driven Forecasting of Deep Learning Performance on GPUs [10.741682409837612]
NeuSight is a framework to predict the performance of various deep learning models, for both training and inference, on unseen GPUs without requiring actual execution.
NeuSight decomposes a single deep learning kernel prediction into smaller working sets called tiles, which are executed independently on the GPU.
It reduces the percentage error from 198% and 19.7% to 3.8% in predicting the latency of the GPT-3 model for training and inference on the H100, compared to state-of-the-art prior works.
arXiv Detail & Related papers (2024-07-18T18:47:52Z) - NeRF-XL: Scaling NeRFs with Multiple GPUs [72.75214892939411]
We present NeRF-XL, a principled method for distributing Neural Radiance Fields (NeRFs) across multiple GPUs.
We show improvements in reconstruction quality with larger parameter counts and speed improvements with more GPUs.
We demonstrate the effectiveness of NeRF-XL on a wide variety of datasets, including the largest open-source dataset to date, MatrixCity, containing 258K images covering a 25 km² city area.
arXiv Detail & Related papers (2024-04-24T21:43:15Z) - FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency
Trade-off in Language Model Inference [57.119047493787185]
This paper shows how to reduce model size by 43.1% and bring a $1.25\sim1.56\times$ wall-clock time speedup on different hardware with negligible accuracy drop.
arXiv Detail & Related papers (2024-01-08T17:29:16Z) - Benchmarking GPUs on SVBRDF Extractor Model [0.0]
In this work, we differentiate the performance of different GPUs on neural network models that operate on larger input images (256x256).
arXiv Detail & Related papers (2023-10-19T17:09:06Z) - Cramming: Training a Language Model on a Single GPU in One Day [64.18297923419627]
Recent trends in language modeling have focused on increasing performance through scaling.
We investigate the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU.
We provide evidence that even in this constrained setting, performance closely follows scaling laws observed in large-compute settings.
arXiv Detail & Related papers (2022-12-28T18:59:28Z) - A Study on the Intersection of GPU Utilization and CNN Inference [8.084016058894779]
Our study makes the case that there is room to improve the inference-time GPU utilization of convolutional neural network (CNN) inference, and that knowledge of GPU utilization has the potential to benefit even applications that do not target utilization itself.
arXiv Detail & Related papers (2022-12-15T16:11:40Z) - EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens [57.354304637367555]
We present EVEREST, a surprisingly efficient MVA approach for video representation learning.
It finds tokens containing rich motion features and discards uninformative ones during both pre-training and fine-tuning.
Our method significantly reduces the computation and memory requirements of MVA.
arXiv Detail & Related papers (2022-11-19T09:57:01Z) - An Analysis of Collocation on GPUs for Deep Learning Training [0.0]
Multi-Instance GPU (MIG) is a new technology introduced by NVIDIA that can partition a GPU to better fit workloads.
In this paper, we examine the performance of a MIG-enabled A100 GPU under deep learning workloads containing various sizes and combinations of models.
arXiv Detail & Related papers (2022-09-13T14:13:06Z) - Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous
Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z) - Accelerating Multi-Model Inference by Merging DNNs of Different Weights [3.4123736336071864]
We propose NetFuse, a technique of merging multiple DNN models that share the same architecture but have different weights and different inputs.
Experiments on ResNet-50, ResNeXt-50, BERT, and XLNet show that NetFuse can speed up DNN inference by up to 3.6x on an NVIDIA V100 GPU.
arXiv Detail & Related papers (2020-09-28T04:33:09Z) - Hybrid Models for Learning to Branch [81.93868699246214]
We propose a new hybrid architecture for efficient branching on CPU machines.
The proposed architecture combines the expressive power of GNNs with computationally inexpensive multi-layer perceptrons (MLP) for branching.
arXiv Detail & Related papers (2020-06-26T21:03:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.