Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning
- URL: http://arxiv.org/abs/2012.02732v1
- Date: Fri, 4 Dec 2020 17:25:46 GMT
- Title: Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning
- Authors: Woosuk Kwon, Gyeong-In Yu, Eunji Jeong, Byung-Gon Chun
- Abstract summary: We propose Nimble, a deep learning (DL) execution engine that runs tasks in parallel with minimal scheduling overhead.
Nimble automatically parallelizes the execution of GPU tasks by exploiting multiple GPU streams in a single GPU.
Evaluation on a variety of neural networks shows that compared to PyTorch, Nimble speeds up inference and training by up to 22.34$\times$ and 3.61$\times$, respectively.
- Score: 7.43260596107574
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning (DL) frameworks take advantage of GPUs to improve the speed of
DL inference and training. Ideally, DL frameworks should be able to fully
utilize the computation power of GPUs such that the running time depends on the
amount of computation assigned to GPUs. Yet, we observe that in scheduling GPU
tasks, existing DL frameworks suffer from inefficiencies such as large
scheduling overhead and unnecessary serial execution. To this end, we propose
Nimble, a DL execution engine that runs GPU tasks in parallel with minimal
scheduling overhead. Nimble introduces a novel technique called ahead-of-time
(AoT) scheduling. Here, the scheduling procedure finishes before executing the
GPU kernel, thereby removing most of the scheduling overhead during run time.
Furthermore, Nimble automatically parallelizes the execution of GPU tasks by
exploiting multiple GPU streams in a single GPU. Evaluation on a variety of
neural networks shows that compared to PyTorch, Nimble speeds up inference and
training by up to 22.34$\times$ and 3.61$\times$, respectively. Moreover,
Nimble outperforms state-of-the-art inference systems, TensorRT and TVM, by up
to 2.81$\times$ and 1.70$\times$, respectively.
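The abstract points to two mechanisms: moving the scheduling procedure off the run-time critical path (AoT scheduling) and overlapping independent GPU tasks on multiple streams of a single GPU. The sketch below is not Nimble's implementation; it is a minimal illustration of both ideas using stock PyTorch primitives (torch.cuda.Stream for stream-level overlap, and torch.cuda.CUDAGraph as a rough analogue of pre-recording the kernel launch sequence), assuming a CUDA-capable GPU and PyTorch 1.10 or later. The tensors and operations are arbitrary placeholders.

```python
# Illustrative sketch only, not Nimble's code. Assumes PyTorch >= 1.10 with a
# CUDA-capable GPU; the tensors and ops below are arbitrary placeholders.
import torch

device = torch.device("cuda")
a = torch.randn(2048, 2048, device=device)
b = torch.randn(2048, 2048, device=device)

# --- Idea 1: overlap independent GPU tasks on multiple streams of one GPU ---
s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
s1.wait_stream(torch.cuda.current_stream())   # fork after already-queued work
s2.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s1):
    x = a @ a                                 # no data dependency between these
with torch.cuda.stream(s2):
    y = b @ b                                 # two matmuls, so they may overlap
torch.cuda.current_stream().wait_stream(s1)   # join before consuming results
torch.cuda.current_stream().wait_stream(s2)
z = x + y
torch.cuda.synchronize()

# --- Idea 2: record the kernel launch sequence once, replay it at run time ---
static_in = torch.randn(2048, 2048, device=device)

# Warm-up on a side stream, as recommended before CUDA graph capture.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side):
    for _ in range(3):
        static_out = torch.relu(static_in @ a)
torch.cuda.current_stream().wait_stream(side)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = torch.relu(static_in @ a)

# Per-iteration cost is now a single graph replay; framework-side scheduling
# (Python dispatch, launch bookkeeping) stays off the critical path.
static_in.copy_(torch.randn(2048, 2048, device=device))
g.replay()
torch.cuda.synchronize()
print(static_out.sum().item())
```

Replaying a pre-captured graph mirrors the effect the paper attributes to AoT scheduling: the per-iteration framework work no longer sits between GPU kernels, while the fork/join stream pattern shows how independent tasks can execute concurrently on one device.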
Related papers
- SGPRS: Seamless GPU Partitioning Real-Time Scheduler for Periodic Deep Learning Workloads [0.9898607871253774]
We propose SGPRS, the first real-time GPU scheduler to consider zero-configuration partition switching.
The proposed scheduler not only meets more deadlines for parallel tasks but also sustains overall performance beyond the pivot point.
arXiv Detail & Related papers (2024-04-13T18:29:26Z)
- FIKIT: Priority-Based Real-time GPU Multi-tasking Scheduling with Kernel Identification [2.9271819018953162]
In a cloud computing cluster, sharing a GPU's computation power among multiple tasks is in high demand.
Existing GPU sharing solutions focus on reducing task-level waiting time or task-level switching costs when multiple jobs compete for a single GPU.
We present a novel kernel-level scheduling strategy called FIKIT: Filling Inter-Kernel Idle Time.
arXiv Detail & Related papers (2023-11-17T07:25:18Z)
- PARTIME: Scalable and Parallel Processing Over Time with Deep Neural Networks [68.96484488899901]
We present PARTIME, a library designed to speed up neural networks whenever data is continuously streamed over time.
PARTIME starts processing each data sample at the time in which it becomes available from the stream.
Experiments are performed in order to empirically compare PARTIME with classic non-parallel neural computations in online learning.
arXiv Detail & Related papers (2022-10-17T14:49:14Z)
- Batch-efficient EigenDecomposition for Small and Medium Matrices [65.67315418971688]
EigenDecomposition (ED) is at the heart of many computer vision algorithms and applications.
We propose a QR-based ED method dedicated to the application scenarios of computer vision.
arXiv Detail & Related papers (2022-07-09T09:14:12Z)
- PLSSVM: A (multi-)GPGPU-accelerated Least Squares Support Vector Machine [68.8204255655161]
Support Vector Machines (SVMs) are widely used in machine learning.
However, even modern and optimized implementations do not scale well for large non-trivial dense data sets on cutting-edge hardware.
PLSSVM can be used as a drop-in replacement for LIBSVM.
arXiv Detail & Related papers (2022-02-25T13:24:23Z)
- AxoNN: An asynchronous, message-driven parallel framework for extreme-scale deep learning [1.5301777464637454]
AxoNN is a parallel deep learning framework that exploits asynchrony and message-driven execution to schedule neural network operations on each GPU.
By using the CPU memory as a scratch space for offloading data periodically during training, AxoNN is able to reduce GPU memory consumption by four times.
arXiv Detail & Related papers (2021-10-25T14:43:36Z)
- Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z)
- Scheduling Optimization Techniques for Neural Network Training [3.1617796705744547]
This paper proposes out-of-order (ooo) backprop, an effective scheduling technique for neural network training.
We show that the GPU utilization in single-GPU, data-parallel, and pipeline-parallel training can be commonly improved by applying ooo backprop.
arXiv Detail & Related papers (2021-10-03T05:45:06Z)
- RTGPU: Real-Time GPU Scheduling of Hard Deadline Parallel Tasks with Fine-Grain Utilization [5.02836935036198]
We propose RTGPU, which can schedule the execution of multiple GPU applications in real-time to meet hard deadlines.
Our approach provides superior schedulability compared with previous work, and gives real-time guarantees to meet hard deadlines for multiple GPU applications.
arXiv Detail & Related papers (2021-01-25T22:34:06Z)
- Hybrid Models for Learning to Branch [81.93868699246214]
We propose a new hybrid architecture for efficient branching on CPU machines.
The proposed architecture combines the expressive power of GNNs with computationally inexpensive multi-layer perceptrons (MLP) for branching.
arXiv Detail & Related papers (2020-06-26T21:03:45Z)
- Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
arXiv Detail & Related papers (2020-06-18T08:16:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.