Miriam: Exploiting Elastic Kernels for Real-time Multi-DNN Inference on
Edge GPU
- URL: http://arxiv.org/abs/2307.04339v1
- Date: Mon, 10 Jul 2023 04:30:44 GMT
- Title: Miriam: Exploiting Elastic Kernels for Real-time Multi-DNN Inference on
Edge GPU
- Authors: Zhihe Zhao, Neiwen Ling, Nan Guan, Guoliang Xing
- Abstract summary: Miriam is a contention-aware task coordination framework for
multi-DNN inference on edge GPU, supporting the concurrent running of multiple
deep neural networks (DNNs) with mixed criticality.
- Score: 7.972518585452826
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many applications, such as autonomous driving and augmented reality,
require the concurrent running of multiple deep neural networks (DNNs) that
pose different levels of real-time performance requirements. However,
coordinating multiple DNN tasks with varying levels of criticality on edge
GPUs remains an area of limited study. Unlike server-level GPUs, edge GPUs are
resource-limited and lack hardware-level resource management mechanisms for
avoiding resource contention. Therefore, we propose Miriam, a contention-aware
task coordination framework for multi-DNN inference on edge GPU. Miriam
consolidates two main components, an elastic-kernel generator and a runtime
dynamic kernel coordinator, to support mixed-criticality DNN inference. To
evaluate Miriam, we build a new DNN inference benchmark based on CUDA with
diverse representative DNN workloads. Experiments on two edge GPU platforms
show that Miriam can increase system throughput by 92% while incurring less
than 10% latency overhead for critical tasks, compared to state-of-the-art
baselines.
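The elastic-kernel generator and kernel coordinator are internal to Miriam and not detailed in this summary. As a rough illustration of the contention problem it targets, the following PyTorch sketch (our assumption, not Miriam's implementation) runs a latency-critical model and a best-effort model concurrently on one GPU using CUDA stream priorities; stream priorities alone give only coarse control, since best-effort kernels already resident on the SMs can still delay critical kernels.

```python
# Illustrative sketch (not Miriam's implementation): a critical and a
# best-effort DNN run concurrently on one edge GPU via CUDA stream
# priorities. Without kernel-level coordination, best-effort kernels can
# still occupy SMs and delay the critical task -- the contention Miriam
# targets with elastic kernels.
import torch
import torchvision.models as models

device = torch.device("cuda")
critical_net = models.resnet18().to(device).eval()    # e.g. obstacle detection
besteffort_net = models.resnet50().to(device).eval()  # e.g. scene labeling

# Lower priority value = higher scheduling priority on CUDA devices.
critical_stream = torch.cuda.Stream(priority=-1)
besteffort_stream = torch.cuda.Stream(priority=0)

x_crit = torch.randn(1, 3, 224, 224, device=device)
x_be = torch.randn(8, 3, 224, 224, device=device)

with torch.no_grad():
    with torch.cuda.stream(besteffort_stream):
        y_be = besteffort_net(x_be)    # long-running, non-critical kernels
    with torch.cuda.stream(critical_stream):
        y_crit = critical_net(x_crit)  # latency-critical kernels
torch.cuda.synchronize()
```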
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
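A hedged sketch of adaptive gradient compression for decentralized training (the top-k scheme and the bandwidth-to-ratio rule are our assumptions, not necessarily FusionLLM's method): slower links transmit a sparser gradient.

```python
# Assumed illustration of adaptive compression: keep a fraction of gradient
# entries that scales with the measured link bandwidth.
import torch

def topk_compress(grad: torch.Tensor, ratio: float):
    k = max(1, int(grad.numel() * ratio))
    flat = grad.flatten()
    _, idx = flat.abs().topk(k)    # indices of the k largest-magnitude entries
    return idx, flat[idx]          # what actually gets transmitted

def adaptive_ratio(bandwidth_mbps: float, lo=0.01, hi=0.3) -> float:
    # Scale the kept fraction with link speed, clamped to [lo, hi].
    return min(hi, max(lo, bandwidth_mbps / 1000.0))

grad = torch.randn(1_000_000)
idx, vals = topk_compress(grad, adaptive_ratio(bandwidth_mbps=50.0))
print(len(vals))  # 50 Mbps -> keep 5% of gradient entries
```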
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- MTL-Split: Multi-Task Learning for Edge Devices using Split Computing [11.357748232689628]
In Split Computing (SC), a Deep Neural Network (DNN) is intelligently split, with part of it deployed on an edge device and the rest on a remote server.
This paper studies this problem, and MTL-Split, the proposed architecture, shows encouraging results on both synthetic and real-world data.
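A minimal sketch of the split-computing setup MTL-Split builds on (the layer stack, split index, and task heads below are illustrative assumptions, not the paper's architecture): the first k layers run on the edge device, and a shared trunk with two task-specific heads runs on the server.

```python
# Split computing with multi-task heads, as an assumed simplification.
import torch
import torch.nn as nn

layers = [nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
          nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
          nn.AdaptiveAvgPool2d(1), nn.Flatten()]
k = 2  # split point: layers[:k] on the edge, layers[k:] on the server
edge_part = nn.Sequential(*layers[:k])
server_trunk = nn.Sequential(*layers[k:])
head_a = nn.Linear(32, 10)   # task A, e.g. classification
head_b = nn.Linear(32, 4)    # task B, e.g. bounding-box regression

x = torch.randn(1, 3, 64, 64)
z = edge_part(x)              # computed on the edge; z is what gets transmitted
feat = server_trunk(z)        # computed on the server
out_a, out_b = head_a(feat), head_b(feat)
```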
arXiv Detail & Related papers (2024-07-08T14:25:39Z)
- SGPRS: Seamless GPU Partitioning Real-Time Scheduler for Periodic Deep Learning Workloads [0.9898607871253774]
We propose SGPRS, the first real-time GPU scheduler to account for zero-configuration partition switching.
The proposed scheduler not only meets more deadlines for parallel tasks but also sustains overall performance beyond the pivot point.
arXiv Detail & Related papers (2024-04-13T18:29:26Z)
- Dynamic DNNs and Runtime Management for Efficient Inference on Mobile/Embedded Devices [2.8851756275902476]
Deep neural network (DNN) inference is increasingly being executed on mobile and embedded platforms.
We co-designed novel Dynamic Super-Networks to maximise system-level performance and energy efficiency.
Compared with the state of the art, experiments using ImageNet on the GPU of a Jetson Xavier NX show our model is 2.4x faster at similar ImageNet Top-1 accuracy, or 5.1% more accurate at similar latency.
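A small sketch of the runtime-management idea (the sub-network accuracy/latency table below is fabricated for illustration): a super-network exposes sub-networks trading accuracy for latency, and the runtime picks the most accurate one that fits the current budget.

```python
# Assumed sub-network table for one dynamic super-network on one device.
SUBNETS = [
    {"width": 0.25, "top1": 61.2, "latency_ms": 4.1},
    {"width": 0.50, "top1": 68.4, "latency_ms": 7.9},
    {"width": 1.00, "top1": 74.6, "latency_ms": 18.3},
]

def select_subnet(budget_ms: float) -> dict:
    feasible = [s for s in SUBNETS if s["latency_ms"] <= budget_ms]
    if not feasible:
        return min(SUBNETS, key=lambda s: s["latency_ms"])  # degrade gracefully
    return max(feasible, key=lambda s: s["top1"])           # most accurate fit

print(select_subnet(budget_ms=10.0)["width"])  # -> 0.5
```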
arXiv Detail & Related papers (2024-01-17T04:40:30Z)
- Sparse-DySta: Sparsity-Aware Dynamic and Static Scheduling for Sparse Multi-DNN Workloads [65.47816359465155]
Running multiple deep neural networks (DNNs) in parallel has become an emerging workload in both edge devices and data centers.
We propose Dysta, a novel scheduler that utilizes both static sparsity patterns and dynamic sparsity information for the sparse multi-DNN scheduling.
Our proposed approach outperforms the state-of-the-art methods with up to 10% decrease in latency constraint violation rate and nearly 4X reduction in average normalized turnaround time.
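A toy sketch of sparsity-aware scheduling in the spirit of Dysta (the Job fields, latency model, and least-slack rule are our assumptions, not the paper's algorithm): latency predictions combine static, profile-time sparsity with the sparsity measured on the current input, and the predictions drive job selection.

```python
# Assumed illustration: sparsity-aware latency prediction + least-slack-first.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    dense_latency_ms: float   # profiled latency at zero sparsity
    static_sparsity: float    # weight sparsity known before deployment
    dynamic_sparsity: float   # activation sparsity observed at runtime
    deadline_ms: float        # deadline relative to now

def predicted_latency(job: Job) -> float:
    # Toy model: latency shrinks with both sparsity sources.
    return (job.dense_latency_ms
            * (1 - job.static_sparsity) * (1 - job.dynamic_sparsity))

def pick_next(jobs: list[Job]) -> Job:
    # Run the job with the least slack under the sparsity-aware prediction.
    return min(jobs, key=lambda j: j.deadline_ms - predicted_latency(j))

jobs = [Job("detector", 12.0, 0.5, 0.3, deadline_ms=15.0),
        Job("segmenter", 30.0, 0.7, 0.6, deadline_ms=40.0)]
print(pick_next(jobs).name)  # -> "detector"
```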
arXiv Detail & Related papers (2023-10-17T09:25:17Z)
- An efficient and flexible inference system for serving heterogeneous ensembles of deep neural networks [0.0]
Ensembles of Deep Neural Networks (DNNs) achieve high-quality predictions, but they are compute- and memory-intensive.
We propose a new software layer to serve ensembles of DNNs with flexibility and efficiency.
arXiv Detail & Related papers (2022-08-30T08:05:43Z)
- Towards a General Purpose CNN for Long Range Dependencies in $\mathrm{N}$D [49.57261544331683]
We propose a single CNN architecture equipped with continuous convolutional kernels for tasks on arbitrary resolution, dimensionality and length without structural changes.
We show the generality of our approach by applying the same CCNN to a wide set of tasks on sequential ($1\mathrm{D}$) and visual data ($2\mathrm{D}$).
Our CCNN performs competitively and often outperforms the current state-of-the-art across all tasks considered.
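A simplified sketch of a continuous convolutional kernel (our reading of the idea; the MLP parameterization below is an assumption): kernel values are generated by a small network from continuous coordinates, so the same weights can be sampled at any kernel size.

```python
# Assumed illustration: an MLP maps a kernel coordinate to kernel values,
# giving one parameterization usable at arbitrary resolution.
import torch
import torch.nn as nn

class ContinuousKernel1d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, hidden: int = 32):
        super().__init__()
        self.in_ch, self.out_ch = in_ch, out_ch
        self.mlp = nn.Sequential(nn.Linear(1, hidden), nn.GELU(),
                                 nn.Linear(hidden, in_ch * out_ch))

    def forward(self, x: torch.Tensor, kernel_size: int) -> torch.Tensor:
        # Sample the kernel on a normalized grid of the requested size.
        pos = torch.linspace(-1, 1, kernel_size).unsqueeze(-1)   # (k, 1)
        w = self.mlp(pos).view(kernel_size, self.out_ch, self.in_ch)
        w = w.permute(1, 2, 0)                                   # (out, in, k)
        return nn.functional.conv1d(x, w, padding=kernel_size // 2)

net = ContinuousKernel1d(in_ch=4, out_ch=8)
x = torch.randn(2, 4, 128)
y_small = net(x, kernel_size=9)    # same weights, small receptive field
y_large = net(x, kernel_size=65)   # same weights, long-range receptive field
```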
arXiv Detail & Related papers (2022-06-07T15:48:02Z)
- Dynamic Split Computing for Efficient Deep Edge Intelligence [78.4233915447056]
We introduce dynamic split computing, where the optimal split location is dynamically selected based on the state of the communication channel.
We show that dynamic split computing achieves faster inference in edge computing environments where the data rate and server load vary over time.
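A toy sketch of dynamic split selection (the candidate table and cost model are illustrative numbers, not the paper's): for the current uplink rate, pick the split point minimizing edge compute plus transmission plus server compute time.

```python
# Assumed per-split profile: edge compute, payload size, server compute.
CANDIDATE_SPLITS = [
    {"layer": 0, "edge_ms": 0.0, "kbits": 1176.0, "server_ms": 6.0},  # raw input
    {"layer": 4, "edge_ms": 3.0, "kbits": 256.0,  "server_ms": 4.0},
    {"layer": 8, "edge_ms": 7.0, "kbits": 64.0,   "server_ms": 2.0},
]

def best_split(uplink_kbps: float, server_load_factor: float = 1.0) -> dict:
    def total_ms(s):
        tx_ms = s["kbits"] / uplink_kbps * 1000.0
        return s["edge_ms"] + tx_ms + s["server_ms"] * server_load_factor
    return min(CANDIDATE_SPLITS, key=total_ms)

print(best_split(uplink_kbps=2_000)["layer"])    # slow link -> 8 (split late)
print(best_split(uplink_kbps=200_000)["layer"])  # fast link -> 4 (split earlier)
```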
arXiv Detail & Related papers (2022-05-23T12:35:18Z)
- Synergy: Resource Sensitive DNN Scheduling in Multi-Tenant Clusters [10.38396444951436]
Training Deep Neural Networks (DNNs) is a widely popular workload in both enterprises and cloud data centers.
We propose Synergy, a resource-sensitive scheduler for shared GPU clusters.
Our experiments show that workload-aware CPU and memory allocations can improve average job completion time (JCT) by up to 3.4x compared to traditional GPU-proportional scheduling.
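A toy contrast between GPU-proportional and workload-aware allocation in the spirit of Synergy (the demand numbers and the leftover-splitting policy are our assumptions, not the paper's algorithm):

```python
# Assumed illustration: allocate CPU/memory by profiled per-job demand
# instead of proportionally to each job's GPU share.
CPU_CAP, MEM_CAP = 24, 192  # cores and GB on one server (assumed)

jobs = [
    # profiled best-case demands: (gpus, cpu cores per GPU, GB per GPU)
    {"name": "vision", "gpus": 2, "cpu_per_gpu": 8, "mem_per_gpu": 48},  # CPU-hungry
    {"name": "nlp",    "gpus": 2, "cpu_per_gpu": 2, "mem_per_gpu": 24},  # CPU-light
]

def gpu_proportional(jobs):
    total_gpus = sum(j["gpus"] for j in jobs)
    return {j["name"]: (CPU_CAP * j["gpus"] / total_gpus,
                        MEM_CAP * j["gpus"] / total_gpus) for j in jobs}

def workload_aware(jobs):
    # Give each job its profiled demand, then split any leftover evenly.
    cpu_left = CPU_CAP - sum(j["gpus"] * j["cpu_per_gpu"] for j in jobs)
    mem_left = MEM_CAP - sum(j["gpus"] * j["mem_per_gpu"] for j in jobs)
    return {j["name"]: (j["gpus"] * j["cpu_per_gpu"] + cpu_left / len(jobs),
                        j["gpus"] * j["mem_per_gpu"] + mem_left / len(jobs))
            for j in jobs}

print(gpu_proportional(jobs))  # vision gets 12 cores but needs 16: starved
print(workload_aware(jobs))    # vision gets the CPU it actually needs
```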
arXiv Detail & Related papers (2021-10-12T15:25:54Z)
- Hybrid Models for Learning to Branch [81.93868699246214]
We propose a new hybrid architecture for efficient branching on CPU machines.
The proposed architecture combines the expressive power of GNNs with computationally inexpensive multi-layer perceptrons (MLP) for branching.
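A rough sketch of the hybrid idea (module sizes and the way the root embedding is reused are assumptions on our part): an expressive GNN runs once at the root of the branch-and-bound tree, and a cheap MLP reuses its variable embeddings to score branching candidates at every subsequent node.

```python
# Assumed illustration of GNN-once, MLP-everywhere branching.
import torch
import torch.nn as nn

EMB, NODE_FEATS = 64, 8

gnn_root = nn.Sequential(            # stand-in for a bipartite GNN
    nn.Linear(16, EMB), nn.ReLU(), nn.Linear(EMB, EMB))
mlp_node = nn.Sequential(            # cheap per-node scorer
    nn.Linear(EMB + NODE_FEATS, 32), nn.ReLU(), nn.Linear(32, 1))

var_feats = torch.randn(100, 16)     # static features of 100 variables
root_emb = gnn_root(var_feats)       # expensive, computed once per instance

def branch_scores(node_feats: torch.Tensor) -> torch.Tensor:
    # node_feats: (100, NODE_FEATS) dynamic per-node variable features
    return mlp_node(torch.cat([root_emb, node_feats], dim=-1)).squeeze(-1)

scores = branch_scores(torch.randn(100, NODE_FEATS))
branch_var = int(scores.argmax())    # variable chosen for branching
```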
arXiv Detail & Related papers (2020-06-26T21:03:45Z)
- PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning [57.20262984116752]
We introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in design space.
With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to re-gain and guarantee high hardware efficiency.
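A minimal sketch of pattern-based kernel pruning (the four-entry pattern library below is made up for illustration, not PatDNN's actual patterns): each 3x3 kernel is projected onto its best-matching pattern, keeping pruning fine-grained yet regular enough for a compiler to exploit.

```python
# Assumed illustration: project every 3x3 kernel onto a small pattern library.
import torch

# A tiny library of 3x3 masks, each keeping 4 of 9 weights.
PATTERNS = torch.tensor([
    [[0, 1, 0], [1, 1, 1], [0, 0, 0]],
    [[0, 0, 0], [1, 1, 1], [0, 1, 0]],
    [[0, 1, 0], [0, 1, 1], [0, 1, 0]],
    [[0, 1, 0], [1, 1, 0], [0, 1, 0]],
], dtype=torch.float32)                      # shape (4, 3, 3)

def apply_patterns(weight: torch.Tensor) -> torch.Tensor:
    # weight: (out_ch, in_ch, 3, 3). For each kernel, pick the pattern that
    # preserves the most L1 magnitude, then zero everything outside it.
    mag = weight.abs().unsqueeze(2)            # (out, in, 1, 3, 3)
    kept = (mag * PATTERNS).sum(dim=(-1, -2))  # (out, in, 4)
    best = kept.argmax(dim=-1)                 # (out, in)
    mask = PATTERNS[best]                      # (out, in, 3, 3)
    return weight * mask

w = torch.randn(8, 3, 3, 3)
w_pruned = apply_patterns(w)
assert (w_pruned != 0).sum(dim=(-1, -2)).max() <= 4  # at most 4 weights/kernel
```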
arXiv Detail & Related papers (2020-01-01T04:52:07Z)