Miriam: Exploiting Elastic Kernels for Real-time Multi-DNN Inference on
Edge GPU
- URL: http://arxiv.org/abs/2307.04339v1
- Date: Mon, 10 Jul 2023 04:30:44 GMT
- Title: Miriam: Exploiting Elastic Kernels for Real-time Multi-DNN Inference on
Edge GPU
- Authors: Zhihe Zhao, Neiwen Ling, Nan Guan, Guoliang Xing
- Abstract summary: Miriam is a contention-aware task coordination framework for
multi-DNN inference on edge GPU, supporting the concurrent running of multiple
deep neural networks (DNNs) with mixed criticality.
- Score: 7.972518585452826
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many applications, such as autonomous driving and augmented reality,
require the concurrent running of multiple deep neural networks (DNNs) that
pose different levels of real-time performance requirements. However,
coordinating multiple DNN tasks with varying levels of criticality on edge
GPUs remains an area of limited study. Unlike server-level GPUs, edge GPUs are
resource-limited and lack hardware-level resource management mechanisms for
avoiding resource contention. Therefore, we propose Miriam, a contention-aware
task coordination framework for multi-DNN inference on edge GPU. Miriam
consolidates two main components, an elastic-kernel generator and a runtime
dynamic kernel coordinator, to support mixed-criticality DNN inference. To
evaluate Miriam, we build a new DNN inference benchmark based on CUDA with
diverse representative DNN workloads. Experiments on two edge GPU platforms
show that Miriam can increase system throughput by 92% while incurring less
than 10% latency overhead for critical tasks, compared to state-of-the-art
baselines.
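The elastic-kernel generator and kernel coordinator are internal to Miriam and not detailed in this summary. As a rough illustration of the contention problem it targets, the following PyTorch sketch (our assumption, not Miriam's implementation) runs a latency-critical model and a best-effort model concurrently on one GPU using CUDA stream priorities; stream priorities alone give only coarse control, since best-effort kernels already resident on the SMs can still delay critical kernels.

```python
# Illustrative sketch (not Miriam's implementation): a critical and a
# best-effort DNN run concurrently on one edge GPU via CUDA stream
# priorities. Without kernel-level coordination, best-effort kernels can
# still occupy SMs and delay the critical task -- the contention Miriam
# targets with elastic kernels.
import torch
import torchvision.models as models

device = torch.device("cuda")
critical_net = models.resnet18().to(device).eval()    # e.g. obstacle detection
besteffort_net = models.resnet50().to(device).eval()  # e.g. scene labeling

# Lower priority value = higher scheduling priority on CUDA devices.
critical_stream = torch.cuda.Stream(priority=-1)
besteffort_stream = torch.cuda.Stream(priority=0)

x_crit = torch.randn(1, 3, 224, 224, device=device)
x_be = torch.randn(8, 3, 224, 224, device=device)

with torch.no_grad():
    with torch.cuda.stream(besteffort_stream):
        y_be = besteffort_net(x_be)    # long-running, non-critical kernels
    with torch.cuda.stream(critical_stream):
        y_crit = critical_net(x_crit)  # latency-critical kernels
torch.cuda.synchronize()
```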
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
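A hedged sketch of adaptive gradient compression for decentralized training (the top-k scheme and the bandwidth-to-ratio rule are our assumptions, not necessarily FusionLLM's method): slower links transmit a sparser gradient.

```python
# Assumed illustration of adaptive compression: keep a fraction of gradient
# entries that scales with the measured link bandwidth.
import torch

def topk_compress(grad: torch.Tensor, ratio: float):
    k = max(1, int(grad.numel() * ratio))
    flat = grad.flatten()
    _, idx = flat.abs().topk(k)    # indices of the k largest-magnitude entries
    return idx, flat[idx]          # what actually gets transmitted

def adaptive_ratio(bandwidth_mbps: float, lo=0.01, hi=0.3) -> float:
    # Scale the kept fraction with link speed, clamped to [lo, hi].
    return min(hi, max(lo, bandwidth_mbps / 1000.0))

grad = torch.randn(1_000_000)
idx, vals = topk_compress(grad, adaptive_ratio(bandwidth_mbps=50.0))
print(len(vals))  # 50 Mbps -> keep 5% of gradient entries
```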
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- MTL-Split: Multi-Task Learning for Edge Devices using Split Computing [11.357748232689628]
In Split Computing (SC), a Deep Neural Network (DNN) is intelligently split, with part of it deployed on an edge device and the rest on a remote server.
This paper studies this problem, and MTL-Split, the proposed architecture, shows encouraging results on both synthetic and real-world data.
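A minimal sketch of the split-computing setup MTL-Split builds on (the layer stack, split index, and task heads below are illustrative assumptions, not the paper's architecture): the first k layers run on the edge device, and a shared trunk with two task-specific heads runs on the server.

```python
# Split computing with multi-task heads, as an assumed simplification.
import torch
import torch.nn as nn

layers = [nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
          nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
          nn.AdaptiveAvgPool2d(1), nn.Flatten()]
k = 2  # split point: layers[:k] on the edge, layers[k:] on the server
edge_part = nn.Sequential(*layers[:k])
server_trunk = nn.Sequential(*layers[k:])
head_a = nn.Linear(32, 10)   # task A, e.g. classification
head_b = nn.Linear(32, 4)    # task B, e.g. bounding-box regression

x = torch.randn(1, 3, 64, 64)
z = edge_part(x)              # computed on the edge; z is what gets transmitted
feat = server_trunk(z)        # computed on the server
out_a, out_b = head_a(feat), head_b(feat)
```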
arXiv Detail & Related papers (2024-07-08T14:25:39Z)
- SGPRS: Seamless GPU Partitioning Real-Time Scheduler for Periodic Deep Learning Workloads [0.9898607871253774]
We propose SGPRS, the first real-time GPU scheduler to account for zero-configuration partition switching.
The proposed scheduler not only meets more deadlines for parallel tasks but also sustains overall performance beyond the pivot point.
arXiv Detail & Related papers (2024-04-13T18:29:26Z)
- Dynamic DNNs and Runtime Management for Efficient Inference on Mobile/Embedded Devices [2.8851756275902476]
Deep neural network (DNN) inference is increasingly being executed on mobile and embedded platforms.
We co-designed novel Dynamic Super-Networks to maximise system-level performance and energy efficiency.
Compared with the state of the art, experiments using ImageNet on the GPU of a Jetson Xavier NX show our model is 2.4x faster at similar ImageNet Top-1 accuracy, or 5.1% more accurate at similar latency.
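A small sketch of the runtime-management idea (the sub-network accuracy/latency table below is fabricated for illustration): a super-network exposes sub-networks trading accuracy for latency, and the runtime picks the most accurate one that fits the current budget.

```python
# Assumed sub-network table for one dynamic super-network on one device.
SUBNETS = [
    {"width": 0.25, "top1": 61.2, "latency_ms": 4.1},
    {"width": 0.50, "top1": 68.4, "latency_ms": 7.9},
    {"width": 1.00, "top1": 74.6, "latency_ms": 18.3},
]

def select_subnet(budget_ms: float) -> dict:
    feasible = [s for s in SUBNETS if s["latency_ms"] <= budget_ms]
    if not feasible:
        return min(SUBNETS, key=lambda s: s["latency_ms"])  # degrade gracefully
    return max(feasible, key=lambda s: s["top1"])           # most accurate fit

print(select_subnet(budget_ms=10.0)["width"])  # -> 0.5
```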
arXiv Detail & Related papers (2024-01-17T04:40:30Z)
- Sparse-DySta: Sparsity-Aware Dynamic and Static Scheduling for Sparse Multi-DNN Workloads [65.47816359465155]
Running multiple deep neural networks (DNNs) in parallel has become an emerging workload in both edge devices and data centers.
We propose Dysta, a novel scheduler that utilizes both static sparsity patterns and dynamic sparsity information for the sparse multi-DNN scheduling.
Our proposed approach outperforms the state-of-the-art methods with up to 10% decrease in latency constraint violation rate and nearly 4X reduction in average normalized turnaround time.
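A toy sketch of sparsity-aware scheduling in the spirit of Dysta (the Job fields, latency model, and least-slack rule are our assumptions, not the paper's algorithm): latency predictions combine static, profile-time sparsity with the sparsity measured on the current input, and the predictions drive job selection.

```python
# Assumed illustration: sparsity-aware latency prediction + least-slack-first.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    dense_latency_ms: float   # profiled latency at zero sparsity
    static_sparsity: float    # weight sparsity known before deployment
    dynamic_sparsity: float   # activation sparsity observed at runtime
    deadline_ms: float        # deadline relative to now

def predicted_latency(job: Job) -> float:
    # Toy model: latency shrinks with both sparsity sources.
    return (job.dense_latency_ms
            * (1 - job.static_sparsity) * (1 - job.dynamic_sparsity))

def pick_next(jobs: list[Job]) -> Job:
    # Run the job with the least slack under the sparsity-aware prediction.
    return min(jobs, key=lambda j: j.deadline_ms - predicted_latency(j))

jobs = [Job("detector", 12.0, 0.5, 0.3, deadline_ms=15.0),
        Job("segmenter", 30.0, 0.7, 0.6, deadline_ms=40.0)]
print(pick_next(jobs).name)  # -> "detector"
```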
arXiv Detail & Related papers (2023-10-17T09:25:17Z)
- An efficient and flexible inference system for serving heterogeneous ensembles of deep neural networks [0.0]
Ensembles of Deep Neural Networks (DNNs) achieve high-quality predictions, but they are compute- and memory-intensive.
We propose a new software layer to serve ensembles of DNNs with flexibility and efficiency.
arXiv Detail & Related papers (2022-08-30T08:05:43Z)
- Towards a General Purpose CNN for Long Range Dependencies in $\mathrm{N}$D [49.57261544331683]
We propose a single CNN architecture equipped with continuous convolutional kernels for tasks on arbitrary resolution, dimensionality and length without structural changes.
We show the generality of our approach by applying the same CCNN to a wide set of tasks on sequential ($1\mathrm{D}$) and visual data ($2\mathrm{D}$).
Our CCNN performs competitively and often outperforms the current state-of-the-art across all tasks considered.
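A simplified sketch of a continuous convolutional kernel (our reading of the idea; the MLP parameterization below is an assumption): kernel values are generated by a small network from continuous coordinates, so the same weights can be sampled at any kernel size.

```python
# Assumed illustration: an MLP maps a kernel coordinate to kernel values,
# giving one parameterization usable at arbitrary resolution.
import torch
import torch.nn as nn

class ContinuousKernel1d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, hidden: int = 32):
        super().__init__()
        self.in_ch, self.out_ch = in_ch, out_ch
        self.mlp = nn.Sequential(nn.Linear(1, hidden), nn.GELU(),
                                 nn.Linear(hidden, in_ch * out_ch))

    def forward(self, x: torch.Tensor, kernel_size: int) -> torch.Tensor:
        # Sample the kernel on a normalized grid of the requested size.
        pos = torch.linspace(-1, 1, kernel_size).unsqueeze(-1)   # (k, 1)
        w = self.mlp(pos).view(kernel_size, self.out_ch, self.in_ch)
        w = w.permute(1, 2, 0)                                   # (out, in, k)
        return nn.functional.conv1d(x, w, padding=kernel_size // 2)

net = ContinuousKernel1d(in_ch=4, out_ch=8)
x = torch.randn(2, 4, 128)
y_small = net(x, kernel_size=9)    # same weights, small receptive field
y_large = net(x, kernel_size=65)   # same weights, long-range receptive field
```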
arXiv Detail & Related papers (2022-06-07T15:48:02Z)
- Dynamic Split Computing for Efficient Deep Edge Intelligence [78.4233915447056]
We introduce dynamic split computing, where the optimal split location is dynamically selected based on the state of the communication channel.
We show that dynamic split computing achieves faster inference in edge computing environments where the data rate and server load vary over time.
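A toy sketch of dynamic split selection (the candidate table and cost model are illustrative numbers, not the paper's): for the current uplink rate, pick the split point minimizing edge compute plus transmission plus server compute time.

```python
# Assumed per-split profile: edge compute, payload size, server compute.
CANDIDATE_SPLITS = [
    {"layer": 0, "edge_ms": 0.0, "kbits": 1176.0, "server_ms": 6.0},  # raw input
    {"layer": 4, "edge_ms": 3.0, "kbits": 256.0,  "server_ms": 4.0},
    {"layer": 8, "edge_ms": 7.0, "kbits": 64.0,   "server_ms": 2.0},
]

def best_split(uplink_kbps: float, server_load_factor: float = 1.0) -> dict:
    def total_ms(s):
        tx_ms = s["kbits"] / uplink_kbps * 1000.0
        return s["edge_ms"] + tx_ms + s["server_ms"] * server_load_factor
    return min(CANDIDATE_SPLITS, key=total_ms)

print(best_split(uplink_kbps=2_000)["layer"])    # slow link -> 8 (split late)
print(best_split(uplink_kbps=200_000)["layer"])  # fast link -> 4 (split earlier)
```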
arXiv Detail & Related papers (2022-05-23T12:35:18Z)
- Synergy: Resource Sensitive DNN Scheduling in Multi-Tenant Clusters [10.38396444951436]
Training Deep Neural Networks (DNNs) is a widely popular workload in both enterprises and cloud data centers.
We propose Synergy, a resource-sensitive scheduler for shared GPU clusters.
Our experiments show that workload-aware CPU and memory allocations can improve average job completion time (JCT) by up to 3.4x compared to traditional GPU-proportional scheduling.
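A toy contrast between GPU-proportional and workload-aware allocation in the spirit of Synergy (the demand numbers and the leftover-splitting policy are our assumptions, not the paper's algorithm):

```python
# Assumed illustration: allocate CPU/memory by profiled per-job demand
# instead of proportionally to each job's GPU share.
CPU_CAP, MEM_CAP = 24, 192  # cores and GB on one server (assumed)

jobs = [
    # profiled best-case demands: (gpus, cpu cores per GPU, GB per GPU)
    {"name": "vision", "gpus": 2, "cpu_per_gpu": 8, "mem_per_gpu": 48},  # CPU-hungry
    {"name": "nlp",    "gpus": 2, "cpu_per_gpu": 2, "mem_per_gpu": 24},  # CPU-light
]

def gpu_proportional(jobs):
    total_gpus = sum(j["gpus"] for j in jobs)
    return {j["name"]: (CPU_CAP * j["gpus"] / total_gpus,
                        MEM_CAP * j["gpus"] / total_gpus) for j in jobs}

def workload_aware(jobs):
    # Give each job its profiled demand, then split any leftover evenly.
    cpu_left = CPU_CAP - sum(j["gpus"] * j["cpu_per_gpu"] for j in jobs)
    mem_left = MEM_CAP - sum(j["gpus"] * j["mem_per_gpu"] for j in jobs)
    return {j["name"]: (j["gpus"] * j["cpu_per_gpu"] + cpu_left / len(jobs),
                        j["gpus"] * j["mem_per_gpu"] + mem_left / len(jobs))
            for j in jobs}

print(gpu_proportional(jobs))  # vision gets 12 cores but needs 16: starved
print(workload_aware(jobs))    # vision gets the CPU it actually needs
```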
arXiv Detail & Related papers (2021-10-12T15:25:54Z)
- Hybrid Models for Learning to Branch [81.93868699246214]
We propose a new hybrid architecture for efficient branching on CPU machines.
The proposed architecture combines the expressive power of GNNs with computationally inexpensive multi-layer perceptrons (MLP) for branching.
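A rough sketch of the hybrid idea (module sizes and the way the root embedding is reused are assumptions on our part): an expressive GNN runs once at the root of the branch-and-bound tree, and a cheap MLP reuses its variable embeddings to score branching candidates at every subsequent node.

```python
# Assumed illustration of GNN-once, MLP-everywhere branching.
import torch
import torch.nn as nn

EMB, NODE_FEATS = 64, 8

gnn_root = nn.Sequential(            # stand-in for a bipartite GNN
    nn.Linear(16, EMB), nn.ReLU(), nn.Linear(EMB, EMB))
mlp_node = nn.Sequential(            # cheap per-node scorer
    nn.Linear(EMB + NODE_FEATS, 32), nn.ReLU(), nn.Linear(32, 1))

var_feats = torch.randn(100, 16)     # static features of 100 variables
root_emb = gnn_root(var_feats)       # expensive, computed once per instance

def branch_scores(node_feats: torch.Tensor) -> torch.Tensor:
    # node_feats: (100, NODE_FEATS) dynamic per-node variable features
    return mlp_node(torch.cat([root_emb, node_feats], dim=-1)).squeeze(-1)

scores = branch_scores(torch.randn(100, NODE_FEATS))
branch_var = int(scores.argmax())    # variable chosen for branching
```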
arXiv Detail & Related papers (2020-06-26T21:03:45Z)
- PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning [57.20262984116752]
We introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in design space.
With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to re-gain and guarantee high hardware efficiency.
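A minimal sketch of pattern-based kernel pruning (the four-entry pattern library below is made up for illustration, not PatDNN's actual patterns): each 3x3 kernel is projected onto its best-matching pattern, keeping pruning fine-grained yet regular enough for a compiler to exploit.

```python
# Assumed illustration: project every 3x3 kernel onto a small pattern library.
import torch

# A tiny library of 3x3 masks, each keeping 4 of 9 weights.
PATTERNS = torch.tensor([
    [[0, 1, 0], [1, 1, 1], [0, 0, 0]],
    [[0, 0, 0], [1, 1, 1], [0, 1, 0]],
    [[0, 1, 0], [0, 1, 1], [0, 1, 0]],
    [[0, 1, 0], [1, 1, 0], [0, 1, 0]],
], dtype=torch.float32)                      # shape (4, 3, 3)

def apply_patterns(weight: torch.Tensor) -> torch.Tensor:
    # weight: (out_ch, in_ch, 3, 3). For each kernel, pick the pattern that
    # preserves the most L1 magnitude, then zero everything outside it.
    mag = weight.abs().unsqueeze(2)            # (out, in, 1, 3, 3)
    kept = (mag * PATTERNS).sum(dim=(-1, -2))  # (out, in, 4)
    best = kept.argmax(dim=-1)                 # (out, in)
    mask = PATTERNS[best]                      # (out, in, 3, 3)
    return weight * mask

w = torch.randn(8, 3, 3, 3)
w_pruned = apply_patterns(w)
assert (w_pruned != 0).sum(dim=(-1, -2)).max() <= 4  # at most 4 weights/kernel
```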
arXiv Detail & Related papers (2020-01-01T04:52:07Z)