MoCA: Memory-Centric, Adaptive Execution for Multi-Tenant Deep Neural
Networks
- URL: http://arxiv.org/abs/2305.05843v1
- Date: Wed, 10 May 2023 02:24:50 GMT
- Title: MoCA: Memory-Centric, Adaptive Execution for Multi-Tenant Deep Neural
Networks
- Authors: Seah Kim, Hasan Genc, Vadim Vadimovich Nikiforov, Krste Asanović,
Borivoje Nikolić, Yakun Sophia Shao
- Abstract summary: MoCA is an adaptive multi-tenancy system for deep neural network (DNN) accelerators.
It dynamically manages the shared memory resources of co-located applications to meet their QoS targets.
We demonstrate that MoCA improves the service level agreement (SLA) satisfaction rate by up to 3.9x (1.8x average), system throughput by 2.3x (1.7x average), and fairness by 1.3x (1.2x average) compared to prior work.
- Score: 3.8537852783718627
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Driven by the wide adoption of deep neural networks (DNNs) across different
application domains, multi-tenancy execution, where multiple DNNs are deployed
simultaneously on the same hardware, has been proposed to satisfy the latency
requirements of different applications while improving the overall system
utilization. However, multi-tenancy execution could lead to undesired
system-level resource contention, causing quality-of-service (QoS) degradation
for latency-critical applications. To address this challenge, we propose MoCA,
an adaptive multi-tenancy system for DNN accelerators. Unlike existing
solutions that focus on compute resource partition, MoCA dynamically manages
shared memory resources of co-located applications to meet their QoS targets.
Specifically, MoCA leverages the regularities in both DNN operators and
accelerators to dynamically modulate memory access rates based on their latency
targets and user-defined priorities so that co-located applications get the
resources they demand without significantly starving their co-runners. We
demonstrate that MoCA improves the satisfaction rate of the service level
agreement (SLA) by up to 3.9x (1.8x average), system throughput by 2.3x (1.7x
average), and fairness by 1.3x (1.2x average), compared to prior work.
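The memory-access-rate modulation can be pictured with a small, purely illustrative sketch. The controller below is a hypothetical stand-in, not MoCA's actual hardware or runtime interface: it splits a shared bandwidth budget across co-located tenants in proportion to their user-defined priorities and how far each one currently is from its latency target, so a violating tenant gets more bandwidth without fully starving its co-runners.

```python
# Hypothetical sketch only: names, fields, and the weighting rule are assumptions,
# not MoCA's implementation.
from dataclasses import dataclass

@dataclass
class Tenant:
    name: str
    priority: float            # user-defined priority (higher = more important)
    latency_target_ms: float
    observed_latency_ms: float

def memory_rate_budgets(tenants, total_bandwidth_gbps):
    """Split a shared memory-bandwidth budget across co-located tenants.

    A tenant that is missing its latency target gets a share boosted in
    proportion to its priority and its current slowdown, while the others
    keep a nonzero share so co-runners are not starved outright.
    """
    def slowdown(t):
        # Slowdown > 1 means the tenant is currently violating its target.
        return max(t.observed_latency_ms / t.latency_target_ms, 1.0)

    weights = {t.name: t.priority * slowdown(t) for t in tenants}
    total = sum(weights.values())
    return {name: total_bandwidth_gbps * w / total for name, w in weights.items()}

if __name__ == "__main__":
    tenants = [
        Tenant("latency-critical", priority=2.0, latency_target_ms=10.0, observed_latency_ms=14.0),
        Tenant("best-effort", priority=1.0, latency_target_ms=50.0, observed_latency_ms=30.0),
    ]
    print(memory_rate_budgets(tenants, total_bandwidth_gbps=64.0))
```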
Related papers
- Resource-Efficient Sensor Fusion via System-Wide Dynamic Gated Neural Networks [16.0018681576301]
We propose a novel algorithmic strategy called Quantile-constrained Inference (QIC).
QIC makes joint, high-quality, swift decisions on all the above aspects of the system.
Our results confirm that QIC matches the optimum and outperforms its alternatives by over 80%.
arXiv Detail & Related papers (2024-10-22T06:12:04Z) - FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z) - DNN Partitioning, Task Offloading, and Resource Allocation in Dynamic Vehicular Networks: A Lyapunov-Guided Diffusion-Based Reinforcement Learning Approach [49.56404236394601]
We formulate the problem of joint DNN partitioning, task offloading, and resource allocation in Vehicular Edge Computing.
Our objective is to minimize the DNN-based task completion time while guaranteeing the system stability over time.
We propose a Multi-Agent Diffusion-based Deep Reinforcement Learning (MAD2RL) algorithm, incorporating the innovative use of diffusion models.
arXiv Detail & Related papers (2024-06-11T06:31:03Z) - Deep Reinforcement Learning based Online Scheduling Policy for Deep Neural Network Multi-Tenant Multi-Accelerator Systems [1.7724466261976437]
This paper presents RELMAS, a low-overhead deep reinforcement learning algorithm designed for the online scheduling of DNNs in multi-tenant environments.
The application of RELMAS to a heterogeneous multi-accelerator system resulted in up to a 173% improvement in SLA satisfaction rate.
arXiv Detail & Related papers (2024-04-13T10:13:07Z) - Sparse-DySta: Sparsity-Aware Dynamic and Static Scheduling for Sparse
Multi-DNN Workloads [65.47816359465155]
Running multiple deep neural networks (DNNs) in parallel has become an emerging workload on edge devices.
We propose Dysta, a novel scheduler that utilizes both static sparsity patterns and dynamic sparsity information for the sparse multi-DNN scheduling.
Our proposed approach outperforms the state-of-the-art methods with up to 10% decrease in latency constraint violation rate and nearly 4X reduction in average normalized turnaround time.
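As a rough illustration of combining static and dynamic sparsity information for scheduling, the sketch below uses an invented cost model and field names (not Dysta's): a per-layer static sparsity profile gives an initial latency estimate, run-time sparsity refines it, and the job with the least slack to its deadline is dispatched first.

```python
# Illustrative sketch only; the linear density-based cost model is an assumption.
def estimate_latency_ms(dense_latency_ms, static_sparsity, dynamic_sparsity=None):
    # Assume latency scales roughly with the density of non-zero work.
    sparsity = dynamic_sparsity if dynamic_sparsity is not None else static_sparsity
    return dense_latency_ms * (1.0 - sparsity)

def pick_next(jobs, now_ms):
    """jobs: dicts with deadline_ms, dense_latency_ms, static/dynamic sparsity."""
    def slack(job):
        remaining = estimate_latency_ms(job["dense_latency_ms"],
                                        job["static_sparsity"],
                                        job.get("dynamic_sparsity"))
        return job["deadline_ms"] - now_ms - remaining
    return min(jobs, key=slack)   # least slack = most urgent

jobs = [
    {"name": "detector", "deadline_ms": 40, "dense_latency_ms": 30, "static_sparsity": 0.5},
    {"name": "asr", "deadline_ms": 25, "dense_latency_ms": 20,
     "static_sparsity": 0.3, "dynamic_sparsity": 0.1},
]
print(pick_next(jobs, now_ms=0)["name"])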
arXiv Detail & Related papers (2023-10-17T09:25:17Z) - A Multi-Head Ensemble Multi-Task Learning Approach for Dynamical
Computation Offloading [62.34538208323411]
We propose a multi-head ensemble multi-task learning (MEMTL) approach with a shared backbone and multiple prediction heads (PHs).
MEMTL outperforms benchmark methods in both inference accuracy and mean square error without requiring additional training data.
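The shared-backbone, multi-head structure described above can be sketched as follows; the layer sizes, head count, and averaging ensemble rule are illustrative assumptions rather than the paper's configuration.

```python
# Sketch of a shared backbone with multiple prediction heads (illustrative only).
import torch
import torch.nn as nn

class MultiHeadEnsemble(nn.Module):
    def __init__(self, in_dim=16, hidden=64, num_heads=3, out_dim=4):
        super().__init__()
        # Shared backbone extracts a common representation of the system state.
        self.backbone = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Multiple prediction heads form an ensemble of offloading decisions.
        self.heads = nn.ModuleList(nn.Linear(hidden, out_dim) for _ in range(num_heads))

    def forward(self, x):
        z = self.backbone(x)
        return torch.stack([head(z) for head in self.heads], dim=1)  # (batch, heads, out)

model = MultiHeadEnsemble()
scores = model(torch.randn(2, 16))
# A simple ensemble rule: average the heads, then take the best-scoring action.
decision = scores.mean(dim=1).argmax(dim=-1)
print(decision.shape)  # torch.Size([2])
```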
arXiv Detail & Related papers (2023-09-02T11:01:16Z) - Shared Memory-contention-aware Concurrent DNN Execution for Diversely
Heterogeneous System-on-Chips [0.32634122554914]
HaX-CoNN is a novel scheme that characterizes and maps layers in concurrently executing inference workloads.
We evaluate HaX-CoNN on NVIDIA Orin, NVIDIA Xavier, and Qualcomm Snapdragon 865 SoCs.
arXiv Detail & Related papers (2023-08-10T22:47:40Z) - Adaptive DNN Surgery for Selfish Inference Acceleration with On-demand
Edge Resource [25.274288063300844]
Deep Neural Networks (DNNs) have significantly improved the accuracy of intelligent applications on mobile devices.
DNN surgery can enable real-time inference despite the computational limitations of mobile devices.
This paper introduces a novel Decentralized DNN Surgery (DDS) framework.
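A minimal sketch of the split-point decision at the heart of DNN surgery is given below; the profiling numbers and the simple additive cost model are assumptions, and input-transfer cost is ignored for brevity.

```python
# Illustrative sketch: choose where to split a network between device and edge
# so that estimated end-to-end latency is minimized. All numbers are made up.
def best_split(device_ms, edge_ms, out_bytes, uplink_bytes_per_ms):
    """device_ms[i]/edge_ms[i]: per-layer latency on device/edge;
    out_bytes[i]: activation size produced by layer i (sent if we split after i)."""
    n = len(device_ms)
    best, best_latency = 0, float("inf")
    for split in range(n + 1):  # split == 0: run everything on the edge
        on_device = sum(device_ms[:split])
        transfer = (out_bytes[split - 1] / uplink_bytes_per_ms) if split > 0 else 0.0
        on_edge = sum(edge_ms[split:])
        latency = on_device + transfer + on_edge
        if latency < best_latency:
            best, best_latency = split, latency
    return best, best_latency

split, latency = best_split(
    device_ms=[5, 8, 12, 20], edge_ms=[1, 2, 3, 5],
    out_bytes=[400_000, 200_000, 100_000, 4_000], uplink_bytes_per_ms=50_000,
)
print(f"split after layer {split}, estimated {latency:.1f} ms")
```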
arXiv Detail & Related papers (2023-06-21T11:32:28Z) - Fluid Batching: Exit-Aware Preemptive Serving of Early-Exit Neural
Networks on Edge NPUs [74.83613252825754]
"smart ecosystems" are being formed where sensing happens concurrently rather than standalone.
This is shifting the on-device inference paradigm towards deploying neural processing units (NPUs) at the edge.
We propose a novel early-exit scheduling scheme that allows preemption at run time to account for the dynamicity introduced by the arrival and exiting processes.
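The interaction between early exits and preemption can be pictured with the small sketch below; the function names and the confidence-based exit rule are assumptions, not the paper's NPU scheduler.

```python
# Illustrative sketch: a request runs block by block, may return as soon as an
# exit head is confident enough, and may be preempted between blocks.
def run_with_exits(blocks, exit_heads, x, confidence_threshold, should_preempt):
    """blocks: callables transforming x; exit_heads: parallel callables returning
    (prediction, confidence); should_preempt: polled between blocks."""
    pred = None
    for block, exit_head in zip(blocks, exit_heads):
        x = block(x)
        pred, conf = exit_head(x)
        if conf >= confidence_threshold:
            return pred, "early-exit"
        if should_preempt():
            return pred, "preempted"   # a real system would save and resume state
    return pred, "final-exit"

# Toy usage: two identity blocks whose exits grow more confident with depth.
blocks = [lambda x: x, lambda x: x]
exit_heads = [lambda x: (x, 0.6), lambda x: (x, 0.95)]
print(run_with_exits(blocks, exit_heads, x=42, confidence_threshold=0.9,
                     should_preempt=lambda: False))
```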
arXiv Detail & Related papers (2022-09-27T15:04:01Z) - Deep Learning-based Resource Allocation For Device-to-Device
Communication [66.74874646973593]
We propose a framework for the optimization of the resource allocation in multi-channel cellular systems with device-to-device (D2D) communication.
A deep learning (DL) framework is proposed, where the optimal resource allocation strategy for arbitrary channel conditions is approximated by deep neural network (DNN) models.
Our simulation results confirm that near-optimal performance can be attained with low computation time, which underlines the real-time capability of the proposed scheme.
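The idea of approximating the allocation optimizer with a DNN can be sketched as below; the network shape, the softmax decision rule, and the pair/channel counts are illustrative assumptions, not the paper's architecture.

```python
# Illustrative sketch: a small DNN maps measured channel gains of D2D pairs to a
# soft channel-assignment decision, replacing an expensive optimization at run time.
import torch
import torch.nn as nn

num_pairs, num_channels = 4, 3
policy = nn.Sequential(
    nn.Linear(num_pairs * num_channels, 128), nn.ReLU(),
    nn.Linear(128, num_pairs * num_channels),
)

channel_gains = torch.rand(1, num_pairs * num_channels)        # measured per pair/channel
logits = policy(channel_gains).view(1, num_pairs, num_channels)
assignment = logits.softmax(dim=-1).argmax(dim=-1)             # channel chosen per D2D pair
print(assignment)
```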
arXiv Detail & Related papers (2020-11-25T14:19:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.