MoCA: Memory-Centric, Adaptive Execution for Multi-Tenant Deep Neural
Networks
- URL: http://arxiv.org/abs/2305.05843v1
- Date: Wed, 10 May 2023 02:24:50 GMT
- Authors: Seah Kim, Hasan Genc, Vadim Vadimovich Nikiforov, Krste Asanović, Borivoje Nikolić, Yakun Sophia Shao
- Abstract summary: MoCA is an adaptive multi-tenancy system for deep neural network (DNN) accelerators.
It dynamically manages the shared memory resources of co-located applications to meet their QoS targets.
We demonstrate that MoCA improves the satisfaction rate of the service level agreement (SLA) up to 3.9x (1.8x average), system throughput by 2.3x (1.7x average), and fairness by 1.3x (1.2x average) compared to prior work.
- Score: 3.8537852783718627
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Driven by the wide adoption of deep neural networks (DNNs) across different
application domains, multi-tenancy execution, where multiple DNNs are deployed
simultaneously on the same hardware, has been proposed to satisfy the latency
requirements of different applications while improving the overall system
utilization. However, multi-tenancy execution could lead to undesired
system-level resource contention, causing quality-of-service (QoS) degradation
for latency-critical applications. To address this challenge, we propose MoCA,
an adaptive multi-tenancy system for DNN accelerators. Unlike existing
solutions that focus on compute resource partition, MoCA dynamically manages
shared memory resources of co-located applications to meet their QoS targets.
Specifically, MoCA leverages the regularities in both DNN operators and
accelerators to dynamically modulate memory access rates based on their latency
targets and user-defined priorities so that co-located applications get the
resources they demand without significantly starving their co-runners. We
demonstrate that MoCA improves the satisfaction rate of the service level
agreement (SLA) up to 3.9x (1.8x average), system throughput by 2.3x (1.7x
average), and fairness by 1.3x (1.2x average), compared to prior work.
Related papers
- DNN Partitioning, Task Offloading, and Resource Allocation in Dynamic Vehicular Networks: A Lyapunov-Guided Diffusion-Based Reinforcement Learning Approach [49.56404236394601]
We formulate the problem of joint DNN partitioning, task offloading, and resource allocation in Vehicular Edge Computing.
Our objective is to minimize the DNN-based task completion time while guaranteeing the system stability over time.
We propose a Multi-Agent Diffusion-based Deep Reinforcement Learning (MAD2RL) algorithm, incorporating the innovative use of diffusion models.
arXiv Detail & Related papers (2024-06-11T06:31:03Z)
- Deep Reinforcement Learning based Online Scheduling Policy for Deep Neural Network Multi-Tenant Multi-Accelerator Systems [1.7724466261976437]
This paper presents RELMAS, a low-overhead deep reinforcement learning algorithm designed for the online scheduling of DNNs in multi-tenant environments.
The application of RELMAS to a heterogeneous multi-accelerator system resulted in up to a 173% improvement in SLA satisfaction rate.
arXiv Detail & Related papers (2024-04-13T10:13:07Z)
- Sparse-DySta: Sparsity-Aware Dynamic and Static Scheduling for Sparse Multi-DNN Workloads [65.47816359465155]
Running multiple deep neural networks (DNNs) in parallel has become an emerging workload on edge devices.
We propose Dysta, a novel scheduler that utilizes both static sparsity patterns and dynamic sparsity information for the sparse multi-DNN scheduling.
Our proposed approach outperforms the state-of-the-art methods with up to 10% decrease in latency constraint violation rate and nearly 4X reduction in average normalized turnaround time.
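The Dysta summary above hinges on combining static sparsity (known at compile time, e.g. pruned weights) with dynamic sparsity (observed at run time, e.g. activation zeros) to estimate latency for scheduling. A hedged sketch of that idea, with illustrative names and a simple least-slack policy that is an assumption rather than Dysta's actual scheduler:

```python
# Hypothetical sketch of sparsity-aware latency estimation for
# multi-DNN scheduling; all names and numbers are illustrative.

def estimate_latency(dense_latency_ms, static_sparsity, dynamic_sparsity):
    """Scale a profiled dense latency by the fraction of work that
    remains after static sparsity (compile-time, e.g. pruned weights)
    and dynamic sparsity (run-time, e.g. activation zeros)."""
    effective_work = (1.0 - static_sparsity) * (1.0 - dynamic_sparsity)
    return dense_latency_ms * effective_work

def pick_next(jobs, now_ms):
    """Least-slack-first: run the job whose deadline minus estimated
    remaining latency is smallest, i.e. the most urgent one."""
    return min(
        jobs,
        key=lambda j: j["deadline_ms"] - now_ms
        - estimate_latency(j["dense_ms"], j["static_sp"], j["dyn_sp"]),
    )
```

The point of the sketch is only that folding sparsity into the latency estimate changes which job looks most urgent; a dense-only estimator would over-provision for sparse layers.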
arXiv Detail & Related papers (2023-10-17T09:25:17Z)
- A Multi-Head Ensemble Multi-Task Learning Approach for Dynamical Computation Offloading [62.34538208323411]
We propose a multi-head ensemble multi-task learning (MEMTL) approach with a shared backbone and multiple prediction heads (PHs).
MEMTL outperforms benchmark methods in both the inference accuracy and mean square error without requiring additional training data.
arXiv Detail & Related papers (2023-09-02T11:01:16Z)
- Shared Memory-contention-aware Concurrent DNN Execution for Diversely Heterogeneous System-on-Chips [0.32634122554914]
HaX-CoNN is a novel scheme that characterizes and maps layers in concurrently executing inference workloads.
We evaluate HaX-CoNN on NVIDIA Orin, NVIDIA Xavier, and Qualcomm Snapdragon 865 SoCs.
arXiv Detail & Related papers (2023-08-10T22:47:40Z)
- Joint Service Caching, Communication and Computing Resource Allocation in Collaborative MEC Systems: A DRL-based Two-timescale Approach [15.16859210403316]
Meeting the strict Quality of Service (QoS) requirements of terminals has imposed a challenge on Multi-access Edge Computing (MEC) systems.
We propose a collaborative framework that facilitates resource sharing between the edge servers.
We show that our proposed algorithm outperforms the baseline algorithms in terms of the average switching and cache cost.
arXiv Detail & Related papers (2023-07-19T00:27:49Z)
- Adaptive DNN Surgery for Selfish Inference Acceleration with On-demand Edge Resource [25.274288063300844]
Deep Neural Networks (DNNs) have significantly improved the accuracy of intelligent applications on mobile devices.
DNN surgery can enable real-time inference despite the computational limitations of mobile devices.
This paper introduces a novel Decentralized DNN Surgery (DDS) framework.
arXiv Detail & Related papers (2023-06-21T11:32:28Z)
- DynaMIX: Resource Optimization for DNN-Based Real-Time Applications on a Multi-Tasking System [20.882393722208608]
More and more deep neural networks (DNNs) have been developed and deployed on autonomous vehicles (AVs).
To meet their growing expectations and requirements, AVs should "optimize" use of their limited onboard computing resources for multiple concurrent in-vehicle apps.
We propose DynaMIX, which optimizes the resource requirements of concurrent apps and aims to maximize execution accuracy.
arXiv Detail & Related papers (2023-02-03T06:33:28Z)
- Fluid Batching: Exit-Aware Preemptive Serving of Early-Exit Neural Networks on Edge NPUs [74.83613252825754]
"Smart ecosystems" are forming in which sensing happens concurrently across devices rather than in isolation.
This is shifting the on-device inference paradigm towards deploying neural processing units (NPUs) at the edge.
We propose a novel early-exit scheduling that allows preemption at run time to account for the dynamicity introduced by the arrival and exiting processes.
arXiv Detail & Related papers (2022-09-27T15:04:01Z)
- Deep Learning-based Resource Allocation For Device-to-Device Communication [66.74874646973593]
We propose a framework for the optimization of the resource allocation in multi-channel cellular systems with device-to-device (D2D) communication.
A deep learning (DL) framework is proposed, where the optimal resource allocation strategy for arbitrary channel conditions is approximated by deep neural network (DNN) models.
Our simulation results confirm that near-optimal performance can be attained with low computation time, which underlines the real-time capability of the proposed scheme.
arXiv Detail & Related papers (2020-11-25T14:19:23Z)
- Resource Allocation via Model-Free Deep Learning in Free Space Optical Communications [119.81868223344173]
The paper investigates the general problem of resource allocation for mitigating channel fading effects in Free Space Optical (FSO) communications.
Under this framework, we propose two algorithms that solve FSO resource allocation problems.
arXiv Detail & Related papers (2020-07-27T17:38:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.