MultiTASC: A Multi-Tenancy-Aware Scheduler for Cascaded DNN Inference at
the Consumer Edge
- URL: http://arxiv.org/abs/2306.12830v1
- Date: Thu, 22 Jun 2023 12:04:49 GMT
- Title: MultiTASC: A Multi-Tenancy-Aware Scheduler for Cascaded DNN Inference at
the Consumer Edge
- Authors: Sokratis Nikolaidis, Stylianos I. Venieris, Iakovos S. Venieris
- Abstract summary: This work presents MultiTASC, a multi-tenancy-aware scheduler that adaptively controls the forwarding decision functions of the devices.
By explicitly considering device heterogeneity, our scheduler improves the latency service-level objective (SLO) satisfaction rate by 20-25 percentage points (pp) over state-of-the-art cascade methods.
- Score: 4.281723404774888
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cascade systems comprise a two-model sequence, with a lightweight model
processing all samples and a heavier, higher-accuracy model conditionally
refining harder samples to improve accuracy. By placing the light model on the
device side and the heavy model on a server, model cascades constitute a widely
used distributed inference approach. With the rapid expansion of intelligent
indoor environments, such as smart homes, the new setting of Multi-Device
Cascade is emerging where multiple and diverse devices are to simultaneously
use a shared heavy model on the same server, typically located within or close
to the consumer environment. This work presents MultiTASC, a
multi-tenancy-aware scheduler that adaptively controls the forwarding decision
functions of the devices in order to maximize the system throughput, while
sustaining high accuracy and low latency. By explicitly considering device
heterogeneity, our scheduler improves the latency service-level objective (SLO)
satisfaction rate by 20-25 percentage points (pp) over state-of-the-art cascade
methods in highly heterogeneous setups, while serving over 40 devices,
showcasing its scalability.
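The cascade mechanism described in the abstract can be illustrated with a toy sketch: each device runs a light model and forwards a sample to the shared heavy model only when its decision function fires, while a scheduler adapts the per-device decision functions to the server's load. This is a minimal sketch under assumed simplifications, not MultiTASC's actual algorithm: the confidence-threshold decision, the load-based adaptation rule, and all names (`forwarding_decision`, `adapt_thresholds`, `capacity`) are illustrative.

```python
import random

def forwarding_decision(confidence, threshold):
    # Forward the sample to the server's heavy model when the light
    # model's confidence falls below the device's threshold.
    return confidence < threshold

def adapt_thresholds(thresholds, forwarded_count, capacity, step=0.05):
    # Toy multi-tenancy control: when the shared heavy model is
    # oversubscribed, lower every device's threshold so fewer samples
    # are forwarded; otherwise raise thresholds to recover accuracy.
    if forwarded_count > capacity:
        return {d: max(0.0, t - step) for d, t in thresholds.items()}
    return {d: min(1.0, t + step) for d, t in thresholds.items()}

def simulate_round(thresholds, rng):
    # One round: each device classifies one sample locally and forwards
    # it when its decision function fires (random stand-in confidence).
    return [d for d, t in thresholds.items()
            if forwarding_decision(rng.random(), t)]

rng = random.Random(0)
# A heterogeneous setup would start each device at its own threshold.
thresholds = {f"device{i}": 0.6 for i in range(8)}
capacity = 3  # heavy-model slots the shared server can serve per round

for _ in range(20):
    forwarded = simulate_round(thresholds, rng)
    thresholds = adapt_thresholds(thresholds, len(forwarded), capacity)
```

In this sketch the threshold plays the role of the forwarding decision function: lowering it trades accuracy for server headroom, which is the knob a multi-tenancy-aware scheduler can tune per device.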
Related papers
- CascadeServe: Unlocking Model Cascades for Inference Serving [8.39076781907597]
Machine learning models are increasingly deployed to production, calling for efficient inference serving systems.
Efficient inference serving is complicated by two challenges: (i) ML models incur high computational costs, and (ii) the request arrival rates of practical applications have frequent, high-amplitude variations.
Model cascades are positioned to tackle both of these challenges, as they (i) save work while maintaining accuracy, and (ii) expose a high-resolution trade-off between work and accuracy, allowing for fine-grained adjustments in response to request arrival rates.
arXiv Detail & Related papers (2024-06-20T15:47:37Z)
- Backpropagation-Free Multi-modal On-Device Model Adaptation via Cloud-Device Collaboration [37.456185990843515]
We introduce a Universal On-Device Multi-modal Model Adaptation Framework.
The framework features the Fast Domain Adaptor (FDA) hosted in the cloud, providing tailored parameters for the Lightweight Multi-modal Model on devices.
Our contributions represent a pioneering solution for on-device Multi-modal Model Adaptation (DMMA)
arXiv Detail & Related papers (2024-05-21T14:42:18Z)
- Improving Efficiency of Diffusion Models via Multi-Stage Framework and Tailored Multi-Decoder Architectures [12.703947839247693]
Diffusion models, emerging as powerful deep generative tools, excel in various applications.
However, their remarkable generative performance is hindered by slow training and sampling.
This is due to the necessity of tracking extensive forward and reverse diffusion trajectories.
We present a multi-stage framework inspired by our empirical findings to tackle these challenges.
arXiv Detail & Related papers (2023-12-14T17:48:09Z)
- Complexity Matters: Rethinking the Latent Space for Generative Modeling [65.64763873078114]
In generative modeling, numerous successful approaches leverage a low-dimensional latent space, e.g., Stable Diffusion.
In this study, we aim to shed light on this under-explored topic by rethinking the latent space from the perspective of model complexity.
arXiv Detail & Related papers (2023-07-17T07:12:29Z)
- Task-Oriented Sensing, Computation, and Communication Integration for Multi-Device Edge AI [108.08079323459822]
This paper studies a new multi-device edge artificial intelligence (AI) system, which jointly exploits AI-model split inference and integrated sensing and communication (ISAC)
We measure the inference accuracy by adopting an approximate but tractable metric, namely discriminant gain.
arXiv Detail & Related papers (2022-07-03T06:57:07Z) - TMS: A Temporal Multi-scale Backbone Design for Speaker Embedding [60.292702363839716]
Current SOTA backbone networks for speaker embedding are designed to aggregate multi-scale features from an utterance with multi-branch network architectures for speaker representation.
We propose an effective temporal multi-scale (TMS) model where multi-scale branches could be efficiently designed in a speaker embedding network almost without increasing computational costs.
arXiv Detail & Related papers (2022-03-17T05:49:35Z) - Parallel Successive Learning for Dynamic Distributed Model Training over
Heterogeneous Wireless Networks [50.68446003616802]
Federated learning (FedL) has emerged as a popular technique for distributing model training over a set of wireless devices.
We develop parallel successive learning (PSL), which expands the FedL architecture along three dimensions.
Our analysis sheds light on the notion of cold vs. warmed up models, and model inertia in distributed machine learning.
arXiv Detail & Related papers (2022-02-07T05:11:01Z) - SensiX++: Bringing MLOPs and Multi-tenant Model Serving to Sensory Edge
Devices [69.1412199244903]
We present a multi-tenant runtime for adaptive model execution with integrated MLOps on edge devices, e.g., a camera, a microphone, or IoT sensors.
SensiX++ operates on two fundamental principles: highly modular componentisation to externalise data operations with clear abstractions, and document-centric manifestation for system-wide orchestration.
We report on the overall throughput and quantified benefits of various automation components of SensiX++ and demonstrate its efficacy to significantly reduce operational complexity and lower the effort to deploy, upgrade, reconfigure and serve embedded models on edge devices.
arXiv Detail & Related papers (2021-09-08T22:06:16Z) - Multi-mode Transformer Transducer with Stochastic Future Context [53.005638503544866]
Multi-mode speech recognition models can process longer future context to achieve higher accuracy, and when the latency budget is not flexible they can still achieve reliable accuracy.
We show that a Multi-mode ASR model rivals, if not surpasses, a set of competitive streaming baselines trained with different latency budgets.
arXiv Detail & Related papers (2021-06-17T18:42:11Z) - Multiple Access in Dynamic Cell-Free Networks: Outage Performance and
Deep Reinforcement Learning-Based Design [24.632250413917816]
In future cell-free (or cell-less) wireless networks, a large number of devices in a geographical area will be served simultaneously by a large number of distributed access points (APs)
We propose a novel dynamic cell-free network architecture to reduce the complexity of joint processing of users' signals in presence of a large number of devices and APs.
In our system setting, the proposed DDPG-DDQN scheme is found to achieve around 78% of the rate achievable through an exhaustive search-based design.
arXiv Detail & Related papers (2020-01-29T03:00:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.