SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads
- URL: http://arxiv.org/abs/2312.16733v1
- Date: Wed, 27 Dec 2023 22:24:11 GMT
- Title: SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads
- Authors: Alind Khare, Dhruv Garg, Sukrit Kalra, Snigdha Grandhi, Ion Stoica,
Alexey Tumanov
- Abstract summary: ML inference serving systems need to balance latency and accuracy requirements of an application.
We show that SubNetAct simultaneously serves the entire range of models spanning the latency-accuracy tradeoff space.
We show that SubNetAct requires up to 2.6x lower memory to serve a vastly higher number of models than prior state-of-the-art.
- Score: 18.461201610784077
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The increasing deployment of ML models on the critical path of production
applications in both datacenter and the edge requires ML inference serving
systems to serve these models under unpredictable and bursty request arrival
rates. Serving models under such conditions requires these systems to strike a
careful balance between the latency and accuracy requirements of the
application and the overall efficiency of utilization of scarce resources.
State-of-the-art systems resolve this tension by either choosing a static point
in the latency-accuracy tradeoff space to serve all requests or load specific
models on the critical path of request serving. In this work, we instead
resolve this tension by simultaneously serving the entire range of models
spanning the latency-accuracy tradeoff space. Our novel mechanism, SubNetAct,
achieves this by carefully inserting specialized operators in weight-shared
SuperNetworks. These operators enable SubNetAct to dynamically route requests
through the network to meet a latency and accuracy target. SubNetAct requires
up to 2.6x lower memory to serve a vastly higher number of models than prior
state-of-the-art. In addition, SubNetAct's near-instantaneous actuation of
models unlocks the design space of fine-grained, reactive scheduling policies.
We explore the design of one such extremely effective policy, SlackFit, and
instantiate both SubNetAct and SlackFit in a real system, SuperServe.
SuperServe achieves 4.67% higher accuracy for the same SLO attainment and 2.85x
higher SLO attainment for the same accuracy on a trace derived from the
real-world Microsoft Azure Functions workload, and automatically yields the best
trade-offs on a wide range of extremely bursty synthetic traces.
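
As a concrete illustration of the serving loop the abstract describes, here is a minimal Python sketch, not the paper's implementation: a table of profiled subnetwork operating points stands in for the weight-shared SuperNetwork, and a SlackFit-style rule picks the most accurate subnetwork whose latency still fits the request's remaining slack. All names and numbers below (SubnetProfile, slackfit_pick, the three operating points) are invented for illustration; SubNetAct's actual routing happens via specialized operators inserted inside the network.

```python
# Hypothetical sketch of slack-based subnetwork selection; not SuperServe code.
from dataclasses import dataclass


@dataclass(frozen=True)
class SubnetProfile:
    name: str          # which subnetwork of the supernet to activate
    latency_ms: float  # profiled inference latency of that subnetwork
    accuracy: float    # profiled accuracy of that subnetwork


# Invented operating points spanning the latency-accuracy tradeoff space.
PROFILES = [
    SubnetProfile("small", latency_ms=5.0, accuracy=0.72),
    SubnetProfile("medium", latency_ms=12.0, accuracy=0.78),
    SubnetProfile("large", latency_ms=30.0, accuracy=0.81),
]


def slackfit_pick(profiles, slack_ms):
    """SlackFit-style choice as the abstract describes it: serve each request
    with the most accurate subnetwork that still fits its remaining slack,
    falling back to the fastest subnetwork when nothing fits."""
    feasible = [p for p in profiles if p.latency_ms <= slack_ms]
    if not feasible:
        return min(profiles, key=lambda p: p.latency_ms)
    return max(feasible, key=lambda p: p.accuracy)


# Because every subnetwork shares the supernet's weights, "actuating" a model
# is a per-request routing decision; nothing is loaded on the critical path.
for slack in (4.0, 15.0, 50.0):
    chosen = slackfit_pick(PROFILES, slack)
    print(f"slack={slack:5.1f} ms -> {chosen.name} "
          f"(lat={chosen.latency_ms} ms, acc={chosen.accuracy})")
```

The near-instantaneous switch between operating points is what opens up the fine-grained, reactive scheduling space the abstract refers to; a system that had to load a separate model per operating point could not react at this granularity.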
Related papers
- Unleashing the Power of Task-Specific Directions in Parameter Efficient Fine-tuning [65.31677646659895]
This paper focuses on the concept of task-specific directions (TSDs), which are critical for transitioning large models from pretrained states to task-specific enhancements in PEFT.
We introduce a novel approach, LoRA-Dash, which aims to maximize the impact of TSDs during the fine-tuning process, thereby enhancing model performance on targeted tasks.
arXiv Detail & Related papers (2024-09-02T08:10:51Z) - CascadeServe: Unlocking Model Cascades for Inference Serving [8.39076781907597]
Machine learning models are increasingly deployed to production, calling for efficient inference serving systems.
Efficient inference serving is complicated by two challenges: (i) ML models incur high computational costs, and (ii) the request arrival rates of practical applications have frequent, high-magnitude variations.
Model cascades are positioned to tackle both of these challenges, as they (i) save work while maintaining accuracy, and (ii) expose a high-resolution trade-off between work and accuracy, allowing for fine-grained adaptation to changing request arrival rates (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2024-06-20T15:47:37Z) - DIET: Customized Slimming for Incompatible Networks in Sequential Recommendation [16.44627200990594]
Recommender systems have begun to deploy models on edge devices to alleviate the network congestion caused by frequent mobile requests.
Several studies have leveraged the edge's proximity to real-time data, fine-tuning models at the edge to create edge-specific variants.
These methods require substantial on-edge computational resources and frequent network transfers to keep the model up to date.
We propose a customizeD slImming framework for incompatiblE neTworks (DIET). DIET deploys the same generic backbone (potentially incompatible for a specific edge) to all devices.
arXiv Detail & Related papers (2024-06-13T04:39:16Z) - Latency-aware Unified Dynamic Networks for Efficient Image Recognition [72.8951331472913]
LAUDNet is a framework to bridge the theoretical and practical efficiency gap in dynamic networks.
It integrates three primary dynamic paradigms-spatially adaptive computation, dynamic layer skipping, and dynamic channel skipping.
It can notably reduce the latency of models like ResNet by over 50% on platforms such as V100, 3090, and TX2 GPUs.
arXiv Detail & Related papers (2023-08-30T10:57:41Z) - Subgraph Stationary Hardware-Software Inference Co-Design [11.17417275752636]
A growing body of research focuses on reaching better latency-accuracy tradeoffs for Machine Learning models.
We make a case for applications that operate in dynamically changing deployment scenarios, where no single static point is optimal.
We take a hardware-software co-design approach with a real implementation of SGS (SubGraph Stationary) in SushiAccel and a software scheduler, SushiSched, that controls which SubNets to serve and what to cache in real time.
arXiv Detail & Related papers (2023-06-21T16:02:52Z) - A Graph Neural Networks based Framework for Topology-Aware Proactive SLA
Management in a Latency Critical NFV Application Use-case [0.34376560669160383]
Recent advancements in 5G and 6G have led to the emergence of latency-critical applications delivered via a Network Function Virtualization (NFV) enabled paradigm.
We propose a proactive SLA management framework leveraging Graph Neural Networks (GNN) and Deep Reinforcement Learning (DRL) to balance the trade-off between efficiency and reliability.
arXiv Detail & Related papers (2022-11-10T23:22:05Z) - Cocktail: Leveraging Ensemble Learning for Optimized Model Serving in
Public Cloud [9.149566952446058]
We propose Cocktail, a cost-effective ensembling-based model serving framework.
A prototype implementation of Cocktail on the AWS EC2 platform and exhaustive evaluations using a variety of workloads demonstrate that Cocktail can reduce deployment cost by 1.45x.
arXiv Detail & Related papers (2021-06-09T19:23:58Z) - Dynamic Slimmable Network [105.74546828182834]
We develop a dynamic network slimming regime named Dynamic Slimmable Network (DS-Net).
Our DS-Net is empowered with the ability of dynamic inference by the proposed double-headed dynamic gate.
It consistently outperforms its static counterparts as well as state-of-the-art static and dynamic model compression methods.
arXiv Detail & Related papers (2021-03-24T15:25:20Z) - Adaptive Subcarrier, Parameter, and Power Allocation for Partitioned
Edge Learning Over Broadband Channels [69.18343801164741]
Partitioned edge learning (PARTEL) implements parameter-server training, a well-known distributed learning method, in a wireless network.
We consider the case of deep neural network (DNN) models which can be trained using PARTEL by introducing some auxiliary variables.
arXiv Detail & Related papers (2020-10-08T15:27:50Z) - Toward fast and accurate human pose estimation via soft-gated skip
connections [97.06882200076096]
This paper is on highly accurate and highly efficient human pose estimation.
We re-analyze the design of skip connections in the context of improving both the accuracy and the efficiency over the state-of-the-art.
Our model achieves state-of-the-art results on the MPII and LSP datasets.
arXiv Detail & Related papers (2020-02-25T18:51:51Z) - Taurus: A Data Plane Architecture for Per-Packet ML [59.1343317736213]
We present the design and implementation of Taurus, a data plane for line-rate inference.
Our evaluation of a Taurus switch ASIC shows that Taurus operates orders of magnitude faster than a server-based control plane.
arXiv Detail & Related papers (2020-02-12T09:18:36Z)