HybridServe: Efficient Serving of Large AI Models with Confidence-Based Cascade Routing
- URL: http://arxiv.org/abs/2505.12566v1
- Date: Sun, 18 May 2025 22:54:16 GMT
- Title: HybridServe: Efficient Serving of Large AI Models with Confidence-Based Cascade Routing
- Authors: Leyang Xue, Yao Fu, Luo Mai, Mahesh K. Marina
- Abstract summary: We propose HybridServe, a novel hybrid model serving system for giant Deep Neural Networks (DNNs). HybridServe prefers to serve inference requests with energy-efficient smaller models so long as accuracy is not compromised. We show that it reduces energy footprint by up to 19.8x compared to the state-of-the-art DNN model serving systems.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Giant Deep Neural Networks (DNNs) have become indispensable for accurate and robust support of large-scale cloud-based AI services. However, serving giant DNNs is prohibitively expensive from an energy consumption viewpoint, easily exceeding that of training, due to the enormous scale of the GPU clusters needed to hold giant DNN model partitions and replicas. Existing approaches can optimize either energy efficiency or inference accuracy, but not both. To overcome this status quo, we propose HybridServe, a novel hybrid DNN model serving system that leverages multiple sized versions (small to giant) of the model to be served in tandem. Through a confidence-based hybrid model serving dataflow, HybridServe prefers to serve inference requests with energy-efficient smaller models so long as accuracy is not compromised, thereby reducing the number of replicas needed for giant DNNs. HybridServe also features a dataflow planner for efficient partitioning and replication of candidate models to maximize serving system throughput. Experimental results using a prototype implementation of HybridServe show that it reduces the energy footprint by up to 19.8x compared to state-of-the-art DNN model serving systems while matching the accuracy of serving solely with giant DNNs.
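The routing idea can be illustrated with a minimal sketch. It assumes confidence is the maximum softmax probability over a single request and uses a hypothetical 0.9 threshold; the paper's actual confidence measure and serving dataflow are more involved.

```python
import torch
import torch.nn.functional as F

CONFIDENCE_THRESHOLD = 0.9  # assumed value; tuned per task in practice

@torch.no_grad()
def cascade_infer(x, small_model, giant_model):
    """Serve with the energy-efficient small model; re-route to the
    giant model only when the small model is not confident enough."""
    probs = F.softmax(small_model(x), dim=-1)
    confidence, prediction = probs.max(dim=-1)
    if confidence.item() >= CONFIDENCE_THRESHOLD:  # single-request batch
        return prediction                  # energy-efficient fast path
    return giant_model(x).argmax(dim=-1)   # low confidence: escalate
```

Because most requests clear the threshold, the giant model handles only the residual hard cases, which is what allows the system to hold fewer giant-model replicas.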
Related papers
- Optimizing Distributed Deployment of Mixture-of-Experts Model Inference in Serverless Computing [9.217991144854851]
Mixture-of-Experts (MoE) models have become a dominant model architecture. We study optimized MoE model deployment and distributed inference serving on a serverless platform. Our designs reduce the billed cost of all MoE layers by at least 75.67% compared to CPU clusters.
arXiv Detail & Related papers (2025-01-09T15:29:33Z)
- FusedInf: Efficient Swapping of DNN Models for On-Demand Serverless Inference Services on the Edge [2.1119495676190128]
We introduce FusedInf to efficiently swap DNN models for on-demand serverless inference services on the edge.
Our evaluation of popular DNN models showed that fusing them into a single DAG can make their execution up to 14% faster.
arXiv Detail & Related papers (2024-10-28T15:21:23Z)
- Dual-Model Distillation for Efficient Action Classification with Hybrid Edge-Cloud Solution [1.8029479474051309]
We design a hybrid edge-cloud solution that leverages the efficiency of smaller models for local processing while deferring to larger, more accurate cloud-based models when necessary.
Specifically, we propose a novel unsupervised data generation method, Dual-Model Distillation (DMD), to train a lightweight switcher model that can predict when the edge model's output is uncertain.
Experimental results on the action classification task show that our framework not only incurs less computational overhead but also improves accuracy compared to using a large model alone; the label-generation step is sketched below.
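One plausible reading of the switcher training recipe is that agreement between the edge and cloud models supplies free labels. The sketch below assumes exactly that; the function name and the argmax-agreement rule are illustrative assumptions, not DMD's actual procedure.

```python
import torch

@torch.no_grad()
def make_switcher_labels(inputs, edge_model, cloud_model):
    """Unsupervised labels for the switcher: 1 where the edge model's
    prediction disagrees with the cloud model's (defer to the cloud),
    0 where the edge model can be trusted on its own."""
    edge_pred = edge_model(inputs).argmax(dim=-1)
    cloud_pred = cloud_model(inputs).argmax(dim=-1)
    return (edge_pred != cloud_pred).long()
```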
arXiv Detail & Related papers (2024-10-16T02:06:27Z)
- Hybrid SD: Edge-Cloud Collaborative Inference for Stable Diffusion Models [6.015486729281141]
We introduce Hybrid SD, a training-free SDM inference framework for edge-cloud collaborative inference.
We show that our compressed models achieve state-of-the-art parameter efficiency (225.8M) on edge devices with competitive image quality.
Hybrid SD reduces the cloud cost by 66% with edge-cloud collaborative inference.
arXiv Detail & Related papers (2024-08-13T05:30:41Z)
- Towards Robust and Efficient Cloud-Edge Elastic Model Adaptation via Selective Entropy Distillation [56.79064699832383]
We establish a Cloud-Edge Elastic Model Adaptation (CEMA) paradigm in which the edge models only need to perform forward propagation.
In CEMA, to reduce the communication burden, we devise two criteria that exclude unnecessary samples from being uploaded to the cloud, as in the sketch below.
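A minimal sketch of such an entropy-based filter follows. The two thresholds, and the rule of dropping both very-high-entropy (unreliable) and very-low-entropy (uninformative) samples, are assumptions for illustration; CEMA's actual criteria may differ.

```python
import torch.nn.functional as F

ENTROPY_HIGH = 2.0  # assumed: above this, predictions are too unreliable
ENTROPY_LOW = 0.2   # assumed: below this, samples carry little new signal

def should_upload(logits):
    """Keep only samples whose predictive entropy falls between the two
    thresholds; everything else is excluded from the upload to the cloud."""
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    return (entropy > ENTROPY_LOW) & (entropy < ENTROPY_HIGH)
```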
arXiv Detail & Related papers (2024-02-27T08:47:19Z)
- A-SDM: Accelerating Stable Diffusion through Redundancy Removal and Performance Optimization [54.113083217869516]
In this work, we first explore the computationally redundant parts of the network. We then prune the redundant blocks of the model while maintaining network performance. Thirdly, we propose a global-regional interactive (GRI) attention mechanism to speed up the computationally intensive attention part.
arXiv Detail & Related papers (2023-12-24T15:37:47Z)
- Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST), a recently proposed and highly effective technique for solving the aforementioned problems.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z)
- Evaluating Distribution System Reliability with Hyperstructures Graph Convolutional Nets [74.51865676466056]
We show how graph convolutional networks and a hyperstructures representation learning framework can be employed for accurate, reliable, and computationally efficient distribution grid planning.
Our numerical experiments show that the proposed Hyper-GCNNs approach yields substantial gains in computational efficiency.
arXiv Detail & Related papers (2022-11-14T01:29:09Z)
- NASOA: Towards Faster Task-oriented Online Fine-tuning with a Zoo of Models [90.6485663020735]
Fine-tuning from pre-trained ImageNet models has been a simple, effective, and popular approach for various computer vision tasks.
We propose a joint Neural Architecture Search and Online Adaption framework named NASOA for faster task-oriented fine-tuning.
arXiv Detail & Related papers (2021-08-07T12:03:14Z)
- AppealNet: An Efficient and Highly-Accurate Edge/Cloud Collaborative Architecture for DNN Inference [16.847204351692632]
AppealNet is a novel edge/cloud collaborative architecture that runs deep learning (DL) tasks more efficiently than state-of-the-art solutions.
For a given input, AppealNet accurately predicts on-the-fly whether it can be successfully processed by the DL model deployed on the resource-constrained edge device.
arXiv Detail & Related papers (2021-05-10T04:13:35Z)
- Model Fusion via Optimal Transport [64.13185244219353]
We present a layer-wise model fusion algorithm for neural networks.
We show that this can successfully yield "one-shot" knowledge transfer between neural networks trained on heterogeneous non-i.i.d. data.
arXiv Detail & Related papers (2019-10-12T22:07:15Z)