SERFLOW: A Cross-Service Cost Optimization Framework for SLO-Aware Dynamic ML Inference
- URL: http://arxiv.org/abs/2510.27182v1
- Date: Fri, 31 Oct 2025 05:10:33 GMT
- Title: SERFLOW: A Cross-Service Cost Optimization Framework for SLO-Aware Dynamic ML Inference
- Authors: Zongshun Zhang, Ibrahim Matta
- Abstract summary: Prior work often overlooks real-world factors such as Virtual Machine (VM) cold starts and requests under long-tail service time distributions. We model each ML query (request) as traversing an acyclic sequence of stages, wherein each stage constitutes a contiguous block of sparse model parameters ending in an internal or final exit. SERFLOW addresses this challenge by leveraging FaaS-based serverless functions (containers) and using stage-specific resource provisioning that accounts for the fraction of requests exiting at each stage.
- Score: 0.15039745292757667
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Dynamic offloading of Machine Learning (ML) model partitions across different resource orchestration services, such as Function-as-a-Service (FaaS) and Infrastructure-as-a-Service (IaaS), can balance processing and transmission delays while minimizing the costs of adaptive inference applications. However, prior work often overlooks real-world factors such as Virtual Machine (VM) cold starts and requests with long-tail service time distributions. To tackle these limitations, we model each ML query (request) as traversing an acyclic sequence of stages, wherein each stage constitutes a contiguous block of sparse model parameters ending in an internal or final classifier where requests may exit. Since input-dependent exit rates vary, no single resource configuration suits all query distributions. IaaS-based VMs become underutilized when many requests exit early, yet rapidly scaling them to handle bursts of requests that reach deep layers is impractical. SERFLOW addresses this challenge by leveraging FaaS-based serverless functions (containers) and using stage-specific resource provisioning that accounts for the fraction of requests exiting at each stage. By integrating this provisioning with adaptive load balancing across VMs and serverless functions based on request ingestion, SERFLOW reduces cloud costs by over 23% while efficiently adapting to dynamic workloads.
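As a back-of-the-envelope illustration of the stage-exit cost model described in the abstract, the Python sketch below compares, for each stage, the hourly cost of reserved VMs against per-request serverless (FaaS) pricing for the fraction of traffic that actually reaches that stage. This is a minimal steady-state sketch, not the paper's optimizer; all exit rates, capacities, and prices are hypothetical placeholders.

import math
from dataclasses import dataclass

@dataclass
class Stage:
    exit_rate: float    # fraction of arriving requests that exit at this stage's classifier (assumed)
    vm_cost: float      # $/hour for a VM sized to this stage's partition (assumed)
    vm_capacity: float  # requests/hour one such VM can serve (assumed)
    faas_cost: float    # $/request when this stage runs as a serverless function (assumed)

def provision(stages, ingress_rate):
    """Pick the cheaper backend per stage, given how traffic thins out at each exit.
    Steady-state simplification: a real system must also absorb bursts and cold starts."""
    reach = 1.0  # fraction of ingress traffic that reaches the current stage
    plan = []
    for s in stages:
        rate = ingress_rate * reach                          # requests/hour hitting this stage
        vm_hourly = math.ceil(rate / s.vm_capacity) * s.vm_cost
        faas_hourly = rate * s.faas_cost
        plan.append("VM" if vm_hourly <= faas_hourly else "FaaS")
        reach *= 1.0 - s.exit_rate                           # survivors continue to the next stage
    return plan

# Illustrative numbers only: most requests exit early, so early stages
# keep VMs busy, while the sparsely hit deep stage is cheaper on FaaS.
stages = [
    Stage(exit_rate=0.7, vm_cost=0.40, vm_capacity=3600, faas_cost=0.0002),
    Stage(exit_rate=0.8, vm_cost=0.80, vm_capacity=1800, faas_cost=0.0008),
    Stage(exit_rate=1.0, vm_cost=1.60, vm_capacity=900,  faas_cost=0.0020),
]
print(provision(stages, ingress_rate=10000))  # -> ['VM', 'VM', 'FaaS']

Even this toy comparison reproduces the abstract's intuition: VMs provisioned for the full pipeline sit idle once most requests exit early, while the long tail of requests reaching deep layers is served more cheaply per invocation.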
Related papers
- xLLM Technical Report [57.13120905321185]
We introduce xLLM, an intelligent and efficient Large Language Model (LLM) inference framework. xLLM builds a novel decoupled service-engine architecture. xLLM-Engine co-optimizes system and algorithm designs to fully saturate computing resources.
arXiv Detail & Related papers (2025-10-16T13:53:47Z)
- CSGO: Generalized Optimization for Cold Start in Wireless Collaborative Edge LLM Systems [62.24576366776727]
We propose a latency-aware scheduling framework to minimize total inference latency. We show that the proposed method significantly reduces cold-start latency compared to baseline strategies.
arXiv Detail & Related papers (2025-08-15T07:49:22Z)
- PolyServe: Efficient Multi-SLO Serving at Scale [6.147741784378271]
PolyServe is a novel multi-SLO scheduling policy that maintains high SLO attainment at scale while maximizing throughput. PolyServe achieves a 1.23x goodput gain over existing policies, reaching up to 92.5% of optimal goodput.
arXiv Detail & Related papers (2025-07-17T05:54:42Z)
- Tempo: Application-aware LLM Serving with Mixed SLO Requirements [7.290735867969561]
We introduce Tempo, a scheduler designed to maximize service gain across diverse LLM workloads. Our evaluation shows that Tempo improves end-to-end service gain by up to 8.3x and achieves up to 10.3x higher SLO goodput compared to state-of-the-art designs.
arXiv Detail & Related papers (2025-04-24T05:55:21Z)
- Scalable and Cost-Efficient ML Inference: Parallel Batch Processing with Serverless Functions [0.36832029288386137]
This paper explores how serverless architectures can make large-scale ML inference tasks faster and more cost-effective. We demonstrate that serverless parallel processing can reduce execution time by over 95% compared to monolithic approaches, at the same cost.
arXiv Detail & Related papers (2025-01-30T15:47:55Z)
- DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [114.61347672265076]
Development of MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms.
We propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR) that automatically adjusts the size of the activated MLLM.
DeeR reduces the computational cost of the LLM by 5.2-6.5x and its GPU memory usage by 2-6x without compromising performance.
arXiv Detail & Related papers (2024-11-04T18:26:08Z)
- Queue Management for SLO-Oriented Large Language Model Serving [3.0134961904579094]
We propose QLM, a queue management system for large language model (LLM) serving. QLM maintains batch and interactive requests across different models and SLOs in a request queue. It uses a Request Waiting Time (RWT) Estimator that estimates the waiting times for requests in the request queue.
arXiv Detail & Related papers (2024-06-05T21:17:34Z)
- Llumnix: Dynamic Scheduling for Large Language Model Serving [17.919408899409113]
Inference serving for large language models (LLMs) is the key to unleashing their potential.
We introduce Llumnix, an LLM serving system that reacts to such heterogeneous and unpredictable requests by runtime rescheduling.
We show that Llumnix improves tail latencies by an order of magnitude, accelerates high-priority requests by up to 1.5x, and delivers up to 36% cost savings.
arXiv Detail & Related papers (2024-06-05T13:20:18Z)
- SpotServe: Serving Generative Large Language Models on Preemptible Instances [64.18638174004151]
SpotServe is the first distributed large language models serving system on preemptible instances.
We show that SpotServe can reduce the P99 tail latency by 2.4 - 9.1x compared with the best existing LLM serving systems.
We also show that SpotServe can leverage the price advantage of preemptible instances, saving 54% monetary cost compared with only using on-demand instances.
arXiv Detail & Related papers (2023-11-27T06:31:17Z)
- Client Orchestration and Cost-Efficient Joint Optimization for NOMA-Enabled Hierarchical Federated Learning [55.49099125128281]
We propose a non-orthogonal multiple access (NOMA) enabled HFL system under semi-synchronous cloud model aggregation.
We show that the proposed scheme outperforms the considered benchmarks regarding HFL performance improvement and total cost reduction.
arXiv Detail & Related papers (2023-11-03T13:34:44Z)
- In Situ Framework for Coupling Simulation and Machine Learning with Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z)
- BSAC-CoEx: Coexistence of URLLC and Distributed Learning Services via Device Selection [46.59702442756128]
High-priority ultra-reliable low latency communication (URLLC) and low-priority distributed learning services run concurrently over a network. We formulate this problem as a Markov decision process and address it via BSAC-CoEx, a framework based on the branching soft actor-critic (BSAC) algorithm. Our solution can significantly decrease the training delays of the distributed learning service while keeping the URLLC availability above its required threshold.
arXiv Detail & Related papers (2022-12-22T15:36:15Z)