CascadeServe: Unlocking Model Cascades for Inference Serving
- URL: http://arxiv.org/abs/2406.14424v1
- Date: Thu, 20 Jun 2024 15:47:37 GMT
- Title: CascadeServe: Unlocking Model Cascades for Inference Serving
- Authors: Ferdi Kossmann, Ziniu Wu, Alex Turk, Nesime Tatbul, Lei Cao, Samuel Madden
- Abstract summary: Machine learning models are increasingly deployed to production, calling for efficient inference serving systems.
Efficient inference serving is complicated by two challenges: (i) ML models incur high computational costs, and (ii) the request arrival rates of practical applications have frequent, high, and sudden variations.
Model cascades are positioned to tackle both of these challenges, as they (i) save work while maintaining accuracy, and (ii) expose a high-resolution trade-off between work and accuracy, allowing for fine-grained adaptation to changing request arrival rates.
- Score: 8.39076781907597
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine learning (ML) models are increasingly deployed to production, calling for efficient inference serving systems. Efficient inference serving is complicated by two challenges: (i) ML models incur high computational costs, and (ii) the request arrival rates of practical applications have frequent, high, and sudden variations which make it hard to provision hardware correctly. Model cascades are positioned to tackle both of these challenges, as they (i) save work while maintaining accuracy, and (ii) expose a high-resolution trade-off between work and accuracy, allowing for fine-grained adjustments to request arrival rates. Despite their potential, model cascades have not been used inside an online serving system. Doing so comes with its own set of challenges, including workload adaptation, model replication onto hardware, inference scheduling, request batching, and more. In this work, we propose CascadeServe, which automates and optimizes end-to-end inference serving with cascades. CascadeServe operates in an offline and an online phase. In the offline phase, the system pre-computes a gear plan that specifies how to serve inferences online. In the online phase, the gear plan allows the system to serve inferences while making near-optimal adaptations to the query load at negligible decision overheads. We find that CascadeServe saves 2-3x in cost across a wide spectrum of the latency-accuracy space when compared to state-of-the-art baselines on different workloads.
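To make the mechanism concrete, the sketch below illustrates the two ideas the abstract describes: a confidence-thresholded cascade that runs a cheap model on every request and defers only uncertain inputs to an expensive one, plus an online lookup into a precomputed gear plan. This is a minimal sketch under assumed interfaces; the names (Gear, pick_gear, light_model, heavy_model) and the plan layout are illustrative, not CascadeServe's actual data structures.

```python
# Illustrative sketch only: the Gear fields and plan layout are assumptions,
# not CascadeServe's actual interfaces.
from dataclasses import dataclass
from typing import Callable, List, Tuple

import numpy as np


@dataclass
class Gear:
    """One precomputed operating point on the work-accuracy trade-off."""
    max_qps: float               # largest arrival rate this gear is planned for
    confidence_threshold: float  # below this confidence, defer to the heavy model
    batch_size: int              # batch size the serving engine should use


def pick_gear(gear_plan: List[Gear], observed_qps: float) -> Gear:
    """Online phase: a cheap lookup into the offline-computed plan."""
    for gear in gear_plan:  # gears sorted by ascending max_qps
        if observed_qps <= gear.max_qps:
            return gear
    return gear_plan[-1]    # overload: use the gear planned for the highest load


def cascade_predict(
    x: np.ndarray,
    light_model: Callable[[np.ndarray], np.ndarray],
    heavy_model: Callable[[np.ndarray], np.ndarray],
    threshold: float,
) -> Tuple[int, bool]:
    """Run the light model first; defer only low-confidence inputs."""
    probs = light_model(x)                 # cheap pass over every request
    if probs.max() >= threshold:
        return int(probs.argmax()), False  # confident: served cheaply
    probs = heavy_model(x)                 # uncertain: pay for the big model
    return int(probs.argmax()), True
```

Lowering the threshold sends fewer queries to the heavy model, trading accuracy for throughput; precomputing one such configuration per load level is what lets the online phase adapt to load changes at negligible decision overhead.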
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- ConServe: Harvesting GPUs for Low-Latency and High-Throughput Large Language Model Serving [15.01982917560918]
This paper proposes to harvest stranded GPU resources for offline LLM inference tasks.
We built ConServe, an LLM serving system whose execution engine preempts running offline tasks.
Our evaluation demonstrates that ConServe achieves strong performance isolation when co-serving online and offline tasks.
arXiv Detail & Related papers (2024-10-02T04:12:13Z)
- SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads [18.461201610784077]
ML inference serving systems need to balance the latency and accuracy requirements of an application.
We show that SubNetAct simultaneously serves the entire range of models spanning the latency-accuracy tradeoff space.
We show that SubNetAct requires up to 2.6x lower memory to serve a vastly higher number of models than prior state-of-the-art systems.
arXiv Detail & Related papers (2023-12-27T22:24:11Z)
- MultiTASC: A Multi-Tenancy-Aware Scheduler for Cascaded DNN Inference at the Consumer Edge [4.281723404774888]
This work presents MultiTASC, a multi-tenancy-aware scheduler that adaptively controls the decision functions of devices.
By explicitly considering device forwarding, our scheduler improves the latency service-level objective (SLO) satisfaction rate by 20-25 percentage points (pp) over state-of-the-art cascade methods.
arXiv Detail & Related papers (2023-06-22T12:04:49Z)
- On Optimal Caching and Model Multiplexing for Large Model Inference [66.50550915522551]
Large Language Models (LLMs) and other large foundation models have achieved noteworthy success, but their size exacerbates existing resource consumption and latency challenges.
We study two approaches for mitigating these challenges: employing a cache to store previous queries and learning a model multiplexer to choose from an ensemble of models for query processing.
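As a rough illustration of that two-pronged approach, the sketch below pairs an LRU cache over previous queries with a multiplexer that routes each cache miss to one model in an ensemble. The router here is a hypothetical stand-in for the learned multiplexer the paper describes, and all names are assumptions.

```python
# Hedged sketch: an LRU cache plus a model multiplexer. `router` is a
# placeholder for the learned multiplexer; it is not the paper's API.
from collections import OrderedDict
from typing import Callable, List


class CachedMultiplexer:
    def __init__(self, models: List[Callable[[str], str]],
                 router: Callable[[str], int], capacity: int = 1024):
        self.models = models    # e.g., a small and a large model
        self.router = router    # maps a query to an index into `models`
        self.cache: "OrderedDict[str, str]" = OrderedDict()
        self.capacity = capacity

    def query(self, q: str) -> str:
        if q in self.cache:                       # hit: no model runs at all
            self.cache.move_to_end(q)
            return self.cache[q]
        answer = self.models[self.router(q)](q)   # miss: pick one model
        self.cache[q] = answer
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)        # evict least recently used
        return answer
```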
arXiv Detail & Related papers (2023-06-03T05:01:51Z)
- Flexible Job Shop Scheduling via Dual Attention Network Based Reinforcement Learning [73.19312285906891]
In the flexible job shop scheduling problem (FJSP), operations can be processed on multiple machines, leading to intricate relationships between operations and machines.
Recent works have employed deep reinforcement learning (DRL) to learn priority dispatching rules (PDRs) for solving FJSP.
This paper presents a novel end-to-end learning framework that weds the merits of self-attention models for deep feature extraction and DRL for scalable decision-making.
arXiv Detail & Related papers (2023-05-09T01:35:48Z)
- A GPU-specialized Inference Parameter Server for Large-Scale Deep Recommendation Models [6.823233135936128]
Recommendation systems are crucial for a variety of modern apps and web services, such as news feeds, social networks, e-commerce, search, etc.
To achieve peak prediction accuracy, modern recommendation models combine deep learning with terabyte-scale embedding tables to obtain a fine-grained representation of the underlying data.
Traditional inference serving architectures require deploying the whole model to standalone servers, which is infeasible at such a massive scale.
arXiv Detail & Related papers (2022-10-17T07:36:18Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Existing approaches, however, do not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production-grade systems.
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
- Tailored Learning-Based Scheduling for Kubernetes-Oriented Edge-Cloud System [54.588242387136376]
We introduce KaiS, a learning-based scheduling framework for edge-cloud systems.
First, we design a coordinated multi-agent actor-critic algorithm to cater to decentralized request dispatch.
Second, for diverse system scales and structures, we use graph neural networks to embed system state information.
Third, we adopt a two-time-scale scheduling mechanism to harmonize request dispatch and service orchestration.
arXiv Detail & Related papers (2021-01-17T03:45:25Z)
- Understanding Capacity-Driven Scale-Out Neural Recommendation Inference [1.9529164002361878]
This work describes and characterizes scale-out deep learning recommendation inference using data-center serving infrastructure.
We find that the latency and compute overheads of distributed inference are largely a result of a model's static embedding table distribution.
Even more encouragingly, we show how distributed inference can account for efficiency improvements in data-center scale recommendation serving.
arXiv Detail & Related papers (2020-11-04T00:51:40Z)
- Combining Deep Learning and Optimization for Security-Constrained Optimal Power Flow [94.24763814458686]
Security-constrained optimal power flow (SCOPF) is fundamental in power systems.
Modeling of Automatic Primary Response (APR) within the SCOPF problem results in complex, large-scale mixed-integer programs.
This paper proposes a novel approach that combines deep learning and robust optimization techniques.
arXiv Detail & Related papers (2020-07-14T12:38:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.