BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching
- URL: http://arxiv.org/abs/2411.16102v1
- Date: Mon, 25 Nov 2024 05:24:53 GMT
- Title: BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching
- Authors: Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, Ion Stoica
- Abstract summary: Offline batch inference is becoming more popular for latency-insensitive applications.
We present BlendServe, a system that maximizes resource utilization of offline batch inference.
We show that BlendServe provides up to a $1.44\times$ throughput boost compared to widely-used industry standards.
- Score: 28.13349943279609
- Abstract: Offline batch inference, which leverages the flexibility of request batching to achieve higher throughput and lower costs, is becoming more popular for latency-insensitive applications. Meanwhile, recent progress in model capability and modality makes requests more diverse in compute and memory demands, creating unique opportunities for throughput improvement by resource overlapping. However, a request schedule that maximizes resource overlapping can conflict with the schedule that maximizes prefix sharing, a widely-used performance optimization, causing sub-optimal inference throughput. We present BlendServe, a system that maximizes resource utilization of offline batch inference by combining the benefits of resource overlapping and prefix sharing using a resource-aware prefix tree. BlendServe exploits the relaxed latency requirements in offline batch inference to reorder and overlap requests with varied resource demands while ensuring high prefix sharing. We evaluate BlendServe on a variety of synthetic multi-modal workloads and show that it provides up to $1.44\times$ throughput boost compared to widely-used industry standards, vLLM and SGLang.
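The abstract's core idea, reordering requests so compute-bound and memory-bound work overlap while keeping prefix-sharing groups contiguous in a prefix tree, can be illustrated with a toy scheduler. This is a hypothetical sketch, not BlendServe's actual algorithm: the `Request` fields, the interleaving heuristic, and all names are assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: tuple          # token ids; shared prefixes enable KV-cache reuse
    compute_ratio: float   # high = compute-bound (e.g. heavy prefill), low = memory-bound

@dataclass
class Node:
    children: dict = field(default_factory=dict)
    requests: list = field(default_factory=list)

def build_prefix_tree(requests):
    """Insert each request into a trie keyed by prompt tokens."""
    root = Node()
    for r in requests:
        node = root
        for tok in r.prompt:
            node = node.children.setdefault(tok, Node())
        node.requests.append(r)
    return root

def collect(node):
    """Gather all requests under a node, depth-first."""
    out = list(node.requests)
    for child in node.children.values():
        out.extend(collect(child))
    return out

def subtree_ratio(node):
    """Average compute ratio of all requests under this subtree."""
    reqs = collect(node)
    return sum(r.compute_ratio for r in reqs) / len(reqs)

def schedule(root, batch_size):
    """Depth-first traversal keeps prefix-sharing groups contiguous.
    Interleaving subtrees from the compute-heavy and memory-heavy ends
    lets adjacent batches blend opposite resource profiles (a made-up
    heuristic standing in for the paper's resource-aware ordering)."""
    groups = sorted(root.children.values(), key=subtree_ratio, reverse=True)
    order, lo, hi = [], 0, len(groups) - 1
    while lo <= hi:
        order.append(groups[lo]); lo += 1
        if lo <= hi:
            order.append(groups[hi]); hi -= 1
    flat = [r for g in order for r in collect(g)] + list(root.requests)
    return [flat[i:i + batch_size] for i in range(0, len(flat), batch_size)]
```

Note the key tension the abstract describes: a purely resource-balanced order would scatter requests with common prefixes across batches, while a purely prefix-first order would cluster same-profile requests; the tree traversal above keeps each shared-prefix subtree intact and only reorders at subtree granularity.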
Related papers
- Topology-aware Preemptive Scheduling for Co-located LLM Workloads [7.240168647854797]
We develop a fine-grained topology-aware method for scheduling hybrid workloads.
This method significantly increases the efficiency of preemption and improves overall scheduling performance for LLM workloads by $55\%$.
arXiv Detail & Related papers (2024-11-18T13:26:09Z)
- A Distributed Neural Linear Thompson Sampling Framework to Achieve URLLC in Industrial IoT [16.167107624956294]
Industrial Internet of Things (IIoT) networks will provide Ultra-Reliable Low-Latency Communication (URLLC) to support critical processes.
Standard protocols for allocating wireless resources may not optimize the latency-reliability trade-off, especially for uplink communication.
arXiv Detail & Related papers (2023-11-21T12:22:04Z)
- Client Orchestration and Cost-Efficient Joint Optimization for NOMA-Enabled Hierarchical Federated Learning [55.49099125128281]
We propose a non-orthogonal multiple access (NOMA) enabled HFL system under semi-synchronous cloud model aggregation.
We show that the proposed scheme outperforms the considered benchmarks regarding HFL performance improvement and total cost reduction.
arXiv Detail & Related papers (2023-11-03T13:34:44Z)
- Vision-based Semantic Communications for Metaverse Services: A Contest Theoretic Approach [66.10465001046762]
In Metaverse, avatars must be updated and rendered to reflect users' behaviour.
We propose a semantic communication framework to model the interactions between users and MSPs.
We use the semantic communication technique to reduce the amount of data to be transmitted.
arXiv Detail & Related papers (2023-08-15T07:56:33Z)
- On Optimal Caching and Model Multiplexing for Large Model Inference [66.50550915522551]
Large Language Models (LLMs) and other large foundation models have achieved noteworthy success, but their size exacerbates existing resource consumption and latency challenges.
We study two approaches for mitigating these challenges: employing a cache to store previous queries and learning a model multiplexer to choose from an ensemble of models for query processing.
arXiv Detail & Related papers (2023-06-03T05:01:51Z)
- Dynamic Resource Allocation for Metaverse Applications with Deep Reinforcement Learning [64.75603723249837]
This work proposes a novel framework to dynamically manage and allocate different types of resources for Metaverse applications.
We first propose an effective solution to divide applications into groups, namely MetaInstances, where common functions can be shared among applications.
Then, to capture the real-time, dynamic, and uncertain characteristics of request arrival and application departure processes, we develop a semi-Markov decision process-based framework.
arXiv Detail & Related papers (2023-02-27T00:30:01Z)
- Optimization of Image Transmission in a Cooperative Semantic Communication Networks [68.2233384648671]
A semantic communication framework for image transmission is developed.
Servers cooperatively transmit images to a set of users utilizing semantic communication techniques.
A multimodal metric is proposed to measure the correlation between the extracted semantic information and the original image.
arXiv Detail & Related papers (2023-01-01T15:59:13Z)
- Optimal Resource Allocation for Serverless Queries [8.59568779761598]
Prior work focused on predicting peak allocation while ignoring aggressive trade-offs between resource allocation and run-time.
We introduce a system for optimal resource allocation that can predict performance with aggressive trade-offs, for both new and past observed queries.
arXiv Detail & Related papers (2021-07-19T02:55:48Z)
- Deep Reinforcement Learning for Resource Constrained Multiclass Scheduling in Wireless Networks [0.0]
In our setup, the available limited bandwidth resources are allocated in order to serve randomly arriving service demands.
We propose a distributional Deep Deterministic Policy Gradient (DDPG) algorithm combined with Deep Sets to tackle the problem.
Our proposed algorithm is tested on both synthetic and real data, showing consistent gains against state-of-the-art conventional methods.
arXiv Detail & Related papers (2020-11-27T09:49:38Z)
- The Best of Many Worlds: Dual Mirror Descent for Online Allocation Problems [7.433931244705934]
We consider a data-driven setting in which the reward and resource consumption of each request are generated using an input model unknown to the decision maker.
We design a general class of algorithms that attain good performance across various input models without knowing which type of input they are facing.
Our algorithms operate in the Lagrangian dual space: they maintain a dual multiplier for each resource that is updated using online mirror descent.
arXiv Detail & Related papers (2020-11-18T18:39:17Z)
- Coordinated Online Learning for Multi-Agent Systems with Coupled Constraints and Perturbed Utility Observations [91.02019381927236]
We introduce a novel method to steer the agents toward a stable population state, fulfilling the given resource constraints.
The proposed method is a decentralized resource pricing method based on the resource loads resulting from the augmentation of the game's Lagrangian.
arXiv Detail & Related papers (2020-10-21T10:11:17Z)
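Two of the entries above ("Dual Mirror Descent for Online Allocation Problems" and the decentralized resource-pricing method for multi-agent systems) share a common primitive: maintaining one dual multiplier (a price) per resource and updating it online. A minimal, hypothetical sketch of that scheme, using the projected-gradient variant of mirror descent; the function name, parameters, and test numbers are illustrative, not taken from the papers.

```python
def dual_mirror_descent(requests, budgets, horizon, eta=0.1):
    """Online allocation with per-resource dual prices (illustrative sketch).

    requests: list of (reward, consumption) where consumption maps
              resource name -> amount requested.
    budgets:  total budget per resource; rho = budget / horizon is the
              per-step target consumption rate.
    """
    rho = {k: v / horizon for k, v in budgets.items()}
    mu = {k: 0.0 for k in budgets}          # dual multiplier (price) per resource
    remaining = dict(budgets)
    accepted, total_reward = [], 0.0
    for t, (reward, cons) in enumerate(requests):
        # Accept iff the reward beats the current dual price of the
        # request's consumption, and the hard budget still allows it.
        price = sum(mu[k] * cons.get(k, 0.0) for k in mu)
        take = reward > price and all(
            remaining[k] >= cons.get(k, 0.0) for k in cons
        )
        used = cons if take else {}
        if take:
            for k, amount in cons.items():
                remaining[k] -= amount
            total_reward += reward
            accepted.append(t)
        # Mirror-descent step on the dual: the subgradient is the gap
        # between the target rate and actual consumption; with a
        # Euclidean regularizer this is projected gradient descent.
        for k in mu:
            g = rho[k] - used.get(k, 0.0)
            mu[k] = max(0.0, mu[k] - eta * g)
    return accepted, total_reward
```

Intuition: when a step consumes more than its per-step share `rho`, the price of that resource rises and subsequent low-reward requests are rejected; when the budget goes unused, the price decays toward zero and acceptance loosens. The entropic-regularizer variant would replace the additive update with a multiplicative one.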
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences arising from its use.