Efficient LLM Serving on Hybrid Real-time and Best-effort Requests
- URL: http://arxiv.org/abs/2504.09590v1
- Date: Sun, 13 Apr 2025 14:16:57 GMT
- Title: Efficient LLM Serving on Hybrid Real-time and Best-effort Requests
- Authors: Wan Borui, Zhao Juntao, Jiang Chenyu, Guo Chuanxiong, Wu Chuan
- Abstract summary: BROS is a hybrid LLM serving system that aims to collocate real-time (RT) and best-effort (BE) requests, meeting RT requests' latency requirements while maintaining BE requests' throughput. It significantly reduces the latency of RT requests (up to 74.20%) and improves their fine-grained service level objective (SLO) attainment (up to 36.38x), with negligible throughput reduction for BE requests.
- Score: 0.6291443816903801
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent breakthroughs in large language models (LLMs) have enabled various generative tasks on a single model. Real-world services (e.g., OpenAI's ChatGPT [27]) powered by an LLM often concurrently support latency-critical requests for interactive applications (e.g., question-answering systems, referred to as real-time or RT requests) and throughput-oriented requests for back-of-house processing (e.g., document batch processing [28], referred to as best-effort or BE requests), imposing complex hybrid inference workloads on the underlying model. State-of-the-art (SOTA) LLM serving systems dedicate machines to each type of request, targeting either low inference latency or high serving throughput. This practice simplifies request scheduling and management but suffers from poor resource utilization. We propose BROS, a hybrid LLM serving system that aims to collocate RT/BE requests, meeting RT requests' latency requirements while maintaining BE requests' throughput. BROS formulates the problem of hybrid RT/BE request scheduling and solves it with a dynamic priority-based algorithm. BROS designs a bidirectional KV cache management mechanism, allowing RT requests to share KV memory with BE requests to remove the scheduling restrictions caused by insufficient KV memory and improve utilization. Extensive experiments validate that BROS achieves a good trade-off when serving hybrid RT and BE requests. It significantly reduces the latency of RT requests (up to 74.20%) and improves their fine-grained service level objective (SLO) attainment (up to 36.38x), with negligible throughput reduction for BE requests, showing significant advantages over SOTA systems like vLLM and TGI.
Related papers
- Tempo: Application-aware LLM Serving with Mixed SLO Requirements [7.290735867969561]
We introduce Tempo, a scheduler designed to maximize service gain across diverse LLM workloads.
Our evaluation shows that Tempo improves end-to-end service gain by up to 8.3$\times$ and achieves up to 10.3$\times$ SLO goodput compared to state-of-the-art designs.
arXiv Detail & Related papers (2025-04-24T05:55:21Z)
- KunServe: Elastic and Efficient Large Language Model Serving with Parameter-centric Memory Management [14.760434869268423]
Large language model (LLM) serving can easily throttle precious GPU memory under load bursts or long-generation requests. KVCache-centric approaches handle load spikes by dropping, migrating, or swapping KVCache. This paper proposes a parameter-centric approach by selectively dropping replicated parameters to leave precious memory for requests.
arXiv Detail & Related papers (2024-12-24T05:07:46Z)
- DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [114.61347672265076]
Development of MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms.
We propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR) that automatically adjusts the size of the activated MLLM.
DeeR demonstrates significant reductions in computational costs of LLM by 5.2-6.5x and GPU memory of LLM by 2-6x without compromising performance.
arXiv Detail & Related papers (2024-11-04T18:26:08Z)
- Diffusion-based Auction Mechanism for Efficient Resource Management in 6G-enabled Vehicular Metaverses [57.010829427434516]
In 6G-enabled Vehicular Metaverses, vehicles are represented by Vehicle Twins (VTs), which serve as digital replicas of physical vehicles.
VT tasks are resource-intensive and need to be offloaded to ground Base Stations (BSs) for fast processing.
We propose a learning-based Modified Second-Bid (MSB) auction mechanism to optimize resource allocation between ground BSs and UAVs.
arXiv Detail & Related papers (2024-11-01T04:34:54Z)
- Queue management for SLO-oriented large language model serving [3.0134961904579094]
We propose QLM, a queue management system for large language model (LLM) serving. QLM maintains batch and interactive requests across different models and SLOs in a request queue. It uses a Request Waiting Time (RWT) Estimator that estimates the waiting times for requests in the request queue.
arXiv Detail & Related papers (2024-06-05T21:17:34Z)
- RelayAttention for Efficient Large Language Model Serving with Long System Prompts [59.50256661158862]
This paper aims to improve the efficiency of LLM services that involve long system prompts.
Handling these system prompts requires heavily redundant memory accesses in existing causal attention algorithms.
We propose RelayAttention, an attention algorithm that allows reading hidden states from DRAM exactly once for a batch of input tokens.
arXiv Detail & Related papers (2024-02-22T18:58:28Z)
- Guaranteed Dynamic Scheduling of Ultra-Reliable Low-Latency Traffic via Conformal Prediction [72.59079526765487]
The dynamic scheduling of ultra-reliable and low-latency traffic (URLLC) in the uplink can significantly enhance the efficiency of coexisting services.
The main challenge is posed by the uncertainty in the process of URLLC packet generation.
We introduce a novel scheduler for URLLC packets that provides formal guarantees on reliability and latency irrespective of the quality of the URLLC traffic predictor.
arXiv Detail & Related papers (2023-02-15T14:09:55Z)
- Optimization of Image Transmission in a Cooperative Semantic Communication Networks [68.2233384648671]
A semantic communication framework for image transmission is developed.
Servers cooperatively transmit images to a set of users utilizing semantic communication techniques.
A multimodal metric is proposed to measure the correlation between the extracted semantic information and the original image.
arXiv Detail & Related papers (2023-01-01T15:59:13Z)
- ReAssigner: A Plug-and-Play Virtual Machine Scheduling Intensifier for Heterogeneous Requests [14.521969014581728]
A virtual machine scheduling intensifier called Resource Assigner (ReAssigner) is proposed to enhance the scheduling efficiency of any given scheduler for heterogeneous requests.
ReAssigner achieves significant scheduling performance improvement compared with some state-of-the-art scheduling methods.
arXiv Detail & Related papers (2022-11-29T14:05:06Z)
- Collaborative Intelligent Reflecting Surface Networks with Multi-Agent Reinforcement Learning [63.83425382922157]
Intelligent reflecting surface (IRS) is envisioned to be widely applied in future wireless networks.
In this paper, we investigate a multi-user communication system assisted by cooperative IRS devices with the capability of energy harvesting.
arXiv Detail & Related papers (2022-03-26T20:37:14Z)
- QoS-SLA-Aware Artificial Intelligence Adaptive Genetic Algorithm for Multi-Request Offloading in Integrated Edge-Cloud Computing System for the Internet of Vehicles [14.978000952939404]
Internet of Vehicles (IoV) over Vehicular Ad-hoc Networks (VANETs) is an emerging technology enabling the development of smart-city applications for safer, more efficient, and more pleasant travel.
Considering vehicles' limited computational and storage capabilities, application requests are offloaded to an integrated edge-cloud computing system.
This paper proposes a novel Artificial Intelligence (AI) deadline-SLA-aware genetic algorithm (GA) for multi-request offloading in a heterogeneous edge-cloud computing system.
arXiv Detail & Related papers (2022-01-21T10:11:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.