Efficient LLM Serving on Hybrid Real-time and Best-effort Requests
- URL: http://arxiv.org/abs/2504.09590v1
- Date: Sun, 13 Apr 2025 14:16:57 GMT
- Title: Efficient LLM Serving on Hybrid Real-time and Best-effort Requests
- Authors: Wan Borui, Zhao Juntao, Jiang Chenyu, Guo Chuanxiong, Wu Chuan
- Abstract summary: BROS is a hybrid large language model (LLM) serving system that aims to collocate RT/BE requests, meeting RT requests' latency requirements while maintaining BE requests' throughput. It significantly reduces the latency of RT requests (up to 74.20%), improving their fine-grained service level objective (SLO) attainment (up to 36.38x), with negligible throughput reduction for BE requests.
- Score: 0.6291443816903801
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent breakthroughs in large language models (LLMs) have enabled various generative tasks on a single model. Real-world services (e.g., OpenAI's ChatGPT [27]) powered by an LLM often concurrently support latency-critical requests for interactive applications (e.g., question-answering systems, referred to as real-time or RT requests) and throughput-oriented requests for back-of-house processing (e.g., batch document processing [28], referred to as best-effort or BE requests), imposing complex hybrid inference workloads on the underlying model. State-of-the-art (SOTA) LLM serving systems dedicate machines to each type of request, targeting either low inference latency or high serving throughput, respectively. This practice simplifies request scheduling and management but suffers from poor resource utilization. We propose BROS, a hybrid LLM serving system that aims to collocate RT/BE requests, meeting RT requests' latency requirements while maintaining BE requests' throughput. BROS formulates the problem of hybrid RT/BE request scheduling and solves it with a dynamic priority-based algorithm. BROS designs a bidirectional KV cache management mechanism, allowing RT requests to share KV memory with BE requests to remove the scheduling restrictions caused by insufficient KV memory and improve utilization. Extensive experiments validate that BROS achieves a good trade-off when serving hybrid RT and BE requests. It significantly reduces the latency of RT requests (up to 74.20%), improving their fine-grained service level objective (SLO) attainment (up to 36.38x), with negligible throughput reduction for BE requests, showing significant advantages over SOTA systems like vLLM and TGI.
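The abstract does not spell out the scheduling algorithm, so the following is only a minimal sketch of what "dynamic priority-based" scheduling over mixed RT/BE requests could look like, not BROS's actual method. Every name here (`Request`, `priority`, `pick_batch`, the `be_aging` parameter) and the priority formula itself are assumptions made for illustration: RT priority grows as the remaining time to a deadline shrinks, and BE priority grows slowly with waiting time so that BE requests are not starved.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    rid: str
    kind: str                         # "RT" (real-time) or "BE" (best-effort)
    arrival: float                    # arrival time in seconds
    deadline: Optional[float] = None  # absolute deadline, RT only (hypothetical field)


def priority(req: Request, now: float, be_aging: float = 0.01) -> float:
    """Illustrative priority, recomputed at every scheduling step."""
    if req.kind == "RT":
        # Less slack before the deadline means higher priority.
        slack = req.deadline - now
        return 1.0 / max(slack, 1e-3)
    # BE requests age slowly so they eventually get scheduled.
    return be_aging * (now - req.arrival)


def pick_batch(queue: List[Request], now: float, batch_size: int) -> List[Request]:
    """Pick the batch_size highest-priority requests at time `now`."""
    ranked = sorted(queue, key=lambda r: priority(r, now), reverse=True)
    return ranked[:batch_size]
```

Because priorities are recomputed at each step, a BE request that has waited long enough can overtake an RT request with a distant deadline, which is one simple way to trade RT latency against BE throughput.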
Related papers
- Efficient Multimodal Planning Agent for Visual Question-Answering [67.26245301307539]
This paper proposes a method that trains a multimodal planning agent, dynamically decomposing the mRAG pipeline to solve the VQA task. In our experiments, the agent can help reduce redundant computations, cutting search time by over 60% compared to existing methods.
arXiv Detail & Related papers (2026-01-28T14:58:59Z) - SERFLOW: A Cross-Service Cost Optimization Framework for SLO-Aware Dynamic ML Inference [0.15039745292757667]
Prior work often overlooks real-world factors, such as Virtual Machine (VM) cold starts, requests under long-tail service time distributions, etc. We model each ML query (request) as traversing an acyclic sequence of stages, wherein each stage constitutes a contiguous block of sparse model parameters ending in an internal or final exit. SERFLOW addresses this challenge by leveraging F-based serverless functions (containers) and using stage-specific resource provisioning that accounts for the fraction of requests exiting at each stage.
arXiv Detail & Related papers (2025-10-31T05:10:33Z) - REFRAG: Rethinking RAG based Decoding [67.4862300145604]
REFRAG is an efficient decoding framework that compresses, senses, and expands to improve latency in RAG applications. We provide rigorous validation of REFRAG across diverse long-context tasks, including RAG, multi-turn conversations, and long document summarization.
arXiv Detail & Related papers (2025-09-01T03:31:44Z) - CSGO: Generalized Optimization for Cold Start in Wireless Collaborative Edge LLM Systems [62.24576366776727]
We propose a latency-aware scheduling framework to minimize total inference latency. We show that the proposed method significantly reduces cold-start latency compared to baseline strategies.
arXiv Detail & Related papers (2025-08-15T07:49:22Z) - QuantVSR: Low-Bit Post-Training Quantization for Real-World Video Super-Resolution [53.13952833016505]
We propose a low-bit quantization model for real-world video super-resolution (VSR). We use a calibration dataset to measure both spatial and temporal complexity for each layer. We refine the FP and low-bit branches to achieve simultaneous optimization.
arXiv Detail & Related papers (2025-08-06T14:35:59Z) - Tempo: Application-aware LLM Serving with Mixed SLO Requirements [7.290735867969561]
We introduce Tempo, a scheduler designed to maximize service gain across diverse LLM workloads.
Our evaluation shows that Tempo improves end-to-end service gain by up to 8.3$\times$ and achieves up to 10.3$\times$ SLO goodput compared to state-of-the-art designs.
arXiv Detail & Related papers (2025-04-24T05:55:21Z) - KunServe: Elastic and Efficient Large Language Model Serving with Parameter-centric Memory Management [14.760434869268423]
Large language model (LLM) serving can easily throttle precious GPU memory under load bursts or long-generation requests. KVCache-centric approaches handle load spikes by dropping, migrating, or swapping KVCache. This paper proposes a parameter-centric approach by selectively dropping replicated parameters to leave precious memory for requests.
arXiv Detail & Related papers (2024-12-24T05:07:46Z) - DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [114.61347672265076]
Development of MLLMs for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms.
We propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR) that automatically adjusts the size of the activated MLLM.
DeeR demonstrates significant reductions in computational costs of LLM by 5.2-6.5x and GPU memory of LLM by 2-6x without compromising performance.
arXiv Detail & Related papers (2024-11-04T18:26:08Z) - Diffusion-based Auction Mechanism for Efficient Resource Management in 6G-enabled Vehicular Metaverses [57.010829427434516]
In 6G-enabled Vehicular Metaverses, vehicles are represented by Vehicle Twins (VTs), which serve as digital replicas of physical vehicles.
VT tasks are resource-intensive and need to be offloaded to ground Base Stations (BSs) for fast processing.
We propose a learning-based Modified Second-Bid (MSB) auction mechanism to optimize resource allocation between ground BSs and UAVs.
arXiv Detail & Related papers (2024-11-01T04:34:54Z) - ConServe: Fine-Grained GPU Harvesting for LLM Online and Offline Co-Serving [61.35068981176018]
ConServe is a large language model (LLM) serving system that achieves high throughput and strong online latency guarantees. We show that ConServe delivers an average of 2.2$\times$ higher throughput and reduces online serving tail latency by 2.9$\times$ on average compared to state-of-the-art systems.
arXiv Detail & Related papers (2024-10-02T04:12:13Z) - Queue management for slo-oriented large language model serving [3.0134961904579094]
We propose QLM, a queue management system for large language model (LLM) serving. QLM maintains batch and interactive requests across different models and SLOs in a request queue. It uses a Request Waiting Time (RWT) Estimator that estimates the waiting times for requests in the request queue.
arXiv Detail & Related papers (2024-06-05T21:17:34Z) - RelayAttention for Efficient Large Language Model Serving with Long System Prompts [59.50256661158862]
This paper aims to improve the efficiency of LLM services that involve long system prompts.
Handling these system prompts requires heavily redundant memory accesses in existing causal attention algorithms.
We propose RelayAttention, an attention algorithm that allows reading hidden states from DRAM exactly once for a batch of input tokens.
arXiv Detail & Related papers (2024-02-22T18:58:28Z) - Guaranteed Dynamic Scheduling of Ultra-Reliable Low-Latency Traffic via Conformal Prediction [72.59079526765487]
The dynamic scheduling of ultra-reliable and low-latency traffic (URLLC) in the uplink can significantly enhance the efficiency of coexisting services.
The main challenge is posed by the uncertainty in the process of URLLC packet generation.
We introduce a novel scheduler for URLLC packets that provides formal guarantees on reliability and latency irrespective of the quality of the URLLC traffic predictor.
arXiv Detail & Related papers (2023-02-15T14:09:55Z) - Optimization of Image Transmission in a Cooperative Semantic Communication Networks [68.2233384648671]
A semantic communication framework for image transmission is developed.
Servers cooperatively transmit images to a set of users utilizing semantic communication techniques.
A multimodal metric is proposed to measure the correlation between the extracted semantic information and the original image.
arXiv Detail & Related papers (2023-01-01T15:59:13Z) - ReAssigner: A Plug-and-Play Virtual Machine Scheduling Intensifier for Heterogeneous Requests [14.521969014581728]
A virtual machine scheduling intensifier called Resource Assigner (ReAssigner) is proposed to enhance the scheduling efficiency of any given scheduler for heterogeneous requests.
ReAssigner achieves significant scheduling performance improvement compared with some state-of-the-art scheduling methods.
arXiv Detail & Related papers (2022-11-29T14:05:06Z) - Collaborative Intelligent Reflecting Surface Networks with Multi-Agent Reinforcement Learning [63.83425382922157]
Intelligent reflecting surface (IRS) is envisioned to be widely applied in future wireless networks.
In this paper, we investigate a multi-user communication system assisted by cooperative IRS devices with the capability of energy harvesting.
arXiv Detail & Related papers (2022-03-26T20:37:14Z) - QoS-SLA-Aware Artificial Intelligence Adaptive Genetic Algorithm for Multi-Request Offloading in Integrated Edge-Cloud Computing System for the Internet of Vehicles [14.978000952939404]
Internet of Vehicles (IoV) over Vehicular Ad-hoc Networks (VANETs) is an emerging technology enabling the development of smart-city applications for safer, more efficient, and more pleasant travel.
Considering vehicles' limited computational and storage capabilities, application requests are offloaded to an integrated edge-cloud computing system.
This paper proposes a novel Artificial Intelligence (AI) deadline-SLA-aware genetic algorithm (GA) for multi-request offloading in a heterogeneous edge-cloud computing system.
arXiv Detail & Related papers (2022-01-21T10:11:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.