AccelGen: Heterogeneous SLO-Guaranteed High-Throughput LLM Inference Serving for Diverse Applications
- URL: http://arxiv.org/abs/2503.13737v1
- Date: Mon, 17 Mar 2025 21:47:43 GMT
- Title: AccelGen: Heterogeneous SLO-Guaranteed High-Throughput LLM Inference Serving for Diverse Applications
- Authors: Haiying Shen, Tanmoy Sen
- Abstract summary: We propose AccelGen, a high-throughput inference serving system with heterogeneous SLO guarantees for diverse applications. Trace-driven real experiments demonstrate that AccelGen achieves 1.42-11.21X higher throughput, 1.43-13.71X higher goodput, 37-90% higher SLO attainment, and 1.61-12.22X lower response latency compared to state-of-the-art approaches.
- Score: 8.964981700274059
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we consider a mixed-prompt scenario for a large language model (LLM) inference serving system that supports diverse applications with both short and long prompts and heterogeneous SLOs for iteration time. To improve throughput when handling long prompts, previous research introduced a chunking method but did not address heterogeneous SLOs. To address this limitation, we propose AccelGen, a high-throughput LLM inference serving system with heterogeneous SLO guarantees for diverse applications. AccelGen introduces three core components: (1) SLO-guaranteed dynamic chunking, which dynamically adjusts chunk sizes to maximize GPU compute utilization while meeting iteration-level SLOs; (2) Iteration-level SLO-based task prioritization, which prioritizes tight-SLO requests and batches requests with similar SLOs; and (3) Multi-resource-aware batching, which selects queued requests to maximize the utilization of both GPU compute resources and the key-value cache (KVC). Trace-driven real experiments demonstrate that AccelGen achieves 1.42-11.21X higher throughput, 1.43-13.71X higher goodput, 37-90% higher SLO attainment, and 1.61-12.22X lower response latency compared to state-of-the-art approaches. It achieves performance near the Oracle, which optimally maximizes goodput.
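The abstract describes these mechanisms only at a high level. Below is a minimal Python sketch of how the three components could fit together, under stated assumptions: the linear iteration-time cost model, all constants, and the names (Request, dynamic_chunk, build_batch) are hypothetical illustrations, not the authors' implementation.

```python
# Minimal sketch of the three components described in the abstract, assuming a
# linear iteration-time cost model. All names and constants are hypothetical.
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    slo_ms: float                                 # iteration-level SLO; tighter sorts first
    rid: int = field(compare=False)
    remaining_prompt: int = field(compare=False)  # unprocessed prompt tokens
    kvc_blocks: int = field(compare=False)        # KV-cache blocks this request needs

PER_TOKEN_MS, OVERHEAD_MS = 0.05, 4.0             # assumed cost-model parameters

def iteration_time_ms(batch_tokens: int) -> float:
    """Assumed linear model for the time of one iteration over `batch_tokens` tokens."""
    return OVERHEAD_MS + PER_TOKEN_MS * batch_tokens

def dynamic_chunk(req: Request, tokens_used: int, batch_slo_ms: float) -> int:
    """(1) SLO-guaranteed dynamic chunking: the largest prompt chunk that keeps the
    iteration within the tightest SLO of the current batch."""
    spare_ms = batch_slo_ms - iteration_time_ms(tokens_used)
    return max(0, min(req.remaining_prompt, int(spare_ms / PER_TOKEN_MS)))

def build_batch(queue: list[Request], kvc_free_blocks: int, max_batch_tokens: int = 4096):
    """(2) Prioritize tight-SLO requests; (3) admit work only while both the GPU
    compute budget (tokens per iteration) and the KV cache allow."""
    heapq.heapify(queue)                          # ordered by slo_ms, tightest first
    batch, tokens_used, deferred = [], 0, []
    batch_slo_ms = queue[0].slo_ms if queue else float("inf")
    while queue and tokens_used < max_batch_tokens:
        req = heapq.heappop(queue)
        chunk = min(dynamic_chunk(req, tokens_used, batch_slo_ms),
                    max_batch_tokens - tokens_used)
        if chunk == 0 or req.kvc_blocks > kvc_free_blocks:
            deferred.append(req)                  # would violate the SLO or overflow the KVC
            continue
        batch.append((req.rid, chunk))
        tokens_used += chunk
        kvc_free_blocks -= req.kvc_blocks
    for req in deferred:                          # return deferred requests to the queue
        heapq.heappush(queue, req)
    return batch, tokens_used
```

In this sketch, the tightest SLO in the current batch bounds the per-iteration token budget, so long prompts are split into chunks that never push the iteration past what the most latency-sensitive admitted request can tolerate, while the KVC check prevents memory from becoming the bottleneck.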
Related papers
- Tempo: Application-aware LLM Serving with Mixed SLO Requirements [7.290735867969561]
We introduce Tempo, a scheduler designed to maximize service gain across diverse LLM workloads.
Our evaluation shows that Tempo improves end-to-end service gain by up to 8.3x and achieves up to 10.3x SLO goodput compared to state-of-the-art designs.
arXiv Detail & Related papers (2025-04-24T05:55:21Z)
- StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation [55.75008325187133]
Reinforcement learning (RL) has become the core post-training technique for large language models (LLMs).
StreamRL is designed with disaggregation from first principles to address two types of performance bottlenecks.
Experiments show that StreamRL improves throughput by up to 2.66x compared to existing state-of-the-art systems.
arXiv Detail & Related papers (2025-04-22T14:19:06Z)
- SLOs-Serve: Optimized Serving of Multi-SLO LLMs [11.102801440968706]
SLOs-Serve is a system designed for serving multi-stage large language model (LLM) requests with application- and stage-specific service level objectives (SLOs).
The key idea behind SLOs-Serve is to customize the allocation of tokens to meet these SLO requirements.
arXiv Detail & Related papers (2025-04-05T17:41:26Z)
- SpecServe: Efficient and SLO-Aware Large Language Model Serving with Adaptive Speculative Decoding [18.45994543035372]
Speculative decoding has emerged as a compelling technique to accelerate Large Language Model inference. Existing speculative decoding solutions often fail to adapt to varying workloads and system environments. We introduce SpecServe, an efficient LLM inference system that dynamically adjusts speculative strategies according to real-time request loads.
arXiv Detail & Related papers (2025-03-07T02:27:51Z)
- Autellix: An Efficient Serving Engine for LLM Agents as General Programs [59.673243129044465]
Large language model (LLM) applications are evolving beyond simple chatbots into dynamic, general-purpose agentic programs. Existing LLM serving systems ignore dependencies between programs and calls, missing significant opportunities for optimization. We introduce Autellix, an LLM serving system that treats programs as first-class citizens to minimize their end-to-end latencies.
arXiv Detail & Related papers (2025-02-19T18:59:30Z)
- Hierarchical Autoscaling for Large Language Model Serving with Chiron [2.767894999702707]
Large language model (LLM) serving is becoming an increasingly important workload for cloud providers. Previous autoscalers for LLM serving do not consider request SLOs, leading to unnecessary scaling and resource under-utilization. We introduce Chiron, an autoscaler that uses the idea of hierarchical backpressure estimated using queue size, utilization, and SLOs (see the sketch after this list).
arXiv Detail & Related papers (2025-01-14T12:57:40Z)
- ALISE: Accelerating Large Language Model Serving with Speculative Scheduling [7.367068885621016]
Large Language Models (LLMs) represent a revolutionary advancement in the contemporary landscape of artificial general intelligence (AGI).
In this paper, we propose a new efficient LLM inference serving framework, named ALISE.
We show that ALISE improves the throughput of inference serving by up to 1.8x and 2.1x under the same latency constraint on the Alpaca and ShareGPT datasets, respectively.
arXiv Detail & Related papers (2024-10-31T00:58:11Z)
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Efficiency Unleashed: Inference Acceleration for LLM-based Recommender Systems with Speculative Decoding [61.45448947483328]
We introduce Lossless Acceleration via Speculative Decoding for LLM-based Recommender Systems (LASER).
LASER features a Customized Retrieval Pool to enhance retrieval efficiency and Relaxed Verification to improve the acceptance rate of draft tokens.
LASER achieves a 3-5x speedup on public datasets and saves about 67% of computational resources during the online A/B test.
arXiv Detail & Related papers (2024-08-11T02:31:13Z)
- Queue Management for SLO-Oriented Large Language Model Serving [3.0134961904579094]
We propose QLM, a queue management system for large language model (LLM) serving. QLM maintains batch and interactive requests across different models and SLOs in a request queue. It uses a Request Waiting Time (RWT) Estimator that estimates the waiting times for requests in the request queue (see the sketch after this list).
arXiv Detail & Related papers (2024-06-05T21:17:34Z)
- DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference [23.49242865222089]
This paper introduces DeepSpeed-FastGen, a system that delivers up to 2.3x higher effective throughput, 2x lower latency on average, and up to 3.7x lower (token-level) tail latency.
We leverage a synergistic combination of DeepSpeed-MII and DeepSpeed-Inference to provide an efficient and easy-to-use serving system for large language models.
arXiv Detail & Related papers (2024-01-09T06:49:40Z)
- Distributed Inference and Fine-tuning of Large Language Models Over The Internet [91.00270820533272]
Large language models (LLMs) are useful in many NLP tasks and become more capable with size.
These models require high-end hardware, making them inaccessible to most researchers.
We develop fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput.
arXiv Detail & Related papers (2023-12-13T18:52:49Z)
- FNAS: Uncertainty-Aware Fast Neural Architecture Search [54.49650267859032]
Reinforcement learning (RL)-based neural architecture search (NAS) generally guarantees better convergence yet suffers from the requirement of huge computational resources.
We propose a general pipeline to accelerate the convergence of the rollout process as well as the RL process in NAS.
Experiments on the Mobile Neural Architecture Search (MNAS) search space show that the proposed Fast Neural Architecture Search (FNAS) accelerates the standard RL-based NAS process by 10x.
arXiv Detail & Related papers (2021-05-25T06:32:52Z)
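As referenced in the Chiron entry above, the following is a minimal sketch of a backpressure-style autoscaling signal that combines queue size, utilization, and SLOs. The pressure formula, the thresholds, and the function names are assumptions for illustration, not Chiron's actual design.

```python
# Hypothetical backpressure signal in the spirit of the Chiron summary: queue size,
# measured GPU utilization, and the SLO feed a single scaling decision. All names,
# the pressure formula, and the thresholds are illustrative assumptions.
def backpressure(queue_len: int, avg_service_s: float, slo_s: float, gpu_util: float) -> float:
    """Pressure > 1 suggests the backlog cannot drain within the SLO, or compute is saturated."""
    drain_time_s = queue_len * avg_service_s           # rough time to clear the current queue
    return max(drain_time_s / slo_s, gpu_util / 0.9)   # whichever constraint is tighter

def scaling_decision(replicas: int, pressure: float, low: float = 0.5, high: float = 1.0) -> int:
    """Scale out when pressure exceeds `high`; scale in (never below 1) when it stays under `low`."""
    if pressure > high:
        return replicas + 1
    if pressure < low and replicas > 1:
        return replicas - 1
    return replicas

# Example: 40 queued requests averaging 0.2 s each against a 6 s SLO at 70% GPU utilization.
print(scaling_decision(replicas=4, pressure=backpressure(40, 0.2, 6.0, 0.70)))  # -> 5
```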
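Similarly, for the QLM entry above, here is a small sketch of a waiting-time estimate over a request queue, in the spirit of an RWT estimator. The Queued dataclass, the cumulative service-time model, and the at_risk check are hypothetical, not QLM's implementation.

```python
# Hypothetical RWT-style estimate: a queued request's waiting time is approximated by
# the cumulative service time of the work ahead of it, and requests whose estimated
# completion would miss their SLO are flagged. Names and the service-time model are
# assumptions, not QLM's actual estimator.
from dataclasses import dataclass

@dataclass
class Queued:
    rid: int
    est_service_s: float   # estimated execution time once the request is scheduled
    slo_s: float           # end-to-end latency SLO for this request

def estimate_waits(queue: list[Queued]) -> dict[int, float]:
    """Estimated waiting time of each request: sum of service times ahead of it in the queue."""
    waits, elapsed = {}, 0.0
    for req in queue:
        waits[req.rid] = elapsed
        elapsed += req.est_service_s
    return waits

def at_risk(queue: list[Queued]) -> list[int]:
    """Request IDs whose estimated wait plus own service time would exceed their SLO."""
    waits = estimate_waits(queue)
    return [req.rid for req in queue if waits[req.rid] + req.est_service_s > req.slo_s]

# Example: the third request (1 s SLO) would wait 1.5 s behind two batch requests.
q = [Queued(1, 1.0, 30.0), Queued(2, 0.5, 30.0), Queued(3, 0.2, 1.0)]
print(at_risk(q))  # -> [3]
```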
This list is automatically generated from the titles and abstracts of the papers on this site.