LLM Serving Optimization with Variable Prefill and Decode Lengths
- URL: http://arxiv.org/abs/2508.06133v2
- Date: Sun, 31 Aug 2025 15:09:36 GMT
- Title: LLM Serving Optimization with Variable Prefill and Decode Lengths
- Authors: Meixuan Wang, Yinyu Ye, Zijie Zhou,
- Abstract summary: We study the problem of serving LLM (Large Language Model) requests where each request has heterogeneous prefill and decode lengths.<n>We show that this problem is NP-hard due to the interplay of placement constraints, precedence relationships, and linearly increasing memory usage.<n>We propose a novel algorithm based on a new selection metric that efficiently forms batches over time.
- Score: 6.937936394246354
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the problem of serving LLM (Large Language Model) requests where each request has heterogeneous prefill and decode lengths. In LLM serving, the prefill length corresponds to the input prompt length, which determines the initial memory usage in the KV cache. The decode length refers to the number of output tokens generated sequentially, with each additional token increasing the KV cache memory usage by one unit. Given a set of n requests, our goal is to schedule and process them to minimize the total completion time. We show that this problem is NP-hard due to the interplay of batching, placement constraints, precedence relationships, and linearly increasing memory usage. We then analyze commonly used scheduling strategies in practice, such as First-Come-First-Serve (FCFS) and Shortest-First (SF), and prove that their competitive ratios scale up sublinearly with the memory limit-a significant drawback in real-world settings where memory demand is large. To address this, we propose a novel algorithm based on a new selection metric that efficiently forms batches over time. We prove that this algorithm achieves a constant competitive ratio. Finally, we develop and evaluate a few algorithm variants inspired by this approach, including dynamic programming variants, local search methods, and an LP-based scheduler, demonstrating through comprehensive simulations that they outperform standard baselines while maintaining computational efficiency.
Related papers
- Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints [14.341123057506827]
Large Language Models (LLMs) are indispensable in today's applications, but their inference procedure demands significant computational resources.<n>This paper formulates LLM inference optimization as a multi-stage online scheduling problem.<n>We develop a fluid dynamics approximation to provide a tractable benchmark that guides algorithm design.
arXiv Detail & Related papers (2025-04-15T16:00:21Z) - LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention [26.54297116028556]
Large language models (LLMs) have shown remarkable potential in processing long sequences and complex reasoning tasks.<n>We introduce LServe, an efficient system that accelerates long-sequence LLM serving via hybrid sparse attention.<n>On average, LServe accelerates LLM prefilling by up to 2.9x and decoding by 1.3-2.1x over vLLM.
arXiv Detail & Related papers (2025-02-20T18:59:52Z) - Online Scheduling for LLM Inference with KV Cache Constraints [22.133592174540052]
Large Language Model (LLM) inference is an intensive process requiring efficient scheduling to optimize latency and resource utilization.<n>We propose a novel and theoretically scheduling algorithm that minimizes inference latency while effectively managing the KV cache's memory.
arXiv Detail & Related papers (2025-02-10T23:11:44Z) - Squeezed Attention: Accelerating Long Context Length LLM Inference [61.787865959140994]
We propose Squeezed Attention to accelerate applications where a large portion of the input context is fixed.<n>During inference, we compare query tokens from the user input with the centroids to predict which keys from the fixed context are semantically relevant.<n>We also present a hierarchical version of our algorithm which can reduce the complexity of attention from linear to logarithmic with respect to the fixed context length.
arXiv Detail & Related papers (2024-11-14T18:54:19Z) - Don't Stop Me Now: Embedding Based Scheduling for LLMs [22.099820814682513]
Size-based scheduling algorithms like Shortest Remaining Process Time (SRPT) aim to reduce average request completion time.
We propose a prediction-based SRPT variant with limited preemption designed to account for memory overhead in LLM systems.
arXiv Detail & Related papers (2024-10-01T19:51:07Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.<n>This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.<n>We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z) - UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs [111.05657299071648]
UIO-LLMs is an incremental optimization approach for memory-enhanced transformers under long-context settings.<n>We refine the training process using the Truncated Backpropagation Through Time (TBPTT) algorithm.<n>UIO-LLMs successfully handle long context, such as extending the context window of Llama2-7b-chat from 4K to 100K tokens with minimal 2% additional parameters.
arXiv Detail & Related papers (2024-06-26T08:44:36Z) - Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens.<n>Our method reduces prefill stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
arXiv Detail & Related papers (2024-06-24T03:59:17Z) - Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the limitations of large language models.
HomeR uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks.
A token reduction technique precedes each merging, ensuring memory usage efficiency.
arXiv Detail & Related papers (2024-04-16T06:34:08Z) - Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models [48.592730159983276]
Prefilling is the computation of the key-value cache for input tokens in the prompt prior to autoregressive generation.
For longer input prompt lengths, prefilling incurs a significant overhead on decoding time.
We propose Prepacking, a simple yet effective method to optimize prefilling computation.
arXiv Detail & Related papers (2024-04-15T07:49:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.