Efficient and Workload-Aware LLM Serving via Runtime Layer Swapping and KV Cache Resizing
- URL: http://arxiv.org/abs/2506.02006v1
- Date: Sat, 24 May 2025 06:12:31 GMT
- Title: Efficient and Workload-Aware LLM Serving via Runtime Layer Swapping and KV Cache Resizing
- Authors: Zhaoyuan Su, Tingfeng Lan, Zirui Wang, Juncheng Yang, Yue Cheng
- Abstract summary: MorphServe is a workload-aware LLM serving framework based on morphological adaptation. It reduces average SLO violations by 92.45 percent and improves the P95 TTFT latency by 2.2x-3.9x compared to full-precision serving.
- Score: 15.386746669464964
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Efficiently serving large language models (LLMs) under dynamic and bursty workloads remains a key challenge for real-world deployment. Existing serving frameworks and static model compression techniques fail to adapt to workload fluctuations, leading to either service-level objective (SLO) violations under full-precision serving or persistent accuracy degradation with static quantization. We present MorphServe, a dynamic, workload-aware LLM serving framework based on morphological adaptation. MorphServe introduces two asynchronous, token-level runtime mechanisms: quantized layer swapping, which selectively replaces less impactful layers with quantized alternatives during high-load periods, and pressure-aware KV cache resizing, which dynamically adjusts KV cache capacity in response to memory pressure. These mechanisms enable state-preserving transitions with minimal runtime overhead and are fully compatible with modern scheduling and attention techniques. Extensive experiments on Vicuna and Llama family models with real-world workloads demonstrate that MorphServe reduces average SLO violations by 92.45 percent and improves the P95 TTFT latency by 2.2x-3.9x compared to full-precision serving, without compromising generation quality. These results establish MorphServe as a practical and elastic solution for LLM deployment in dynamic environments.
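The two mechanisms can be read as a single pressure-driven control loop: under memory pressure, swap low-impact layers for quantized variants and shrink the KV cache; when pressure subsides, restore them. The sketch below is a minimal illustration of that idea only, not MorphServe's implementation; the class names, watermark thresholds, impact scores, and resizing factors are all assumptions introduced for illustration.

```python
# Minimal sketch of a pressure-driven adaptation loop in the spirit of
# MorphServe's quantized layer swapping and KV cache resizing.
# All class names, thresholds, impact scores, and resizing factors are
# illustrative assumptions, not the paper's actual implementation.
from dataclasses import dataclass
from typing import List


@dataclass
class Layer:
    index: int
    impact_score: float   # estimated accuracy impact of quantizing this layer
    quantized: bool = False


@dataclass
class ServingState:
    layers: List[Layer]
    kv_cache_blocks: int  # current KV cache capacity, in paged blocks
    min_kv_blocks: int
    max_kv_blocks: int


def memory_pressure(used_bytes: int, total_bytes: int) -> float:
    """Fraction of accelerator memory currently in use."""
    return used_bytes / total_bytes


def adapt(state: ServingState, used_bytes: int, total_bytes: int,
          high_watermark: float = 0.90, low_watermark: float = 0.70) -> None:
    """One control-loop step: swap layers and resize the KV cache under pressure."""
    pressure = memory_pressure(used_bytes, total_bytes)

    if pressure > high_watermark:
        # Replace the least impactful full-precision layer with a quantized variant.
        full_precision = [l for l in state.layers if not l.quantized]
        if full_precision:
            victim = min(full_precision, key=lambda l: l.impact_score)
            victim.quantized = True   # placeholder for loading quantized weights
        # Shrink the KV cache, but never below its floor.
        state.kv_cache_blocks = max(state.min_kv_blocks,
                                    int(state.kv_cache_blocks * 0.9))
    elif pressure < low_watermark:
        # Restore the most impactful quantized layer to full precision.
        quantized = [l for l in state.layers if l.quantized]
        if quantized:
            restore = max(quantized, key=lambda l: l.impact_score)
            restore.quantized = False
        # Grow the KV cache back toward its ceiling.
        state.kv_cache_blocks = min(state.max_kv_blocks,
                                    int(state.kv_cache_blocks * 1.1) + 1)
```

In the paper the transitions are asynchronous, token-level, and state-preserving for in-flight requests; the synchronous loop above only illustrates the decision logic.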
Related papers
- BucketServe: Bucket-Based Dynamic Batching for Smart and Efficient LLM Inference Serving [3.620158146761518]
BucketServe is a bucket-based dynamic batching framework designed to optimize inference performance. It can handle 1.93x more request load with an SLO attainment of 80% compared with UELLM and demonstrates 1.975x higher system load capacity than UELLM.
arXiv Detail & Related papers (2025-07-23T01:51:48Z) - Adaptive Request Scheduling for CodeLLM Serving with SLA Guarantees [6.110847503516972]
Code Large Language Models (CodeLLMs) are increasingly integrated into modern software development. Yet, serving them efficiently in resource-constrained, self-hosted environments remains a significant challenge. We propose SABER, a dynamic strategy that predicts per-request SLA feasibility and makes scheduling decisions in real time. Our results demonstrate that SLA-aware, adaptive scheduling is key to robust, high-performance CodeLLM serving.
arXiv Detail & Related papers (2025-06-24T14:44:33Z) - RAP: Runtime-Adaptive Pruning for LLM Inference [7.793451450945128]
We propose RAP, an elastic pruning framework driven by reinforcement learning (RL). RAP tracks the evolving ratio between model parameters and KV cache across practical execution. RAP outperforms state-of-the-art baselines and is the first approach to jointly consider model weights and the KV cache on the fly.
arXiv Detail & Related papers (2025-05-22T06:12:42Z) - FlexQuant: A Flexible and Efficient Dynamic Precision Switching Framework for LLM Quantization [18.041828697950812]
We propose FlexQuant, a dynamic precision-switching framework to optimize the trade-off between inference speed and accuracy. Our work provides a comprehensive analysis of quantization strategies, introduces a precision requirement model for optimal switching, and implements efficient fine-grained precision management. Experimental results demonstrate that FlexQuant achieves a 1.3x end-to-end speedup across diverse language tasks with negligible accuracy loss.
arXiv Detail & Related papers (2025-05-21T07:42:53Z) - Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving [22.66354939370058]
Apt-Serve is a framework designed to enhance effective throughput in large language model (LLM) inference serving systems. A new hybrid cache scheme combines the KV cache with a memory-efficient hidden cache for reusable input hidden state vectors, allowing larger batch sizes and improving request concurrency. We show that Apt-Serve achieves up to an 8.8x improvement in effective throughput compared to state-of-the-art inference serving systems.
arXiv Detail & Related papers (2025-04-10T06:51:23Z) - Model Hemorrhage and the Robustness Limits of Large Language Models [119.46442117681147]
Large language models (LLMs) demonstrate strong performance across natural language processing tasks, yet undergo significant performance degradation when modified for deployment. We define this phenomenon as model hemorrhage: performance decline caused by parameter alterations and architectural changes.
arXiv Detail & Related papers (2025-03-31T10:16:03Z) - MambaIC: State Space Models for High-Performance Learned Image Compression [53.991726013454695]
A high-performance image compression algorithm is crucial for real-time information transmission across numerous fields. Inspired by the effectiveness of state space models (SSMs) in capturing long-range dependencies, we leverage SSMs to address computational inefficiency in existing methods. We propose an enhanced image compression approach through refined context modeling, which we term MambaIC.
arXiv Detail & Related papers (2025-03-16T11:32:34Z) - Sparse Gradient Compression for Fine-Tuning Large Language Models [58.44973963468691]
Fine-tuning large language models (LLMs) for downstream tasks has become increasingly crucial due to their widespread use and the growing availability of open-source models. The high memory costs associated with fine-tuning remain a significant challenge, especially as models increase in size. We propose sparse gradient compression (SGC) to address these limitations.
arXiv Detail & Related papers (2025-02-01T04:18:28Z) - AmoebaLLM: Constructing Any-Shape Large Language Models for Efficient and Instant Deployment [13.977849745488339]
AmoebaLLM is a novel framework designed to enable the instant derivation of large language models of arbitrary shapes.
AmoebaLLM significantly facilitates rapid deployment tailored to various platforms and applications.
arXiv Detail & Related papers (2024-11-15T22:02:28Z) - LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs).
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining (a toy factorization sketch appears after this list).
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
arXiv Detail & Related papers (2024-07-30T17:59:08Z) - A-SDM: Accelerating Stable Diffusion through Model Assembly and Feature Inheritance Strategies [51.7643024367548]
The Stable Diffusion Model (SDM) is a prevalent and effective model for text-to-image (T2I) and image-to-image (I2I) generation.
This study focuses on reducing redundant computation in SDM and optimizing the model through both tuning and tuning-free methods.
arXiv Detail & Related papers (2024-05-31T21:47:05Z) - SpotServe: Serving Generative Large Language Models on Preemptible Instances [64.18638174004151]
SpotServe is the first distributed large language model serving system on preemptible instances.
We show that SpotServe can reduce the P99 tail latency by 2.4x-9.1x compared with the best existing LLM serving systems.
We also show that SpotServe can leverage the price advantage of preemptible instances, saving 54% monetary cost compared with only using on-demand instances.
arXiv Detail & Related papers (2023-11-27T06:31:17Z)
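To make the low-rank idea from the LoRC entry above concrete, the toy sketch below factorizes a key/value projection weight with a truncated SVD. Matrix sizes, the rank, and all function names are assumptions for illustration only; this is not LoRC's actual progressive compression strategy.

```python
# Toy illustration of replacing a KV projection weight with a truncated-SVD
# factorization, in the spirit of LoRC's low-rank KV compression.
# Sizes, rank, and names are illustrative assumptions, not LoRC's procedure.
import numpy as np


def low_rank_factorize(w: np.ndarray, rank: int):
    """Factor W (d_out x d_in) into A (d_out x r) @ B (r x d_in)."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # fold singular values into the left factor
    b = vt[:rank, :]
    return a, b


def relative_error(w: np.ndarray, a: np.ndarray, b: np.ndarray) -> float:
    """Relative Frobenius-norm error of the rank-r reconstruction."""
    return float(np.linalg.norm(w - a @ b) / np.linalg.norm(w))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # A random toy "key projection"; real trained weights typically have a
    # faster-decaying spectrum and compress far better than this.
    w_k = rng.standard_normal((1024, 1024)).astype(np.float32)
    a, b = low_rank_factorize(w_k, rank=128)
    print("fraction of parameters kept:", (a.size + b.size) / w_k.size)  # 0.25
    print("relative reconstruction error:", relative_error(w_k, a, b))
```

One reason such a factorization can shrink cache memory is that with W_k = A @ B, the key k = A @ (B @ x) can be cached in its r-dimensional intermediate form and expanded by A on the fly; whether and how this is exploited is specific to each method.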
This list is automatically generated from the titles and abstracts of the papers on this site.