Online Scheduling for LLM Inference with KV Cache Constraints
- URL: http://arxiv.org/abs/2502.07115v2
- Date: Thu, 13 Feb 2025 12:54:36 GMT
- Title: Online Scheduling for LLM Inference with KV Cache Constraints
- Authors: Patrick Jaillet, Jiashuo Jiang, Chara Podimata, Zijie Zhou,
- Abstract summary: Large Language Model (LLM) inference is an intensive process requiring efficient scheduling to optimize latency and resource utilization.
We propose novel and scheduling algorithms that minimize inference latency while effectively managing the KV cache's memory.
Our results offer a path toward more sustainable and cost-effective LLM deployment.
- Score: 22.155429544207827
- License:
- Abstract: Large Language Model (LLM) inference, where a trained model generates text one word at a time in response to user prompts, is a computationally intensive process requiring efficient scheduling to optimize latency and resource utilization. A key challenge in LLM inference is the management of the Key-Value (KV) cache, which reduces redundant computations but introduces memory constraints. In this work, we model LLM inference with KV cache constraints theoretically and propose novel batching and scheduling algorithms that minimize inference latency while effectively managing the KV cache's memory. We analyze both semi-online and fully online scheduling models, and our results are threefold. First, we provide a polynomial-time algorithm that achieves exact optimality in terms of average latency in the semi-online prompt arrival model. Second, in the fully online case with a stochastic prompt arrival, we introduce an efficient online scheduling algorithm with constant regret. Third, we prove that no algorithm (deterministic or randomized) can achieve a constant competitive ratio in fully online adversarial settings. Our empirical evaluations on a public LLM inference dataset, using the Llama-70B model on A100 GPUs, show that our approach significantly outperforms benchmark algorithms used currently in practice, achieving lower latency while reducing energy consumption. Overall, our results offer a path toward more sustainable and cost-effective LLM deployment.
Related papers
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters [27.656263126925815]
We study the scaling of inference-time computation in LLMs.
We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt.
arXiv Detail & Related papers (2024-08-06T17:35:05Z) - MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention [36.49445805074941]
MInference (Milliontokens Inference) is a sparse calculation method designed to accelerate pre-filling of long-sequence processing.
We demonstrate that MInference effectively reduces inference latency by up to 10x for pre-filling on an A100, while maintaining accuracy.
arXiv Detail & Related papers (2024-07-02T17:59:56Z) - Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning on Large-Language Models.
We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.
Our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU.
arXiv Detail & Related papers (2024-06-15T09:31:03Z) - Latency-aware Unified Dynamic Networks for Efficient Image Recognition [72.8951331472913]
LAUDNet is a framework to bridge the theoretical and practical efficiency gap in dynamic networks.
It integrates three primary dynamic paradigms-spatially adaptive computation, dynamic layer skipping, and dynamic channel skipping.
It can notably reduce the latency of models like ResNet by over 50% on platforms such as V100,3090, and TX2 GPUs.
arXiv Detail & Related papers (2023-08-30T10:57:41Z) - Accelerating Exact Combinatorial Optimization via RL-based
Initialization -- A Case Study in Scheduling [1.3053649021965603]
This research aims to develop an innovative approach that employs machine learning (ML) for addressing optimization problems.
We introduce a novel two-phase RL-to-ILP scheduling framework, which includes three steps: 1) solver as coarse-grain scheduler, 2) solution relaxation and 3) exact solving via ILP.
Our framework demonstrates the same scheduling performance compared with using exact scheduling methods while achieving up to 128 $times$ speed improvements.
arXiv Detail & Related papers (2023-08-19T15:52:43Z) - Maximize to Explore: One Objective Function Fusing Estimation, Planning,
and Exploration [87.53543137162488]
We propose an easy-to-implement online reinforcement learning (online RL) framework called textttMEX.
textttMEX integrates estimation and planning components while balancing exploration exploitation automatically.
It can outperform baselines by a stable margin in various MuJoCo environments with sparse rewards.
arXiv Detail & Related papers (2023-05-29T17:25:26Z) - Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM
Inference Pipeline [22.08897444328099]
Large language models (LLMs) have revolutionized the field of AI, demonstrating unprecedented capacity across various tasks.
In this paper, we propose an efficient LLM inference pipeline that harnesses the power of LLMs.
arXiv Detail & Related papers (2023-05-22T15:36:06Z) - Gradient Sparsification for Efficient Wireless Federated Learning with
Differential Privacy [25.763777765222358]
Federated learning (FL) enables distributed clients to collaboratively train a machine learning model without sharing raw data with each other.
As the model size grows, the training latency due to limited transmission bandwidth and private information degrades while using differential privacy (DP) protection.
We propose sparsification empowered FL framework wireless channels, in over to improve training efficiency without sacrificing convergence performance.
arXiv Detail & Related papers (2023-04-09T05:21:15Z) - Making Linear MDPs Practical via Contrastive Representation Learning [101.75885788118131]
It is common to address the curse of dimensionality in Markov decision processes (MDPs) by exploiting low-rank representations.
We consider an alternative definition of linear MDPs that automatically ensures normalization while allowing efficient representation learning.
We demonstrate superior performance over existing state-of-the-art model-based and model-free algorithms on several benchmarks.
arXiv Detail & Related papers (2022-07-14T18:18:02Z) - Accelerating Deep Learning Classification with Error-controlled
Approximate-key Caching [72.50506500576746]
We propose a novel caching paradigm, that we named approximate-key caching.
While approximate cache hits alleviate DL inference workload and increase the system throughput, they however introduce an approximation error.
We analytically model our caching system performance for classic LRU and ideal caches, we perform a trace-driven evaluation of the expected performance, and we compare the benefits of our proposed approach with the state-of-the-art similarity caching.
arXiv Detail & Related papers (2021-12-13T13:49:11Z) - Combining Deep Learning and Optimization for Security-Constrained
Optimal Power Flow [94.24763814458686]
Security-constrained optimal power flow (SCOPF) is fundamental in power systems.
Modeling of APR within the SCOPF problem results in complex large-scale mixed-integer programs.
This paper proposes a novel approach that combines deep learning and robust optimization techniques.
arXiv Detail & Related papers (2020-07-14T12:38:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.