Related papers: Preble: Efficient Distributed Prompt Scheduling for LLM Serving

Preble: Efficient Distributed Prompt Scheduling for LLM Serving

URL: http://arxiv.org/abs/2407.00023v1
Date: Wed, 8 May 2024 06:30:58 GMT
Title: Preble: Efficient Distributed Prompt Scheduling for LLM Serving
Authors: Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, Yiying Zhang,
Abstract summary: Many parts of prompts are repetitive across requests, and their attention results can be reused. This paper proposes Preble, the first distributed LLM serving platform that targets and optimize for prompt sharing. Preble outperforms the state-of-the-art average latency by 1.5X to 14.5X and p99 by 2X to 10X.
Score: 8.706905652975554
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Prompts to large language models (LLMs) have evolved beyond simple user questions. For LLMs to solve complex problems, today's practices include domain-specific instructions, illustration of tool usages, and long context, such as textbook chapters in prompts. As such, many parts of prompts are repetitive across requests, and their attention computation results can be reused. However, today's LLM serving systems treat every request in isolation, missing the opportunity of computation reuse. This paper proposes Preble, the first distributed LLM serving platform that targets and optimizes for prompt sharing. We perform a study on five popular LLM workloads. Based on our study results, we designed a distributed scheduling system that co-optimizes computation reuse and load balancing. Our evaluation of Preble on two to 8 GPUs with real workloads and request arrival patterns on two open-source LLM models shows that Preble outperforms the state-of-the-art average latency by 1.5X to 14.5X and p99 by 2X to 10X.

Related papers

Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition [95.54406667705999]
Pangu Embedded is an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs)<n>It addresses the significant computational costs and inference latency challenges prevalent in existing reasoning-optimized LLMs.<n>It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture.
arXiv Detail & Related papers (2025-05-28T14:03:02Z)
Autellix: An Efficient Serving Engine for LLM Agents as General Programs [59.673243129044465]
Large language model (LLM) applications are evolving beyond simple chatbots into dynamic, general-purpose agentic programs. Existing LLM serving systems ignore dependencies between programs and calls, missing significant opportunities for optimization. We introduce Autellix, an LLM serving system that treats programs as first-class citizens to minimize their end-to-end latencies.
arXiv Detail & Related papers (2025-02-19T18:59:30Z)
Multi-Bin Batching for Increasing LLM Inference Throughput [19.652542432683234]
Large language models (LL) grow in popularity improving the efficiency of their systems. requests is a critical step in scheduling jobs on servers. requests often have varying generation lengths, causing resource underutilization. We formalize this problem from a queueing-theoretic perspective, and aim to design a throughput control policy.
arXiv Detail & Related papers (2024-12-03T03:16:12Z)
ALISE: Accelerating Large Language Model Serving with Speculative Scheduling [7.367068885621016]
Large Language Models (LLMs) represent a revolutionary advancement in the contemporary landscape of artificial general intelligence (AGI) In this paper, we propose a new efficient LLM inference serving framework, named ALISE. We show that ALISE improves the throughput of inference serving by up to 1.8x and 2.1x under the same latency constraint on the Alpaca and ShareGPT datasets, respectively.
arXiv Detail & Related papers (2024-10-31T00:58:11Z)
Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs [3.7758841366694353]
We survey scheduling techniques from the literature and from practical serving systems. We find that schedulers from the literature often achieve good performance but introduce significant complexity. In contrast, schedulers in practical deployments often leave easy performance gains on the table but are easy to implement, deploy and configure.
arXiv Detail & Related papers (2024-10-23T13:05:46Z)
Don't Stop Me Now: Embedding Based Scheduling for LLMs [22.099820814682513]
Size-based scheduling algorithms like Shortest Remaining Process Time (SRPT) aim to reduce average request completion time. We propose a prediction-based SRPT variant with limited preemption designed to account for memory overhead in LLM systems.
arXiv Detail & Related papers (2024-10-01T19:51:07Z)
Efficient LLM Scheduling by Learning to Rank [19.33941579312897]
We show that it is possible to predict the relative ranks of output lengths in a batch of requests, using learning to rank. We develop a novel scheduler for LLM inference and serving that can approximate the shortest-job-first (SJF) schedule better than existing approaches.
arXiv Detail & Related papers (2024-08-28T13:35:54Z)
Efficiency Unleashed: Inference Acceleration for LLM-based Recommender Systems with Speculative Decoding [61.45448947483328]
We introduce Lossless Acceleration via Speculative Decoding for LLM-based Recommender Systems (LASER) LASER features a Customized Retrieval Pool to enhance retrieval efficiency and Relaxed Verification to improve the acceptance rate of draft tokens. LASER achieves a 3-5x speedup on public datasets and saves about 67% of computational resources during the online A/B test.
arXiv Detail & Related papers (2024-08-11T02:31:13Z)
Optimizing LLM Queries in Relational Workloads [58.254894049950366]
We show how to optimize Large Language Models (LLMs) inference for analytical workloads that invoke LLMs within relational queries. We implement these optimizations in Apache Spark, with vLLM as the model serving backend. We achieve up to 4.4x improvement in end-to-end latency on a benchmark of diverse LLM-based queries on real datasets.
arXiv Detail & Related papers (2024-03-09T07:01:44Z)
BiLLM: Pushing the Limit of Post-Training Quantization for LLMs [53.31402059062365]
BiLLM is a groundbreaking 1-bit post-training quantization scheme tailored for pretrained large language models. It achieves for the first time high-accuracy inference (e.g. 8.41 perplexity on LLaMA2-70B) with only 1.08-bit weights across various LLMs families.
arXiv Detail & Related papers (2024-02-06T09:26:34Z)
SpotServe: Serving Generative Large Language Models on Preemptible Instances [64.18638174004151]
SpotServe is the first distributed large language models serving system on preemptible instances. We show that SpotServe can reduce the P99 tail latency by 2.4 - 9.1x compared with the best existing LLM serving systems. We also show that SpotServe can leverage the price advantage of preemptive instances, saving 54% monetary cost compared with only using on-demand instances.
arXiv Detail & Related papers (2023-11-27T06:31:17Z)
LaGR-SEQ: Language-Guided Reinforcement Learning with Sample-Efficient Querying [71.86163159193327]
Large language models (LLMs) have recently demonstrated their impressive ability to provide context-aware responses via text. This ability could potentially be used to predict plausible solutions in sequential decision making tasks pertaining to pattern completion. We introduce LaGR, which uses this predictive ability of LLMs to propose solutions to tasks that have been partially completed by a primary reinforcement learning (RL) agent.
arXiv Detail & Related papers (2023-08-21T02:07:35Z)
Fast Distributed Inference Serving for Large Language Models [12.703624317418237]
We present FastServe, a distributed inference serving system for large language models (LLMs) FastServe exploits the autoregressive pattern of LLM inference to enable preemption at the granularity of each output token. We build a system prototype of FastServe and experimental results show that compared to the state-of-the-art solution vLLM, FastServe improves the throughput by up to 31.4x and 17.9x under the same average and tail latency requirements, respectively.
arXiv Detail & Related papers (2023-05-10T06:17:50Z)
Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes [54.13559879916708]
EVAPORATE is a prototype system powered by large language models (LLMs) Code synthesis is cheap, but far less accurate than directly processing each document with the LLM. We propose an extended code implementation, EVAPORATE-CODE+, which achieves better quality than direct extraction.
arXiv Detail & Related papers (2023-04-19T06:00:26Z)
$k$NN Prompting: Beyond-Context Learning with Calibration-Free Nearest Neighbor Inference [75.08572535009276]
In-Context Learning (ICL) formulates target tasks as prompt completion conditioned on in-context demonstrations. $k$NN Prompting first queries LLM with training data for distributed representations, then predicts test instances by simply referring to nearest neighbors. It significantly outperforms state-of-the-art calibration-based methods under comparable few-shot scenario.
arXiv Detail & Related papers (2023-03-24T06:16:29Z)
NumS: Scalable Array Programming for the Cloud [82.827921577004]
We present NumS, an array programming library which optimize NumPy-like expressions on task-based distributed systems. This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS) We show that LSHS enhances performance on Ray by decreasing network load by a factor of 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem.
arXiv Detail & Related papers (2022-06-28T20:13:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.