Fast Distributed Inference Serving for Large Language Models
- URL: http://arxiv.org/abs/2305.05920v3
- Date: Wed, 25 Sep 2024 05:57:51 GMT
- Title: Fast Distributed Inference Serving for Large Language Models
- Authors: Bingyang Wu, Yinmin Zhong, Zili Zhang, Shengyu Liu, Fangyue Liu, Yuanhang Sun, Gang Huang, Xuanzhe Liu, Xin Jin,
- Abstract summary: We present FastServe, a distributed inference serving system for large language models (LLMs)
FastServe exploits the autoregressive pattern of LLM inference to enable preemption at the granularity of each output token.
We build a system prototype of FastServe and experimental results show that compared to the state-of-the-art solution vLLM, FastServe improves the throughput by up to 31.4x and 17.9x under the same average and tail latency requirements, respectively.
- Score: 12.703624317418237
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) power a new generation of interactive AI applications exemplified by ChatGPT. The interactive nature of these applications demands low latency for LLM inference. Existing LLM serving systems use run-to-completion processing for inference jobs, which suffers from head-of-line blocking and long latency. We present FastServe, a distributed inference serving system for LLMs. FastServe exploits the autoregressive pattern of LLM inference to enable preemption at the granularity of each output token. FastServe uses preemptive scheduling to minimize latency with a novel skip-join Multi-Level Feedback Queue scheduler. Based on the new semi-information-agnostic setting of LLM inference, the scheduler leverages the input length information to assign an appropriate initial queue for each arrival job to join. The higher priority queues than the joined queue are skipped to reduce demotions. We design an efficient GPU memory management mechanism that proactively offloads and uploads intermediate state between GPU memory and host memory for LLM inference. We build a system prototype of FastServe and experimental results show that compared to the state-of-the-art solution vLLM, FastServe improves the throughput by up to 31.4x and 17.9x under the same average and tail latency requirements, respectively.
Related papers
- ALISE: Accelerating Large Language Model Serving with Speculative Scheduling [7.367068885621016]
Large Language Models (LLMs) represent a revolutionary advancement in the contemporary landscape of artificial general intelligence (AGI)
In this paper, we propose a new efficient LLM inference serving framework, named ALISE.
We show that ALISE improves the throughput of inference serving by up to 1.8x and 2.1x under the same latency constraint on the Alpaca and ShareGPT datasets, respectively.
arXiv Detail & Related papers (2024-10-31T00:58:11Z) - Fast Inference for Augmented Large Language Models [14.195265302357148]
Augmented Large Language Models (LLMs) enhance the capabilities of standalone LLMs by integrating external data sources through API calls.
Traditional size-based scheduling algorithms, such as Shortest Job First (SJF), become less effective at minimizing completion times.
We propose LAMPS, a novel LLM inference framework for augmented LLMs.
arXiv Detail & Related papers (2024-10-23T19:53:30Z) - Efficient LLM Scheduling by Learning to Rank [19.33941579312897]
We show that it is possible to predict the relative ranks of output lengths in a batch of requests, using learning to rank.
We develop a novel scheduler for LLM inference and serving that can approximate the shortest-job-first (SJF) schedule better than existing approaches.
arXiv Detail & Related papers (2024-08-28T13:35:54Z) - Preble: Efficient Distributed Prompt Scheduling for LLM Serving [8.706905652975554]
This paper proposes Preble, the first distributed LLM serving platform that targets and optimize for prompt sharing.
We designed a distributed scheduling system that co-optimizes KV state reuse and computation load-balancing with a new scheduling algorithm and a hierarchical scheduling mechanism.
Our evaluation of Preble with real workloads and request arrival patterns on two open-source LLMs shows that Preble outperforms the SOTA serving systems by 1.5X to 14.5X on average latency and 2X to 10X on p99 latency.
arXiv Detail & Related papers (2024-05-08T06:30:58Z) - Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction [8.705908108054878]
Large models (LLMs) have been driving a new wave of AI applications across numerous domains.
We present a speculative shortest-job-first (SSJF) scheduler that uses a light proxy model to predict LLM output sequence lengths.
arXiv Detail & Related papers (2024-04-12T14:46:15Z) - CALF: Aligning LLMs for Time Series Forecasting via Cross-modal Fine-Tuning [59.88924847995279]
We propose a novel Cross-Modal LLM Fine-Tuning (CALF) framework for MTSF.
To reduce the distribution discrepancy, we develop the cross-modal match module.
CALF establishes state-of-the-art performance for both long-term and short-term forecasting tasks.
arXiv Detail & Related papers (2024-03-12T04:04:38Z) - RelayAttention for Efficient Large Language Model Serving with Long System Prompts [59.50256661158862]
This paper aims to improve the efficiency of LLM services that involve long system prompts.
handling these system prompts requires heavily redundant memory accesses in existing causal attention algorithms.
We propose RelayAttention, an attention algorithm that allows reading hidden states from DRAM exactly once for a batch of input tokens.
arXiv Detail & Related papers (2024-02-22T18:58:28Z) - HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM
Inference [68.59839755875252]
HiRE comprises of two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one billion parameter model, HiRE applied to both the softmax as well as feedforward layers, achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by $1.47times$ on a single TPUv5e device.
arXiv Detail & Related papers (2024-02-14T18:04:36Z) - InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory [93.20588235940453]
In this paper, we introduce a training-free memory-based method, InfLLM.
InfLLM stores distant contexts into additional memory units and employs an efficient mechanism to lookup token-relevant units for attention.
Even when the sequence length is scaled to $1,024$K, InfLLM still effectively captures long-distance dependencies.
arXiv Detail & Related papers (2024-02-07T06:50:42Z) - SpotServe: Serving Generative Large Language Models on Preemptible
Instances [64.18638174004151]
SpotServe is the first distributed large language models serving system on preemptible instances.
We show that SpotServe can reduce the P99 tail latency by 2.4 - 9.1x compared with the best existing LLM serving systems.
We also show that SpotServe can leverage the price advantage of preemptive instances, saving 54% monetary cost compared with only using on-demand instances.
arXiv Detail & Related papers (2023-11-27T06:31:17Z) - NumS: Scalable Array Programming for the Cloud [82.827921577004]
We present NumS, an array programming library which optimize NumPy-like expressions on task-based distributed systems.
This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS)
We show that LSHS enhances performance on Ray by decreasing network load by a factor of 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem.
arXiv Detail & Related papers (2022-06-28T20:13:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.