Fast Distributed Inference Serving for Large Language Models
- URL: http://arxiv.org/abs/2305.05920v1
- Date: Wed, 10 May 2023 06:17:50 GMT
- Title: Fast Distributed Inference Serving for Large Language Models
- Authors: Bingyang Wu, Yinmin Zhong, Zili Zhang, Gang Huang, Xuanzhe Liu, Xin Jin
- Abstract summary: Large language models (LLMs) power a new generation of interactive AI applications exemplified by ChatGPT.
The interactive nature of these applications demands low job completion time (JCT) for model inference.
We present FastServe, a distributed inference serving system for LLMs.
- Score: 12.682341873843882
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) power a new generation of interactive AI
applications exemplified by ChatGPT. The interactive nature of these
applications demands low job completion time (JCT) for model inference. Existing
LLM serving systems use run-to-completion processing for inference jobs, which
suffers from head-of-line blocking and long JCT. We present FastServe, a
distributed inference serving system for LLMs. FastServe exploits the
autoregressive pattern of LLM inference to enable preemption at the granularity
of each output token. FastServe uses preemptive scheduling to minimize JCT with
a novel skip-join Multi-Level Feedback Queue (MLFQ) scheduler. Based on the new
semi-information-agnostic setting of LLM inference, the scheduler leverages
input length information to assign an appropriate initial queue for each
arriving job to join. Queues with a higher priority than the joined queue are
skipped to reduce demotions. We design an efficient GPU memory management
mechanism that proactively offloads and uploads intermediate states between GPU
memory and host memory for LLM inference. We build a system prototype of
FastServe based on NVIDIA FasterTransformer. Experimental results show that
compared to the state-of-the-art solution Orca, FastServe improves the average
and tail JCT by up to 5.1$\times$ and 6.4$\times$, respectively.
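The skip-join MLFQ idea can be made concrete with a short sketch. The Python below is a minimal, illustrative simulation under assumed names and a toy cost model (first-token time proportional to input length); it is not FastServe's actual scheduler, which preempts real GPU inference iterations.

```python
from dataclasses import dataclass

# Minimal sketch of a skip-join MLFQ scheduler. All names and the cost model
# are illustrative assumptions, not FastServe's implementation.

@dataclass
class Job:
    job_id: int
    input_length: int       # known at arrival: the "semi-information-agnostic" signal
    remaining_tokens: int   # unknown to the scheduler; used here only to simulate

class SkipJoinMLFQ:
    def __init__(self, num_queues: int = 4, base_quantum: float = 1.0):
        # Queue i has quantum base_quantum * 2**i; lower index = higher priority.
        self.quanta = [base_quantum * 2 ** i for i in range(num_queues)]
        self.queues = [[] for _ in range(num_queues)]

    def first_token_time(self, job: Job) -> float:
        # Toy cost model: the first iteration processes the whole prompt.
        return 0.001 * job.input_length

    def submit(self, job: Job) -> None:
        # Skip-join: enter the highest-priority queue whose quantum covers the
        # first output token, skipping queues that would cause instant demotion.
        level = 0
        while (level < len(self.quanta) - 1
               and self.quanta[level] < self.first_token_time(job)):
            level += 1
        self.queues[level].append(job)

    def step(self):
        # Run the head of the highest-priority non-empty queue for one quantum,
        # preempting at output-token granularity; demote it if unfinished.
        for level, queue in enumerate(self.queues):
            if not queue:
                continue
            job = queue.pop(0)
            tokens = max(1, int(self.quanta[level] / self.first_token_time(job)))
            job.remaining_tokens -= tokens
            if job.remaining_tokens > 0:
                self.queues[min(level + 1, len(self.queues) - 1)].append(job)
            return job
        return None
```

Because a long-prompt job joins a lower queue directly, it never burns through the short quanta of the top queues, which is what "skipping queues to reduce demotions" buys in this sketch.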
Related papers
- One Queue Is All You Need: Resolving Head-of-Line Blocking in Large Language Model Serving [2.9164564021428845]
We propose a multi-model queue management framework for large language model (LLM) serving.
QLM orchestrates the actions of multiple LLM Serving Operations (LSOs) to reduce HOL blocking and maximize SLO attainment.
Evaluation on heterogeneous GPU devices and models with real-world LLM serving dataset shows that QLM improves SLO attainment by 40-90% and throughput by 20-400%.
arXiv Detail & Related papers (2024-06-05T21:17:34Z)
- Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction [8.705908108054878]
Large language models (LLMs) have been driving a new wave of AI applications across numerous domains.
We present a speculative shortest-job-first (SSJF) scheduler that uses a light proxy model to predict LLM output sequence lengths.
arXiv Detail & Related papers (2024-04-12T14:46:15Z)
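The SSJF idea above is straightforward to sketch: order pending requests by predicted output length. In the sketch below, `predict_output_length` is a stand-in heuristic for the paper's learned proxy model; the class name and API are illustrative assumptions.

```python
import heapq

def predict_output_length(prompt: str) -> int:
    # Stand-in heuristic for the paper's proxy model, which would be a small
    # learned model invoked before scheduling.
    return max(8, 2 * len(prompt.split()))

class SSJFQueue:
    """Speculative shortest-job-first: serve the shortest predicted job first."""

    def __init__(self, predictor=predict_output_length):
        self.predictor = predictor
        self._heap = []
        self._counter = 0  # tie-breaker so heapq never compares prompts

    def submit(self, prompt: str) -> None:
        heapq.heappush(self._heap, (self.predictor(prompt), self._counter, prompt))
        self._counter += 1

    def next_job(self):
        # Requests with short predicted outputs jump ahead, which reduces
        # head-of-line blocking behind long generations.
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```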
- RelayAttention for Efficient Large Language Model Serving with Long System Prompts [59.50256661158862]
This paper aims to improve the efficiency of LLM services that involve long system prompts.
Handling these system prompts requires heavily redundant memory accesses in existing causal attention algorithms.
We propose RelayAttention, an attention algorithm that allows reading hidden states from DRAM exactly once for a batch of input tokens.
arXiv Detail & Related papers (2024-02-22T18:58:28Z)
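The "read the shared hidden states once" property can be illustrated with a small numpy sketch: attention over the shared system-prompt KV cache is computed once for the whole batch of queries, then merged with each request's own attention via the standard log-sum-exp recombination. Shapes and function names are illustrative; the paper implements this at the CUDA kernel level.

```python
import numpy as np

def attend(q, k, v):
    # q: (B, d); k, v: (L, d) shared across the batch, or (B, L, d) per-request.
    if k.ndim == 2:
        scores = q @ k.T / np.sqrt(q.shape[-1])   # shared KV is read once per batch
        weighted = lambda w: w @ v
    else:
        scores = np.einsum('bd,bld->bl', q, k) / np.sqrt(q.shape[-1])
        weighted = lambda w: np.einsum('bl,bld->bd', w, v)
    m = scores.max(axis=-1, keepdims=True)
    w = np.exp(scores - m)
    denom = w.sum(axis=-1, keepdims=True)
    out = weighted(w) / denom                     # normalized partial attention
    lse = (m + np.log(denom)).squeeze(-1)         # log-normalizer, kept for merging
    return out, lse

def relay_attention(q, k_sys, v_sys, k_req, v_req):
    # One pass over the shared system-prompt KV, one over per-request KV.
    out_sys, lse_sys = attend(q, k_sys, v_sys)
    out_req, lse_req = attend(q, k_req, v_req)
    # Log-sum-exp merge: equivalent to softmax over the concatenated context.
    lse = np.logaddexp(lse_sys, lse_req)
    return (np.exp(lse_sys - lse)[:, None] * out_sys
            + np.exp(lse_req - lse)[:, None] * out_req)
```

With `q` of shape (B, d), a shared `k_sys`/`v_sys`, and per-request `k_req`/`v_req` of shape (B, L, d), the merged result matches full attention over the concatenated context while touching the system-prompt cache only once per batch.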
- HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one-billion-parameter model, HiRE applied to both the softmax and feedforward layers achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by $1.47\times$ on a single TPUv5e device.
arXiv Detail & Related papers (2024-02-14T18:04:36Z)
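HiRE's two-stage structure lends itself to a compact sketch: score all columns with a cheap compressed copy of the weights, over-select candidates for recall, then compute exactly only on that subset. The int8 proxy and the oversampling factor below are assumptions for illustration, not the paper's exact compression scheme.

```python
import numpy as np

def quantize_int8(w):
    # Cheap compressed copy of the weights (per-tensor int8), built once offline.
    scale = np.abs(w).max() / 127.0
    return np.round(w / scale).astype(np.int8), scale

def hire_style_topk(x, w, w_q, scale, k, oversample=2):
    # Stage 1: approximate logits from the compressed weights.
    approx = x @ (w_q.astype(np.float32) * scale)
    # Over-select candidates so the true top-k is captured with high recall.
    m = min(approx.shape[-1], oversample * k)
    candidates = np.argpartition(approx, -m)[-m:]
    # Stage 2: exact computation restricted to the predicted subset.
    exact = x @ w[:, candidates]
    return candidates[np.argsort(exact)[-k:]]

# Usage: d-dim activations against a (d, vocab) output matrix.
rng = np.random.default_rng(0)
w = rng.standard_normal((512, 32000)).astype(np.float32)
w_q, scale = quantize_int8(w)
top = hire_style_topk(rng.standard_normal(512).astype(np.float32), w, w_q, scale, k=50)
```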
- InferCept: Efficient Intercept Support for Augmented Large Language Model Inference [9.669098954493114]
This paper presents InferCept, the first LLM inference framework targeting augmented LLMs.
InferCept minimizes the GPU resource waste caused by LLM interceptions and dedicates saved memory for serving more requests.
InferCept improves the overall serving throughput by 1.6x-2x and completes 2x more requests per second compared to the state-of-the-art LLM inference systems.
arXiv Detail & Related papers (2024-02-02T19:47:57Z)
- Efficient LLM inference solution on Intel GPU [19.154403468201924]
Transformer based Large Language Models (LLMs) have been widely used in many fields.
We propose an efficient LLM inference solution with low latency and high throughput.
Compared with the standard HuggingFace implementation, the proposed solution achieves up to 7x lower token latency and 27x higher throughput.
arXiv Detail & Related papers (2023-12-19T05:40:43Z)
- SpotServe: Serving Generative Large Language Models on Preemptible Instances [64.18638174004151]
SpotServe is the first distributed large language model serving system on preemptible instances.
We show that SpotServe can reduce the P99 tail latency by 2.4-9.1x compared with the best existing LLM serving systems.
We also show that SpotServe can leverage the price advantage of preemptible instances, saving 54% monetary cost compared with only using on-demand instances.
arXiv Detail & Related papers (2023-11-27T06:31:17Z)
- LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
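Structural pruning of "coupled structures" can be sketched on a single MLP block: removing hidden unit i must remove its row in the input projection, its bias entry, and its column in the output projection together. The magnitude-based importance score below is a simple stand-in for LLM-Pruner's actual criterion, and the function name is illustrative.

```python
import numpy as np

def prune_mlp_block(w_in, b_in, w_out, keep_ratio=0.75):
    # w_in: (hidden, d_model), b_in: (hidden,), w_out: (d_model, hidden).
    # Unit i is coupled across w_in[i, :], b_in[i], and w_out[:, i]; all three
    # must be removed together to keep the block's shapes consistent.
    importance = np.abs(w_in).sum(axis=1) + np.abs(w_out).sum(axis=0)
    n_keep = max(1, int(keep_ratio * importance.size))
    keep = np.sort(np.argsort(importance)[-n_keep:])
    return w_in[keep, :], b_in[keep], w_out[:, keep]
```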
- ZipLM: Inference-Aware Structured Pruning of Language Models [56.52030193434863]
We propose a novel structured compression approach for large language models (LLMs) called ZipLM.
ZipLM achieves state-of-the-art accuracy-versus-speedup trade-offs while matching a set of desired target runtime speedups.
ZipLM produces state-of-the-art compressed models across all settings.
arXiv Detail & Related papers (2023-02-07T18:55:28Z)
- NumS: Scalable Array Programming for the Cloud [82.827921577004]
We present NumS, an array programming library that optimizes NumPy-like expressions on task-based distributed systems.
This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS).
We show that LSHS enhances performance on Ray by decreasing network load by a factor of 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem.
arXiv Detail & Related papers (2022-06-28T20:13:40Z)
- Multi-model Machine Learning Inference Serving with GPU Spatial Partitioning [7.05946599544139]
High throughput machine learning (ML) inference servers are critical for online service applications.
These servers must provide a bounded latency for each request to support consistent service-level objectives (SLOs).
This paper proposes a new ML inference scheduling framework for multi-model ML inference servers.
arXiv Detail & Related papers (2021-09-01T04:46:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.