One Queue Is All You Need: Resolving Head-of-Line Blocking in Large Language Model Serving
- URL: http://arxiv.org/abs/2407.00047v1
- Date: Wed, 5 Jun 2024 21:17:34 GMT
- Title: One Queue Is All You Need: Resolving Head-of-Line Blocking in Large Language Model Serving
- Authors: Archit Patke, Dhemath Reddy, Saurabh Jha, Haoran Qiu, Christian Pinto, Shengkun Cui, Chandra Narayanaswami, Zbigniew Kalbarczyk, Ravishankar Iyer,
- Abstract summary: We propose QLM, a multi-model queue management framework for large language model (LLM) serving.
QLM orchestrates the actions of multiple LLM Serving Operations (LSOs) to reduce HOL blocking and maximize SLO attainment.
Evaluation on heterogeneous GPU devices and models with a real-world LLM serving dataset shows that QLM improves SLO attainment by 40-90% and throughput by 20-400%.
- Score: 2.9164564021428845
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have become an increasingly important workload for cloud providers catering to both enterprise and consumer applications. LLM inference requests from these applications have end-to-end latency SLOs that must be adhered to in production settings. However, existing LLM serving systems focus on optimization objectives such as request serving throughput or request execution latency rather than the end-to-end latency SLOs. Achieving end-to-end SLOs for latency-sensitive requests is challenging due to head-of-line (HOL) blocking in the request queue, which results from bursty arrival rates and insufficient resources. To address the above challenge, we propose QLM, a multi-model queue management framework for LLM serving. QLM uses stochastic programming to orchestrate the actions of multiple LLM Serving Operations (LSOs) to reduce HOL blocking and maximize SLO attainment. Specifically, QLM uses the following LSOs: model swapping, request eviction, GPU-CPU state swapping, load balancing, and warm model start. Evaluation on heterogeneous GPU devices and models with a real-world LLM serving dataset shows that QLM improves SLO attainment by 40-90% and throughput by 20-400% while maintaining or improving device utilization compared to other state-of-the-art LLM serving systems.
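For intuition only, here is a minimal Python sketch of deadline-ordered queue management with per-request LSO-style actions. It is not QLM's stochastic program; the `Request` fields, the fixed `swap_cost`, and the greedy evict-or-swap heuristic are all illustrative assumptions.
```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    deadline: float                             # absolute end-to-end SLO deadline (s)
    rid: int = field(compare=False)
    model: str = field(compare=False)
    est_service: float = field(compare=False)   # estimated execution time (s)

def plan(queue, loaded_model, now, swap_cost=5.0):
    """Drain a single deadline-ordered queue and pick a hypothetical
    LSO-style action per request: run it, swap the model in first, or
    evict requests whose SLO is already unattainable."""
    heap = list(queue)
    heapq.heapify(heap)                 # earliest deadline first
    t, actions = now, []
    while heap:
        r = heapq.heappop(heap)
        start = t + (swap_cost if r.model != loaded_model else 0.0)
        if start + r.est_service > r.deadline:
            actions.append(("evict", r.rid))    # SLO unattainable: shed load
            continue
        if r.model != loaded_model:
            actions.append(("swap_model", r.model))
            loaded_model, t = r.model, start
        actions.append(("run", r.rid))
        t += r.est_service
    return actions

reqs = [Request(10.0, 1, "llama-7b", 2.0), Request(4.0, 2, "llama-13b", 1.0)]
print(plan(reqs, loaded_model="llama-7b", now=0.0))
```
A single global queue is what makes this kind of planning possible: with one queue, the scheduler can trade off deadlines, swap costs, and evictions across all models at once rather than per-model.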
Related papers
- DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution [114.61347672265076]
Development of multimodal large language models (MLLMs) for real-world robots is challenging due to the typically limited computation and memory capacities available on robotic platforms.
We propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR) that automatically adjusts the size of the activated MLLM.
DeeR demonstrates significant reductions in the computational cost of the LLM (by 5.2-6.5x) and its GPU memory usage (by 2-6x) without compromising performance.
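A minimal sketch of the underlying confidence-based early-exit mechanism (the threshold, toy layers, and exit heads are assumptions, not DeeR's architecture):
```python
def dynamic_early_exit(x, layers, exit_heads, threshold=0.9):
    """Run layers sequentially; after each one, a lightweight exit head
    produces (prediction, confidence). Stop as soon as confidence clears
    the threshold, skipping the remaining (more expensive) layers."""
    for layer, head in zip(layers, exit_heads):
        x = layer(x)
        pred, conf = head(x)
        if conf >= threshold:
            return pred, conf          # early exit: cheaper inference
    return pred, conf                  # fell through: full-depth result

# Toy usage: each "layer" doubles the activation, each head grows more confident.
layers = [lambda x: x * 2 for _ in range(4)]
exit_heads = [lambda x, c=c: ("act", min(1.0, 0.5 + 0.2 * c)) for c in range(1, 5)]
print(dynamic_early_exit(1.0, layers, exit_heads))
```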
arXiv Detail & Related papers (2024-11-04T18:26:08Z) - Fast Inference for Augmented Large Language Models [14.195265302357148]
Augmented Large Language Models (LLMs) enhance the capabilities of standalone LLMs by integrating external data sources through API calls.
For augmented LLMs, traditional size-based scheduling algorithms such as Shortest Job First (SJF) become less effective at minimizing request completion times.
We propose LAMPS, a novel LLM inference framework for augmented LLMs.
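As a point of reference for the SJF claim, a toy sketch of plain size-based SJF (all field names assumed): it ranks purely by predicted length, with no notion of the memory a request keeps holding while it waits on an external API call, which is the gap LAMPS targets.
```python
def sjf_order(requests):
    """Classic Shortest Job First: rank by predicted service time only.
    For augmented LLMs this ignores that a request blocked on an API call
    keeps its KV cache resident, so 'short' jobs can still hog memory."""
    return sorted(requests, key=lambda r: r["predicted_tokens"])

def avg_completion_time(order, tokens_per_sec=50.0):
    """Mean completion time if requests run back-to-back in this order."""
    t, total = 0.0, 0.0
    for r in order:
        t += r["predicted_tokens"] / tokens_per_sec
        total += t
    return total / len(order)

reqs = [{"id": 1, "predicted_tokens": 500},
        {"id": 2, "predicted_tokens": 40},
        {"id": 3, "predicted_tokens": 120}]
print([r["id"] for r in sjf_order(reqs)], avg_completion_time(sjf_order(reqs)))
```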
arXiv Detail & Related papers (2024-10-23T19:53:30Z) - ELMS: Elasticized Large Language Models On Mobile Devices [5.689405542579458]
On-device Large Language Models (LLMs) are revolutionizing mobile AI, enabling applications such as UI automation while addressing privacy concerns.
We introduce ELMS, an on-device LLM service designed to provide elasticity in both the model and prompt dimensions.
A one-time neuron reordering technique, which utilizes the inherent permutation consistency within transformer models to create high-quality, elastic sub-models.
A dual-head compact language model, which efficiently refines prompts and coordinates the elastic adaptation between the model and the prompt.
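A small numpy sketch of one-time neuron reordering on a single FFN block: applying the same permutation to both weight matrices leaves the block's function unchanged, so prefix slices of the reordered weights form elastic sub-models. The importance score below is an assumed stand-in, not ELMS's criterion.
```python
import numpy as np

def reorder_ffn(w_in, w_out, importance):
    """Permute the hidden neurons of one FFN block so the most important
    come first. The block computes act(x @ w_in) @ w_out, which is invariant
    as long as the same permutation hits w_in's columns and w_out's rows."""
    perm = np.argsort(-importance)            # most important neuron first
    return w_in[:, perm], w_out[perm, :]

def elastic_slice(w_in, w_out, k):
    """Take the top-k neurons of a reordered block as an elastic sub-model."""
    return w_in[:, :k], w_out[:k, :]

rng = np.random.default_rng(0)
w_in, w_out = rng.normal(size=(16, 64)), rng.normal(size=(64, 16))
importance = np.abs(w_in).sum(axis=0) * np.abs(w_out).sum(axis=1)
w_in, w_out = reorder_ffn(w_in, w_out, importance)
small_in, small_out = elastic_slice(w_in, w_out, k=32)   # half-width sub-model
print(small_in.shape, small_out.shape)
```
Because the reordering is done once offline, switching between sub-model widths at runtime reduces to picking a slice boundary.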
arXiv Detail & Related papers (2024-09-08T06:32:08Z) - Efficient Prompting for LLM-based Generative Internet of Things [88.84327500311464]
Large language models (LLMs) have demonstrated remarkable capacities on various tasks, and integrating the capacities of LLMs into the Internet of Things (IoT) applications has drawn much research attention recently.
Due to security concerns, many institutions avoid accessing state-of-the-art commercial LLM services, requiring the deployment and utilization of open-source LLMs in a local network setting.
In this study, we propose an LLM-based Generative IoT (GIoT) system deployed in a local network setting.
arXiv Detail & Related papers (2024-06-14T19:24:00Z) - Llumnix: Dynamic Scheduling for Large Language Model Serving [17.919408899409113]
Inference serving for large language models (LLMs) is the key to unleashing their potential.
We introduce Llumnix, an LLM serving system that reacts to such heterogeneous and unpredictable requests by runtime rescheduling.
We show that Llumnix improves tail latencies by an order of magnitude, accelerates high-priority requests by up to 1.5x, and delivers up to 36% cost savings.
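A toy sketch of the rescheduling idea, under strong simplifying assumptions: real Llumnix migrates live requests together with their KV-cache state, whereas this sketch only moves waiting queue entries between instances.
```python
def rebalance(instances, threshold=2):
    """Migrate queued requests from the most- to the least-loaded instance
    until queue lengths differ by at most `threshold`. Returns the list of
    (request, source, destination) moves performed."""
    moves = []
    while True:
        src = max(instances, key=lambda i: len(instances[i]))
        dst = min(instances, key=lambda i: len(instances[i]))
        if len(instances[src]) - len(instances[dst]) <= threshold:
            return moves
        req = instances[src].pop()          # take a waiting request
        instances[dst].append(req)
        moves.append((req, src, dst))

queues = {"gpu0": list(range(8)), "gpu1": [], "gpu2": [100]}
print(rebalance(queues))
```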
arXiv Detail & Related papers (2024-06-05T13:20:18Z) - Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction [8.705908108054878]
Large language models (LLMs) have been driving a new wave of AI applications across numerous domains.
We present a speculative shortest-job-first (SSJF) scheduler that uses a light proxy model to predict LLM output sequence lengths.
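A minimal sketch of speculative SJF: schedule by a cheap proxy's predicted output length rather than the unknown true length. `proxy_predict` is a placeholder for the paper's light proxy model.
```python
import heapq

def ssjf_schedule(requests, proxy_predict):
    """Speculative Shortest Job First: order requests by the *predicted*
    output sequence length from a cheap proxy, since the true length is
    unknown until generation finishes."""
    heap = [(proxy_predict(r["prompt"]), i, r) for i, r in enumerate(requests)]
    heapq.heapify(heap)                     # index i breaks prediction ties
    while heap:
        _, _, r = heapq.heappop(heap)
        yield r

# Toy proxy: guess output length from prompt length.
reqs = [{"id": "a", "prompt": "x" * 400}, {"id": "b", "prompt": "hi"}]
print([r["id"] for r in ssjf_schedule(reqs, proxy_predict=len)])
```
Mispredictions only reorder the queue rather than break it, which is why an approximate proxy can still recover most of SJF's benefit.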
arXiv Detail & Related papers (2024-04-12T14:46:15Z) - LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement [79.31084387589968]
Pretrained large language models (LLMs) are currently state-of-the-art for solving the vast majority of natural language processing tasks.
We propose LLM2LLM, a data augmentation strategy that uses a teacher LLM to enhance a small seed dataset.
We achieve improvements up to 24.2% on the GSM8K dataset, 32.6% on CaseHOLD, 32.0% on SNIPS, 52.6% on TREC and 39.8% on SST-2 over regular fine-tuning in the low-data regime.
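A skeleton of the iterative loop as the abstract describes it, with every callable left as a user-supplied placeholder rather than a real trainer or teacher LLM:
```python
def llm2llm_loop(seed_data, train, evaluate, teacher_augment, rounds=3):
    """Fine-tune a student on the data, find examples it still gets wrong,
    and ask a teacher LLM to generate fresh variants of exactly those hard
    examples. All four callables are placeholders supplied by the caller."""
    data, student = list(seed_data), None
    for _ in range(rounds):
        student = train(data)
        wrong = [ex for ex in data if not evaluate(student, ex)]
        if not wrong:
            break
        data.extend(teacher_augment(wrong))   # targeted, not blanket, augmentation
    return student, data

# Tiny mock run: 'training' just memorizes, the 'teacher' rewrites failures.
student, data = llm2llm_loop(
    seed_data=[("2+2", "4"), ("3+3", "6")],
    train=lambda d: dict(d),
    evaluate=lambda s, ex: s.get(ex[0]) == ex[1],
    teacher_augment=lambda wrong: [(q + " ", a) for q, a in wrong],
)
print(len(data))
```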
arXiv Detail & Related papers (2024-03-22T08:57:07Z) - MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT [87.4910758026772]
"Bigger the better" has been the predominant trend in recent Large Language Models (LLMs) development.
This paper explores the "less is more" paradigm by addressing the challenge of designing accurate yet efficient Small Language Models (SLMs) for resource constrained devices.
arXiv Detail & Related papers (2024-02-26T18:59:03Z) - FederatedScope-LLM: A Comprehensive Package for Fine-tuning Large
Language Models in Federated Learning [70.38817963253034]
This paper first discusses these challenges of federated fine-tuning LLMs, and introduces our package FS-LLM as a main contribution.
We provide comprehensive federated parameter-efficient fine-tuning algorithm implementations and versatile programming interfaces for future extension in FL scenarios.
We conduct extensive experiments to validate the effectiveness of FS-LLM and benchmark advanced LLMs with state-of-the-art parameter-efficient fine-tuning algorithms in FL settings.
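To make the parameter-efficient FL setting concrete, here is a sketch of federated averaging restricted to LoRA-style adapter weights; this is the general idea, not FS-LLM's actual interface.
```python
import numpy as np

def fedavg_adapters(client_adapters, client_sizes):
    """FedAvg restricted to parameter-efficient adapter weights: each client
    fine-tunes only small LoRA-style matrices locally, and the server averages
    those (weighted by dataset size) instead of the full LLM."""
    total = sum(client_sizes)
    keys = client_adapters[0].keys()
    return {
        k: sum(w * a[k] for w, a in zip(client_sizes, client_adapters)) / total
        for k in keys
    }

clients = [{"lora_A": np.ones((4, 2)), "lora_B": np.zeros((2, 4))},
           {"lora_A": np.zeros((4, 2)), "lora_B": np.ones((2, 4))}]
avg = fedavg_adapters(clients, client_sizes=[100, 300])
print(avg["lora_A"][0, 0], avg["lora_B"][0, 0])   # 0.25 0.75
```
Exchanging only adapters keeps communication proportional to the adapter size rather than the full model, which is what makes federated LLM fine-tuning tractable.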
arXiv Detail & Related papers (2023-09-01T09:40:36Z) - LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
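A toy sketch of structurally pruning coupled structures: whole attention heads are ranked by a crude magnitude score and removed jointly across Q/K/V and the output projection. The scoring rule is an assumption; LLM-Pruner's actual dependency-aware criterion is more sophisticated.
```python
import numpy as np

def prune_heads(wq, wk, wv, wo, n_heads, keep):
    """Rank heads by a simple weight-magnitude score and drop the weakest,
    removing the coupled columns of Q, K, V and the matching rows of the
    output projection together so the module stays shape-consistent."""
    d = wq.shape[1] // n_heads                      # per-head width
    score = np.array([np.abs(wq[:, h*d:(h+1)*d]).sum() for h in range(n_heads)])
    kept = np.sort(np.argsort(-score)[:keep])       # indices of surviving heads
    cols = np.concatenate([np.arange(h*d, (h+1)*d) for h in kept])
    return wq[:, cols], wk[:, cols], wv[:, cols], wo[cols, :]

rng = np.random.default_rng(0)
dm, n_heads = 32, 8
wq, wk, wv = (rng.normal(size=(dm, dm)) for _ in range(3))
wo = rng.normal(size=(dm, dm))
wq, wk, wv, wo = prune_heads(wq, wk, wv, wo, n_heads, keep=6)
print(wq.shape, wo.shape)    # (32, 24) (24, 32)
```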
arXiv Detail & Related papers (2023-05-19T12:10:53Z) - Fast Distributed Inference Serving for Large Language Models [12.703624317418237]
We present FastServe, a distributed inference serving system for large language models (LLMs).
FastServe exploits the autoregressive pattern of LLM inference to enable preemption at the granularity of each output token.
We build a system prototype of FastServe and experimental results show that compared to the state-of-the-art solution vLLM, FastServe improves the throughput by up to 31.4x and 17.9x under the same average and tail latency requirements, respectively.
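A minimal sketch of token-granularity preemption: after every decoded token the scheduler re-picks, so short jobs can overtake long ones mid-generation. The least-tokens-done priority used here is a simplification of FastServe's skip-join MLFQ.
```python
import heapq

def token_level_serve(requests):
    """Schedule at the granularity of one output token: after every decoded
    token, put the request back and re-pick, exploiting the autoregressive
    pattern of LLM inference to preempt without losing work."""
    heap = [(0, i, r) for i, r in enumerate(requests)]  # (tokens_done, tiebreak, req)
    heapq.heapify(heap)
    order = []
    while heap:
        done, i, r = heapq.heappop(heap)
        order.append(r["id"])                # decode exactly one token of r
        if done + 1 < r["total_tokens"]:
            heapq.heappush(heap, (done + 1, i, r))
    return order

print(token_level_serve([{"id": "long", "total_tokens": 3},
                         {"id": "short", "total_tokens": 1}]))
```
With run-to-completion scheduling, "short" would wait behind all three of "long"'s tokens; with per-token preemption it finishes after the very first decoding step.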
arXiv Detail & Related papers (2023-05-10T06:17:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.