Enabling Efficient Serverless Inference Serving for LLM (Large Language Model) in the Cloud
- URL: http://arxiv.org/abs/2411.15664v1
- Date: Sat, 23 Nov 2024 22:19:37 GMT
- Title: Enabling Efficient Serverless Inference Serving for LLM (Large Language Model) in the Cloud
- Authors: Himel Ghosh
- Abstract summary: This review report discusses cold start latency in serverless inference and existing solutions, focusing on ServerlessLLM, a system designed to address the cold start problem in serverless inference for large language models.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This review report discusses the cold start latency in serverless inference and existing solutions. It particularly reviews the ServerlessLLM method, a system designed to address the cold start problem in serverless inference for large language models. Traditional serverless approaches struggle with high latency due to the size of LLM checkpoints and the overhead of initializing GPU resources. ServerlessLLM introduces a multi-tier checkpoint loading system, leveraging underutilized GPU memory and storage to reduce startup times by 6-8x compared to existing methods. It also proposes live inference migration and a startup-time-optimized model scheduler, ensuring efficient resource allocation and minimizing delays. This system significantly improves performance and scalability in serverless environments for LLM workloads. Besides ServerlessLLM, several other methods from the recent research literature, including RainbowCake, are reviewed in this paper. Further discussions explore how FaaS providers tackle cold starts, along with possible future directions.
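As a rough sketch of the multi-tier idea (not the paper's implementation; tier names, cached models, and load behavior below are all hypothetical), a loader can walk a checkpoint request down progressively slower tiers and stop at the first hit:

```python
# Hypothetical tiers, fastest first: checkpoints resident in GPU memory
# start near-instantly, host DRAM and local SSD are slower, and the
# remote object store (slowest) holds the authoritative copy of everything.
TIERS = [
    ("gpu_memory", {"llama-7b"}),
    ("host_dram", {"llama-7b", "opt-13b"}),
    ("local_ssd", {"llama-7b", "opt-13b", "mistral-7b"}),
    ("remote_store", None),  # None = can serve any model
]

def load_checkpoint(model_id: str) -> str:
    """Return the fastest tier that can serve `model_id`."""
    for tier_name, contents in TIERS:
        if contents is None or model_id in contents:
            # A real system would stream tensors from this tier into the GPU
            # here, and could promote the checkpoint into faster tiers.
            return tier_name
    raise KeyError(model_id)

print(load_checkpoint("mistral-7b"))  # -> local_ssd, skipping the remote fetch
```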
Related papers
- A Novel Hat-Shaped Device-Cloud Collaborative Inference Framework for Large Language Models [12.644230479753476]
Traditional cloud-based large language models (LLMs) meet high-accuracy requirements, but fall short of critical demands for low delay and enhanced privacy.
We propose HAT, a novel device-cloud collaborative inference framework that leverages the complementary strengths of U-shaped inference and speculative decoding.
We show that HAT achieves promising performance improvements, reducing time-to-first-token (TTFT) by 41% to 54% and time-between-tokens (TBT) by 41% to 77% compared to the baselines.
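The speculative-decoding half of HAT can be pictured with the standard draft-and-verify loop below. This is a generic greedy-verification sketch with toy stand-in models, not HAT's actual device-cloud protocol or its U-shaped split.

```python
import random

random.seed(0)

def draft_model(prefix, k=4):
    # Stand-in for a small on-device model proposing k cheap draft tokens.
    return [random.randint(0, 9) for _ in range(k)]

def target_model(prefix):
    # Stand-in for the large cloud model's greedy next-token choice.
    return (sum(prefix) + len(prefix)) % 10

def speculative_step(prefix):
    """Accept draft tokens while each matches the target's greedy choice."""
    accepted = list(prefix)
    for tok in draft_model(prefix):
        # A real system scores all draft positions in ONE target forward
        # pass; we check them one by one only for clarity.
        if tok == target_model(accepted):
            accepted.append(tok)  # draft accepted: a cheap token
        else:
            accepted.append(target_model(accepted))  # target's correction
            break
    return accepted

print(speculative_step([1, 2, 3]))
```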
arXiv Detail & Related papers (2025-03-23T10:54:58Z)
- DeServe: Towards Affordable Offline LLM Inference via Decentralization [42.8973830120059]
This paper presents the design of a decentralized offline serving system for large language model (LLM) inference.
By utilizing idle GPU resources, our proposed system, DeServe, decentralizes access to LLMs at a lower cost.
Experiments demonstrate that DeServe achieves a 6.7x-12.6x improvement in throughput over existing serving system baselines in such decentralized settings.
arXiv Detail & Related papers (2025-01-04T02:10:50Z)
- ScalingNote: Scaling up Retrievers with Large Language Models for Real-World Dense Retrieval [72.2676180980573]
Large Language Models (LLMs) have exhibited superior performance that can be leveraged for scaling up dense retrieval.
We propose ScalingNote, a two-stage method to exploit the scaling potential of LLMs for retrieval while maintaining online query latency.
Our two-stage scaling method outperforms end-to-end models and verifies the scaling law of dense retrieval with LLMs in industrial scenarios.
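One way to picture a "scale offline, stay fast online" split (our reading of the two stages, not ScalingNote's exact design) is a two-tower retriever where an expensive LLM embeds documents ahead of time and only a lightweight query encoder runs per query; random vectors stand in for both encoders below.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 64

def big_llm_embed(texts):
    # Offline stage: an expensive LLM-based document encoder
    # (random vectors stand in for real embeddings).
    return rng.standard_normal((len(texts), DIM))

def small_query_embed(text):
    # Online stage: a lightweight query encoder, ideally distilled to
    # live in the same embedding space as the big model.
    return rng.standard_normal(DIM)

docs = ["doc a", "doc b", "doc c"]
doc_vecs = big_llm_embed(docs)  # precomputed once, amortized offline

def search(query, top_k=2):
    q = small_query_embed(query)  # only this runs per query
    scores = doc_vecs @ q         # dot-product retrieval
    return [docs[i] for i in np.argsort(-scores)[:top_k]]

print(search("example query"))
```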
arXiv Detail & Related papers (2024-11-24T09:27:43Z)
- Search for Efficient Large Language Models [52.98684997131108]
Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research.
Weight pruning, quantization, and distillation have been embraced to compress LLMs, targeting memory reduction and inference acceleration.
Most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures.
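Of the compression techniques named above, weight pruning is the simplest to show concretely; the snippet below applies textbook unstructured magnitude pruning to a toy matrix, as general background rather than anything specific to this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 6))  # toy weight matrix

def magnitude_prune(weights, sparsity=0.5):
    """Zero out the `sparsity` fraction of weights with smallest |w|."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

W_pruned = magnitude_prune(W)
print(f"{np.mean(W_pruned == 0):.0%} of weights pruned")
```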
arXiv Detail & Related papers (2024-09-25T21:32:12Z)
- Efficiency Unleashed: Inference Acceleration for LLM-based Recommender Systems with Speculative Decoding [61.45448947483328]
We introduce Lossless Acceleration via Speculative Decoding for LLM-based Recommender Systems (LASER).
LASER features a Customized Retrieval Pool to enhance retrieval efficiency and Relaxed Verification to improve the acceptance rate of draft tokens.
LASER achieves a 3-5x speedup on public datasets and saves about 67% of computational resources during the online A/B test.
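Relaxed Verification can be approximated as accepting a draft token whenever it lands in the target model's top-k candidates instead of requiring an exact greedy match; the toy check below uses made-up logits and is our illustration, not LASER's implementation.

```python
import numpy as np

def relaxed_verify(draft_token, target_logits, top_k=3):
    """Accept the draft if it is among the target's top_k candidates,
    rather than insisting it equal the single greedy argmax."""
    top_candidates = np.argsort(-target_logits)[:top_k]
    return draft_token in top_candidates

logits = np.array([0.1, 2.5, 1.9, 0.3, 2.2])  # toy target logits
print(relaxed_verify(2, logits))  # True: token 2 is the 3rd-best candidate
print(relaxed_verify(3, logits))  # False: token 3 falls outside the top 3
```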
arXiv Detail & Related papers (2024-08-11T02:31:13Z)
- One Queue Is All You Need: Resolving Head-of-Line Blocking in Large Language Model Serving [2.9164564021428845]
We propose a multi-model queue management framework for large language model (LLM) serving.
QLM orchestrates the actions of multiple LLM Serving Operations (LSOs) to reduce head-of-line (HOL) blocking and maximize service-level objective (SLO) attainment.
Evaluation on heterogeneous GPU devices and models with a real-world LLM serving dataset shows that QLM improves SLO attainment by 40-90% and throughput by 20-400%.
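The single-queue idea can be sketched as ordering all waiting requests by deadline so an urgent request never sits behind a long-running one; the earliest-deadline-first toy below is illustrative and omits QLM's LSO machinery.

```python
import heapq

# One global queue across all models: (slo_deadline_s, request_id, model).
queue = []
heapq.heappush(queue, (30.0, "req-long", "llama-70b"))  # arrived first
heapq.heappush(queue, (2.0, "req-chat", "llama-7b"))    # tight SLO
heapq.heappush(queue, (10.0, "req-batch", "llama-7b"))

while queue:
    deadline, req, model = heapq.heappop(queue)
    # The most urgent request is dispatched regardless of arrival order,
    # so the 70B job no longer blocks the 2-second-SLO chat request.
    print(f"dispatch {req} (deadline {deadline}s) to a {model} instance")
```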
arXiv Detail & Related papers (2024-06-05T21:17:34Z)
- SPES: Towards Optimizing Performance-Resource Trade-Off for Serverless Functions [31.01399126339857]
Serverless computing is gaining traction due to its efficiency and ability to harness on-demand cloud resources.
Existing solutions tend to use over-simplistic strategies for function pre-loading/unloading without fully exploiting invocation patterns.
We propose SPES, the first differentiated scheduler for runtime cold start mitigation by optimizing serverless function provision.
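A caricature of pattern-aware pre-provisioning: predict the next invocation from recent inter-arrival gaps and warm the function just beforehand. The mean-gap-minus-margin heuristic below is ours, not SPES's optimizer.

```python
from statistics import mean

def next_prewarm_time(invocation_times, margin=1.0):
    """Predict when to pre-load a function from its invocation history."""
    gaps = [b - a for a, b in zip(invocation_times, invocation_times[1:])]
    # Warm up slightly before the predicted next call to hide the cold start.
    return invocation_times[-1] + mean(gaps) - margin

history = [0.0, 60.5, 119.8, 180.2]  # seconds: roughly once a minute
print(f"pre-load at t={next_prewarm_time(history):.1f}s")
```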
arXiv Detail & Related papers (2024-03-26T10:28:41Z)
- Communication Efficient ConFederated Learning: An Event-Triggered SAGA Approach [67.27031215756121]
Federated learning (FL) is a machine learning paradigm that targets model training without gathering the local data over various data sources.
Standard FL, which employs a single server, can only support a limited number of users, leading to degraded learning capability.
In this work, we consider a multi-server FL framework, referred to as Confederated Learning (CFL), in order to accommodate a larger number of users.
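The event-triggered part can be sketched independently of SAGA: a user uploads its local update only once it has drifted far enough from the last transmitted one, trading a tolerance threshold for fewer communication rounds. Dimensions and the threshold below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
local = np.zeros(8)       # this user's evolving local model update
last_sent = np.zeros(8)   # the update the server last received from us
THRESHOLD = 0.5           # arbitrary trigger level

for step in range(6):
    local += 0.2 * rng.standard_normal(8)  # local training progress
    drift = np.linalg.norm(local - last_sent)
    if drift > THRESHOLD:                  # event-triggered upload
        last_sent = local.copy()
        print(f"step {step}: send update (drift {drift:.2f})")
    else:
        print(f"step {step}: skip upload (drift {drift:.2f})")
```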
arXiv Detail & Related papers (2024-02-28T03:27:10Z)
- RelayAttention for Efficient Large Language Model Serving with Long System Prompts [59.50256661158862]
This paper aims to improve the efficiency of LLM services that involve long system prompts.
Handling these system prompts requires heavily redundant memory accesses in existing causal attention algorithms.
We propose RelayAttention, an attention algorithm that allows reading hidden states from DRAM exactly once for a batch of input tokens.
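The core trick, reading the shared system-prompt KV from DRAM once per batch rather than once per request, can be mimicked in plain NumPy: attend to the shared prefix with one batched matmul, attend to each request's own context separately, then merge the two partial softmaxes. This is a numerical sketch of the idea, not the paper's kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
B, L_SYS, L_CTX, D = 4, 16, 8, 32  # batch, shared prefix, per-request, dim

Q = rng.standard_normal((B, D))             # one query vector per request
K_sys = rng.standard_normal((L_SYS, D))     # ONE shared system-prompt KV
V_sys = rng.standard_normal((L_SYS, D))
K_ctx = rng.standard_normal((B, L_CTX, D))  # per-request context KV
V_ctx = rng.standard_normal((B, L_CTX, D))

def lse(scores):
    """Stable log-sum-exp over the last axis, kept broadcastable."""
    m = scores.max(axis=-1, keepdims=True)
    return m + np.log(np.exp(scores - m).sum(axis=-1, keepdims=True))

# Pass 1: the shared prefix is read ONCE for the whole batch (one matmul).
s_sys = Q @ K_sys.T
l_sys = lse(s_sys)
o_sys = np.exp(s_sys - l_sys) @ V_sys

# Pass 2: each request attends to its own context.
s_ctx = np.einsum("bd,bld->bl", Q, K_ctx)
l_ctx = lse(s_ctx)
o_ctx = np.einsum("bl,bld->bd", np.exp(s_ctx - l_ctx), V_ctx)

# Merge the partial softmaxes as if attention ran over both segments at once.
l_all = np.logaddexp(l_sys, l_ctx)
out = np.exp(l_sys - l_all) * o_sys + np.exp(l_ctx - l_all) * o_ctx

# Sanity check against naive per-request attention over concatenated KV.
for b in range(B):
    s = Q[b] @ np.vstack([K_sys, K_ctx[b]]).T
    w = np.exp(s - s.max()); w /= w.sum()
    assert np.allclose(out[b], w @ np.vstack([V_sys, V_ctx[b]]))
```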
arXiv Detail & Related papers (2024-02-22T18:58:28Z)
- ServerlessLLM: Low-Latency Serverless Inference for Large Language Models [14.754839787728912]
ServerlessLLM is a distributed system designed to support low-latency serverless inference for Large Language Models (LLMs).
By harnessing the substantial near-GPU storage and memory capacities of inference servers, ServerlessLLM achieves effective local checkpoint storage.
Comprehensive evaluations, including microbenchmarks and real-world scenarios, demonstrate that ServerlessLLM dramatically outperforms state-of-the-art serverless systems.
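Local checkpoint storage pays off only if the scheduler prefers servers where a checkpoint is already warm. The sketch below shows such startup-time-aware placement; the tier load-time estimates and cluster state are invented.

```python
# Invented per-tier load-time estimates (seconds) for one checkpoint.
LOAD_TIME = {"gpu_memory": 0.1, "host_dram": 2.0,
             "local_ssd": 8.0, "remote_store": 45.0}

# Invented cluster state: the fastest tier each server could serve from.
servers = {
    "node-1": "remote_store",  # cold: would have to download the checkpoint
    "node-2": "local_ssd",
    "node-3": "host_dram",     # warm: checkpoint cached in host memory
}

def pick_server(cluster):
    """Choose the server with the lowest estimated startup time."""
    return min(cluster, key=lambda s: LOAD_TIME[cluster[s]])

print(pick_server(servers))  # -> node-3, the warmest replica
```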
arXiv Detail & Related papers (2024-01-25T17:55:07Z)
- Efficient LLM inference solution on Intel GPU [19.154403468201924]
Transformer-based Large Language Models (LLMs) have been widely used in many fields.
We propose an efficient LLM inference solution with low latency and high throughput.
Compared with the standard HuggingFace implementation, the proposed solution achieves up to 7x lower token latency and 27x higher throughput.
arXiv Detail & Related papers (2023-12-19T05:40:43Z)
- SpotServe: Serving Generative Large Language Models on Preemptible Instances [64.18638174004151]
SpotServe is the first distributed serving system for large language models on preemptible instances.
We show that SpotServe can reduce the P99 tail latency by 2.4-9.1x compared with the best existing LLM serving systems.
We also show that SpotServe can leverage the price advantage of preemptible instances, saving 54% monetary cost compared with only using on-demand instances.
arXiv Detail & Related papers (2023-11-27T06:31:17Z)
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential of vast, untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and the variability of peers and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- Fast Distributed Inference Serving for Large Language Models [12.703624317418237]
We present FastServe, a distributed inference serving system for large language models (LLMs).
FastServe exploits the autoregressive pattern of LLM inference to enable preemption at the granularity of each output token.
We build a system prototype of FastServe and experimental results show that compared to the state-of-the-art solution vLLM, FastServe improves the throughput by up to 31.4x and 17.9x under the same average and tail latency requirements, respectively.
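Token-granularity preemption means the scheduler re-decides after every decoded token instead of running each request to completion. The loop below uses least-attained-service-first as a simple stand-in for FastServe's scheduler and shows why short outputs stop queuing behind long ones.

```python
# Each request maps to its remaining output tokens. The scheduler may
# switch requests after EVERY generated token, not only at completion.
requests = {"short": 3, "long": 12}
progress = {r: 0 for r in requests}  # tokens generated so far

while requests:
    # Least-attained-service first: favor the request with the fewest
    # tokens so far (a stand-in for FastServe's actual policy).
    r = min(requests, key=lambda x: progress[x])
    progress[r] += 1  # decode exactly one token, then reschedule
    requests[r] -= 1
    if requests[r] == 0:
        print(f"{r} finished after {sum(progress.values())} decode steps")
        del requests[r]
```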
arXiv Detail & Related papers (2023-05-10T06:17:50Z)