AdaServe: SLO-Customized LLM Serving with Fine-Grained Speculative Decoding
- URL: http://arxiv.org/abs/2501.12162v1
- Date: Tue, 21 Jan 2025 14:15:01 GMT
- Title: AdaServe: SLO-Customized LLM Serving with Fine-Grained Speculative Decoding
- Authors: Zikun Li, Zhuofu Chen, Remi Delacourt, Gabriele Oliaro, Zeyu Wang, Qinghan Chen, Shuhuai Lin, April Yang, Zhihao Zhang, Zhuoming Chen, Sean Lai, Xupeng Miao, Zhihao Jia
- Abstract summary: AdaServe is the first LLM serving system to support SLO customization through fine-grained speculative decoding. AdaServe achieves up to 73% higher SLO attainment and 74% higher goodput compared to state-of-the-art systems.
- Score: 12.377283389338709
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces AdaServe, the first LLM serving system to support SLO customization through fine-grained speculative decoding. AdaServe leverages the logits of a draft model to predict the speculative accuracy of tokens and employs a theoretically optimal algorithm to construct token trees for verification. To accommodate diverse SLO requirements without compromising throughput, AdaServe employs a speculation-and-selection scheme that first constructs candidate token trees for each request and then dynamically selects tokens to meet individual SLO constraints while optimizing throughput. Comprehensive evaluations demonstrate that AdaServe achieves up to 73% higher SLO attainment and 74% higher goodput compared to state-of-the-art systems. These results underscore AdaServe's potential to enhance the efficiency and adaptability of LLM deployments across varied application scenarios.
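The abstract's idea of using draft-model probabilities to decide which tokens go into a verification tree can be illustrated with a small sketch. This is not AdaServe's algorithm, which the abstract does not spell out. It assumes a hypothetical `draft` callable that maps a token prefix to next-token probabilities, and it assumes that a node's path probability (the product of draft probabilities along its path) approximates its chance of being accepted. Under those assumptions, the expected number of accepted tokens is the sum of path probabilities over the tree, and a best-first expansion picks the `budget` highest-probability nodes.

```python
import heapq
from typing import Callable, Dict, List, Tuple

# Hypothetical draft-model interface (an assumption for illustration):
# maps a token prefix to a {token: probability} dictionary.
DraftFn = Callable[[Tuple[str, ...]], Dict[str, float]]


def build_token_tree(draft: DraftFn, budget: int) -> List[Tuple[Tuple[str, ...], float]]:
    """Select up to `budget` tree nodes in order of decreasing path probability."""
    # heapq is a min-heap, so probabilities are stored negated.
    heap: List[Tuple[float, Tuple[str, ...]]] = []
    for tok, p in draft(()).items():
        heapq.heappush(heap, (-p, (tok,)))

    selected: List[Tuple[Tuple[str, ...], float]] = []
    while heap and len(selected) < budget:
        neg_p, path = heapq.heappop(heap)
        selected.append((path, -neg_p))
        # A child becomes a candidate only after its parent is selected,
        # so the selected nodes always form a connected tree.
        for tok, p in draft(path).items():
            heapq.heappush(heap, (neg_p * p, path + (tok,)))
    return selected


def toy_draft(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Toy stand-in for a draft model; the probabilities are made up."""
    table = {
        (): {"the": 0.6, "a": 0.4},
        ("the",): {"cat": 0.5, "dog": 0.5},
        ("a",): {"cat": 0.9, "dog": 0.1},
    }
    return table.get(prefix, {})


tree = build_token_tree(toy_draft, budget=3)
for path, prob in tree:
    print(" ".join(path), round(prob, 2))
print("estimated accepted tokens:", round(sum(p for _, p in tree), 2))
```

A per-request `budget` is one place where SLO awareness could enter such a scheme: a request with a tighter latency target could be given more tree nodes per verification step. That mapping is likewise an assumption of this sketch, not a detail taken from the abstract.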
Related papers
- SLOs-Serve: Optimized Serving of Multi-SLO LLMs [11.102801440968706]
SLOs-Serve is a system designed for serving multi-stage large language model (LLM) requests with application- and stage-specific service level objectives (SLOs)
The key idea behind SLOs-Serve is to customize the allocation of tokens to meet these SLO requirements.
arXiv Detail & Related papers (2025-04-05T17:41:26Z) - AccelGen: Heterogeneous SLO-Guaranteed High-Throughput LLM Inference Serving for Diverse Applications [8.964981700274059]
We propose AccelGen, a high-throughput inference serving system with heterogeneous SLO guarantees for diverse applications.
Trace real experiments demonstrate that AccelGen achieves 1.42-11.21X higher throughput, 1.43-13.71X higher goodput, 37-90% higher SLO attainment, and 1.61-12.22X lower response latency compared to the state-of-the-art approaches.
arXiv Detail & Related papers (2025-03-17T21:47:43Z) - SpecServe: Efficient and SLO-Aware Large Language Model Serving with Adaptive Speculative Decoding [18.45994543035372]
Speculative decoding has emerged as a compelling technique to accelerate Large Language Model inference.
Existing speculative decoding solutions often fail to adapt to varying workloads and system environments.
We introduce SpecServe, an efficient LLM inference system that dynamically adjusts speculative strategies according to real-time request loads.
arXiv Detail & Related papers (2025-03-07T02:27:51Z) - LLM-Lasso: A Robust Framework for Domain-Informed Feature Selection and Regularization [59.75242204923353]
We introduce LLM-Lasso, a framework that leverages large language models (LLMs) to guide feature selection in Lasso regression.
LLMs generate penalty factors for each feature, which are converted into weights for the Lasso penalty using a simple, tunable model.
Features identified as more relevant by the LLM receive lower penalties, increasing their likelihood of being retained in the final model.
arXiv Detail & Related papers (2025-02-15T02:55:22Z) - Hierarchical Autoscaling for Large Language Model Serving with Chiron [2.767894999702707]
Large language model (LLM) serving is becoming an increasingly important workload for cloud providers. Previous autoscalers for LLM serving do not consider request SLOs leading to unnecessary scaling and resource under-utilization. We introduce Chiron, an autoscaler that uses the idea of hierarchical backpressure estimated using queue size, utilization, and SLOs.
arXiv Detail & Related papers (2025-01-14T12:57:40Z) - Federated Fine-Tuning of LLMs: Framework Comparison and Research Directions [59.5243730853157]
Federated learning (FL) provides a privacy-preserving solution for fine-tuning pre-trained large language models (LLMs) using distributed private datasets. This article conducts a comparative analysis of three advanced federated LLM (FedLLM) frameworks that integrate knowledge distillation (KD) and split learning (SL) to mitigate these issues.
arXiv Detail & Related papers (2025-01-08T11:37:06Z) - Adaptive Pruning for Large Language Models with Structural Importance Awareness [66.2690963378878]
Large language models (LLMs) have significantly improved language understanding and generation capabilities. LLMs are difficult to deploy on resource-constrained edge devices due to their high computational and storage resource demands. We propose structurally-aware adaptive pruning (SAAP) to significantly reduce the computational and memory costs while maintaining model performance.
arXiv Detail & Related papers (2024-12-19T18:08:04Z) - LLM-based Optimization of Compound AI Systems: A Survey [64.39860384538338]
In a compound AI system, components such as an LLM call, a retriever, a code interpreter, or tools are interconnected.
Recent advancements enable end-to-end optimization of these parameters using an LLM.
This paper presents a survey of the principles and emerging trends in LLM-based optimization of compound AI systems.
arXiv Detail & Related papers (2024-10-21T18:06:25Z) - Revisiting SLO and Goodput Metrics in LLM Serving [17.777554083636716]
Service level objectives (SLOs) and goodput (the number of requests that meet SLOs per second) are introduced to evaluate the performance of LLM serving.
Existing metrics fail to capture the nature of user experience.
We propose a unified metric framework, smooth goodput, which incorporates SLOs and goodput to reflect the nature of user experience.
arXiv Detail & Related papers (2024-10-18T08:05:37Z) - Efficiency Unleashed: Inference Acceleration for LLM-based Recommender Systems with Speculative Decoding [61.45448947483328]
We introduce Lossless Acceleration via Speculative Decoding for LLM-based Recommender Systems (LASER)
LASER features a Customized Retrieval Pool to enhance retrieval efficiency and Relaxed Verification to improve the acceptance rate of draft tokens.
LASER achieves a 3-5x speedup on public datasets and saves about 67% of computational resources during the online A/B test.
arXiv Detail & Related papers (2024-08-11T02:31:13Z) - Large Language Model as a Catalyst: A Paradigm Shift in Base Station Siting Optimization [62.16747639440893]
Large language models (LLMs) and their associated technologies advance, particularly in the realms of prompt engineering and agent engineering. Our proposed framework incorporates retrieval-augmented generation (RAG) to enhance the system's ability to acquire domain-specific knowledge and generate solutions.
arXiv Detail & Related papers (2024-08-07T08:43:32Z) - ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency [20.33467627548677]
Large language models (LLMs) have surged in popularity and are extensively used in commercial applications.
We conduct a detailed analysis to identify major bottlenecks that impact end-to-end latency in LLM serving systems.
We then propose ScaleLLM, an optimized system for resource-efficient LLM serving.
arXiv Detail & Related papers (2024-07-23T23:37:29Z) - OptLLM: Optimal Assignment of Queries to Large Language Models [12.07164196530872]
We propose a framework for addressing the cost-effective query allocation problem for large language models (LLMs)
Our framework, named OptLLM, provides users with a range of optimal solutions to choose from, aligning with their budget constraints and performance preferences.
To evaluate the effectiveness of OptLLM, we conduct extensive experiments on various types of tasks, including text classification, question answering, sentiment analysis, reasoning, and log parsing.
arXiv Detail & Related papers (2024-05-24T01:05:37Z) - Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN)
At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself.
This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
arXiv Detail & Related papers (2024-01-02T18:53:13Z) - FederatedScope-LLM: A Comprehensive Package for Fine-tuning Large
Language Models in Federated Learning [70.38817963253034]
This paper first discusses these challenges of federated fine-tuning LLMs, and introduces our package FS-LLM as a main contribution.
We provide comprehensive federated parameter-efficient fine-tuning algorithm implementations and versatile programming interfaces for future extension in FL scenarios.
We conduct extensive experiments to validate the effectiveness of FS-LLM and benchmark advanced LLMs with state-of-the-art parameter-efficient fine-tuning algorithms in FL settings.
arXiv Detail & Related papers (2023-09-01T09:40:36Z) - Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes [54.13559879916708]
EVAPORATE is a prototype system powered by large language models (LLMs)
Code synthesis is cheap, but far less accurate than directly processing each document with the LLM.
We propose an extended code implementation, EVAPORATE-CODE+, which achieves better quality than direct extraction.
arXiv Detail & Related papers (2023-04-19T06:00:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.