TurboSpec: Closed-loop Speculation Control System for Optimizing LLM Serving Goodput
- URL: http://arxiv.org/abs/2406.14066v3
- Date: Sun, 27 Jul 2025 03:20:41 GMT
- Title: TurboSpec: Closed-loop Speculation Control System for Optimizing LLM Serving Goodput
- Authors: Xiaoxuan Liu, Jongseok Park, Langxiang Hu, Woosuk Kwon, Zhuohan Li, Chen Zhang, Kuntai Du, Xiangxi Mo, Kaichao You, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang
- Abstract summary: Large Language Model (LLM) serving systems batch concurrent user requests to achieve efficient serving. We present TurboSpec, a speculation control system that automatically profiles the execution environment. We demonstrate its effectiveness across diverse workloads and hardware configurations.
- Score: 37.56866491624234
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Model (LLM) serving systems batch concurrent user requests to achieve efficient serving. However, in real-world deployments, such inter-request parallelism from batching is often limited by external factors such as low request rates or memory constraints. Recent works focus on intra-request parallelism from speculative decoding as a solution to this problem. Unfortunately, benefits from intra-request parallelism are often fragile, as speculative decoding adds overhead and speculated tokens may miss. We observe that speculative decoding may degrade LLM serving performance if added naively without tuning to the incoming requests and the speculation method. To alleviate the need for expert tuning and make speculative decoding more robust, we present TurboSpec, a speculation control system that automatically profiles the execution environment and uses a feedback-based algorithm to dynamically adjust the amount of intra-request parallelism in LLM serving. TurboSpec predicts "goodput", the amount of successfully generated tokens, and at runtime sets the amount of intra-request parallelism to the level with the highest predicted goodput. We implement TurboSpec on vLLM, a real-world LLM serving system, and demonstrate its effectiveness across diverse workloads and hardware configurations, providing consistent performance improvements across all test scenarios.
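The goodput-driven control loop described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from TurboSpec or vLLM: the linear latency model `CostModel`, the independence assumption behind `expected_tokens` (each drafted token accepted with the same probability `alpha`), the moving-average feedback in `update_acceptance`, and all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CostModel:
    """Hypothetical linear latency model in seconds (an assumption,
    standing in for whatever a real profiler would measure)."""
    draft_per_token: float   # time for the draft model to propose one token
    verify_base: float       # fixed cost of one target-model forward pass
    verify_per_token: float  # extra cost per token in the verification pass

    def step_time(self, batch_size: int, k: int) -> float:
        # k sequential draft steps, then one verification pass over
        # batch_size * (k + 1) tokens.
        return (k * self.draft_per_token
                + self.verify_base
                + self.verify_per_token * batch_size * (k + 1))


def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens generated per request per step, assuming each drafted
    token is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def goodput(cost: CostModel, batch_size: int, alpha: float, k: int) -> float:
    """Predicted successfully generated tokens per second."""
    return batch_size * expected_tokens(alpha, k) / cost.step_time(batch_size, k)


def best_speculation_length(cost: CostModel, batch_size: int,
                            alpha: float, max_k: int = 8) -> int:
    """Pick the speculation length (0 = no speculation) with the highest
    predicted goodput."""
    return max(range(max_k + 1),
               key=lambda k: goodput(cost, batch_size, alpha, k))


def update_acceptance(alpha: float, accepted: int, proposed: int,
                      smoothing: float = 0.2) -> float:
    """Feedback step: blend the observed acceptance ratio of the last step
    into the running estimate (exponential moving average)."""
    if proposed == 0:
        return alpha
    return (1.0 - smoothing) * alpha + smoothing * accepted / proposed
```

Under this toy model, a serving loop would call `best_speculation_length` before each step with the current batch size and acceptance estimate, then call `update_acceptance` with the observed counts. With a large fixed verification cost and a small batch the chosen length is positive, while a large batch or a low acceptance rate pushes it back to zero, which mirrors the fragility the abstract describes.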
Related papers
- Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding [51.711605076319216]
Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities. We introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop. We propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality.
arXiv Detail & Related papers (2025-05-28T17:39:15Z)
- Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition [95.54406667705999]
Pangu Embedded is an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs). It addresses the significant computational costs and inference latency challenges prevalent in existing reasoning-optimized LLMs. It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture.
arXiv Detail & Related papers (2025-05-28T14:03:02Z)
- Semi-Clairvoyant Scheduling of Speculative Decoding Requests to Minimize LLM Inference Latency [4.372762934308627]
We propose a semi-clairvoyant request scheduling algorithm called Least-Attained/Perceived-Service for Speculative Decoding (LAPS-SD). LAPS-SD can effectively minimize average inference latency by adaptively scheduling requests according to their features during decoding. Experiments show that LAPS-SD reduces inference latency by approximately 39% compared to state-of-the-art scheduling methods.
arXiv Detail & Related papers (2025-05-20T04:12:37Z)
- PipeSpec: Breaking Stage Dependencies in Hierarchical LLM Decoding [4.734824660843965]
PipeSpec is a framework that generalizes speculative decoding to $k$ models arranged in a hierarchical pipeline. We show that PipeSpec achieves up to 2.54$\times$ speedup while outperforming state-of-the-art methods.
arXiv Detail & Related papers (2025-05-02T20:29:31Z)
- Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints [14.341123057506827]
Large Language Models (LLMs) are indispensable in today's applications, but their inference procedure demands significant computational resources. This paper formulates LLM inference optimization as a multi-stage online scheduling problem. We develop a fluid dynamics approximation to provide a tractable benchmark that guides algorithm design.
arXiv Detail & Related papers (2025-04-15T16:00:21Z)
- SpecServe: Efficient and SLO-Aware Large Language Model Serving with Adaptive Speculative Decoding [18.45994543035372]
Speculative decoding has emerged as a compelling technique to accelerate Large Language Model inference.
Existing speculative decoding solutions often fail to adapt to varying workloads and system environments.
We introduce SpecServe, an efficient LLM inference system that dynamically adjusts speculative strategies according to real-time request loads.
arXiv Detail & Related papers (2025-03-07T02:27:51Z)
- TrimLLM: Progressive Layer Dropping for Domain-Specific LLMs [11.615399679746675]
Specializing large language models (LLMs) for local deployment in domain-specific use cases is necessary for strong performance.
We develop TrimLLM based on the layer-wise specialization phenomenon we empirically observed and verified on contemporary LLMs.
We show that it retains LLMs' capacity in specific domains and achieves inference speedup irrespective of hardware and deep learning frameworks.
arXiv Detail & Related papers (2024-12-15T16:47:16Z)
- Multi-Bin Batching for Increasing LLM Inference Throughput [19.652542432683234]
As large language models (LLMs) grow in popularity, improving the efficiency of their systems becomes increasingly important. Batching requests is a critical step in scheduling jobs on servers. Requests often have varying generation lengths, causing resource underutilization. We formalize this problem from a queueing-theoretic perspective, and aim to design a throughput control policy.
arXiv Detail & Related papers (2024-12-03T03:16:12Z)
- COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks.
We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome these challenges.
Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z)
- SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration [10.970637831760136]
Speculative decoding (SD) has emerged as a widely used paradigm to accelerate the inference of large language models (LLMs).
We introduce SWIFT, an on-the-fly self-speculative decoding algorithm that adaptively selects intermediate layers of LLMs to skip during inference.
We show that SWIFT can achieve a 1.3x-1.6x speedup while preserving the original distribution of the generated text.
arXiv Detail & Related papers (2024-10-09T14:15:30Z)
- ParallelSpec: Parallel Drafter for Efficient Speculative Decoding [62.68430939686566]
We present ParallelSpec, an alternative to auto-regressive drafting strategies in state-of-the-art speculative decoding approaches.
In contrast to auto-regressive drafting in the speculative stage, we train a parallel drafter to serve as an efficient speculative model.
arXiv Detail & Related papers (2024-10-08T01:05:08Z)
- Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities.
In-Context Learning (ICL) and Parameter-Efficient Fine-Tuning (PEFT) are currently two mainstream methods for adapting LLMs to downstream tasks.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z)
- Adaptive Draft-Verification for Efficient Large Language Model Decoding [24.347886232342862]
Large language model (LLM) decoding involves generating a sequence of tokens based on a given context.
The typical autoregressive decoding method requires a separate forward pass through the model for each token generated.
We introduce ADED, which accelerates LLM decoding without requiring fine-tuning.
arXiv Detail & Related papers (2024-06-27T22:20:39Z)
- LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit [55.73370804397226]
Quantization, a key compression technique, can effectively mitigate these demands by compressing and accelerating large language models.
We present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization.
Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats.
arXiv Detail & Related papers (2024-05-09T11:49:05Z)
- Speculative Streaming: Fast LLM Inference without Auxiliary Models [21.454206732725563]
Speculative Streaming is a single-model speculative decoding method.
It fuses drafting into the target model by changing the fine-tuning objective from next token prediction to future n-gram prediction.
It speeds up decoding by 1.8 - 3.1X in a diverse set of tasks.
arXiv Detail & Related papers (2024-02-16T23:36:43Z)
- Extreme Compression of Large Language Models via Additive Quantization [59.3122859349777]
Our algorithm, called AQLM, generalizes the classic Additive Quantization (AQ) approach for information retrieval.
We provide fast GPU and CPU implementations of AQLM for token generation, which enable us to match or outperform optimized FP16 implementations for speed.
arXiv Detail & Related papers (2024-01-11T18:54:44Z)
- Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster [61.83949316226113]
FastCoT is a model-agnostic framework based on parallel decoding.
We show that FastCoT saves inference time by nearly 20% with only a negligible performance drop compared to the regular approach.
arXiv Detail & Related papers (2023-11-14T15:56:18Z)
- The Synergy of Speculative Decoding and Batching in Serving Large Language Models [3.3849225405083336]
We propose a new speculative decoding strategy that chooses the optimal speculation length for different batch sizes.
Our evaluations show that our proposed method can achieve equal or better performance than the state-of-the-art speculation decoding schemes with fixed speculation length.
arXiv Detail & Related papers (2023-10-28T20:36:36Z)
- DistillSpec: Improving Speculative Decoding via Knowledge Distillation [70.61777015900272]
Speculative decoding (SD) accelerates large language model inference by employing a faster draft model for generating multiple tokens.
We propose DistillSpec that uses knowledge distillation to better align the draft model with the target model, before applying SD.
We show that DistillSpec yields impressive 10 - 45% speedups over standard SD on a range of standard benchmarks.
arXiv Detail & Related papers (2023-10-12T16:21:04Z)
- SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification [13.174386920965107]
SpecInfer is a system that accelerates generative large language model (LLM) serving with tree-based speculative inference and verification.
The correctness of all candidate token sequences represented by a token tree is verified against the LLM in parallel using a novel tree-based parallel decoding mechanism.
arXiv Detail & Related papers (2023-05-16T20:12:59Z)
- Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation [80.2267931231335]
We propose Speculative Decoding (SpecDec) to study exploiting the idea of speculative execution to accelerate autoregressive (AR) decoding.
SpecDec has two innovations: Spec-Drafter -- an independent model specially optimized for efficient drafting, and Spec-Verification -- a reliable method for verifying the drafted tokens efficiently.
arXiv Detail & Related papers (2022-03-30T17:27:09Z)