Speculative Decoding: Performance or Illusion?
- URL: http://arxiv.org/abs/2601.11580v1
- Date: Wed, 31 Dec 2025 20:31:36 GMT
- Title: Speculative Decoding: Performance or Illusion?
- Authors: Xiaoxuan Liu, Jiaxiang Yu, Jongseok Park, Ion Stoica, Alvin Cheung,
- Abstract summary: We present the first systematic study of speculative decoding (SD) on a production-grade and widely deployed inference engine (vLLM). We analyze key factors governing SD performance and quantify a theoretical upper bound on SD speedup. Our results show that verification by the target model dominates execution, while acceptance length varies markedly across output token positions, requests, and datasets.
- Score: 35.22216866848279
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speculative decoding (SD) has become a popular technique to accelerate Large Language Model (LLM) inference, yet its real-world effectiveness remains unclear as prior evaluations rely on research prototypes and unrealistically small batch sizes. We present, to our knowledge, the first systematic study of SD on a production-grade and widely deployed inference engine (vLLM), covering multiple SD variants ($n$-gram, EAGLE/EAGLE-3, Draft-Model, Multi-Token Prediction) across diverse workloads, model scales, and batch sizes. We analyze key factors governing SD performance, and quantify a theoretical upper bound on SD speedup. Our results show that verification by the target model dominates execution, while acceptance length varies markedly across output token positions, requests, and datasets. Comparing measured performance against these theoretical upper bounds reveals substantial gaps, and we leverage this observation to highlight new research opportunities that our study opens up for improving SD.
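The quantities the abstract centers on (acceptance length, draft and verification costs, and the resulting speedup) can be related with a small back-of-the-envelope model. The sketch below is a hedged illustration under assumed cost accounting, not the paper's actual bound; the function name, parameters, and default values are all hypothetical.

```python
# Back-of-the-envelope model of speculative-decoding speedup.
# Hedged illustration only: the cost model and the numbers below are assumptions,
# not the formula or measurements from the paper.

def sd_speedup_estimate(
    mean_accepted: float,   # average number of draft tokens accepted per SD step
    draft_len: int,         # number of tokens proposed by the draft per step (k)
    draft_cost: float,      # cost of one draft forward pass, relative to one target pass = 1.0
    verify_cost: float,     # cost of one target verification pass over the k drafted tokens
) -> float:
    """Estimated speedup over plain autoregressive decoding.

    Per SD step: the target emits roughly `mean_accepted + 1` tokens (accepted
    drafts plus one token from the target's own distribution) at a cost of
    `draft_len * draft_cost + verify_cost`. Plain decoding would spend one
    target pass (cost 1.0) per generated token.
    """
    tokens_per_step = mean_accepted + 1.0
    cost_per_step = draft_len * draft_cost + verify_cost
    return tokens_per_step / cost_per_step


if __name__ == "__main__":
    # Hypothetical numbers: a cheap drafter (5% of target cost), 4-token drafts,
    # verification about as expensive as one target decode step.
    print(sd_speedup_estimate(mean_accepted=2.5, draft_len=4,
                              draft_cost=0.05, verify_cost=1.0))
    # If verification dominates (e.g., large batches make it ~1.5x a decode step),
    # the same acceptance length yields a noticeably smaller speedup.
    print(sd_speedup_estimate(mean_accepted=2.5, draft_len=4,
                              draft_cost=0.05, verify_cost=1.5))
```

In this toy model, raising the mean acceptance length or cutting the verification cost both push the estimate toward its upper bound, which is consistent with the abstract's emphasis on verification dominating execution and acceptance length varying across positions, requests, and datasets.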
Related papers
- How Efficient Are Diffusion Language Models? A Critical Examination of Efficiency Evaluation Practices [81.85465545346266]
Diffusion language models (DLMs) have emerged as a promising alternative to the long-dominant autoregressive (AR) paradigm. Yet, current open-source DLMs often underperform their AR counterparts in speed, limiting their real-world utility. This work presents a systematic study of DLM efficiency, identifying key issues in prior evaluation methods.
arXiv Detail & Related papers (2025-10-21T10:00:32Z) - Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs [78.09559830840595]
We present the first systematic study on quantizing diffusion-based language models. We identify the presence of activation outliers, characterized by abnormally large activation values. We implement state-of-the-art PTQ methods and conduct a comprehensive evaluation.
arXiv Detail & Related papers (2025-08-20T17:59:51Z) - Consultant Decoding: Yet Another Synergistic Mechanism [49.996656694586164]
Consultant Decoding (CD) verifies candidate drafts using token-level likelihoods computed solely by the large language model. CD achieves up to a 2.5-fold increase in inference speed compared to the target model, while maintaining comparable generation quality.
arXiv Detail & Related papers (2025-06-03T03:13:27Z) - MoESD: Unveil Speculative Decoding's Potential for Accelerating Sparse MoE [14.345207231093722]
Speculative decoding (SD) is a widely used technique to accelerate LLM inference without accuracy loss. We show that under medium batch sizes, MoE surprisingly benefits more from SD than dense models. We introduce a new metric, 'target efficiency', that characterizes these effects.
arXiv Detail & Related papers (2025-05-26T08:01:45Z) - Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning [108.07030347318624]
We show that scaling with longer Chains of Thought (CoTs) can indeed impair the reasoning performance of Large Language Models (LLMs) in certain domains. We propose a Thinking-Optimal Scaling strategy to teach models to adopt different reasoning efforts for deep thinking. Our self-improvement models built upon Qwen2.5-32B-Instruct outperform other distillation-based 32B o1-like models across various math benchmarks.
arXiv Detail & Related papers (2025-02-25T10:48:05Z) - The Effect of Optimal Self-Distillation in Noisy Gaussian Mixture Model [2.355460994057843]
Self-distillation (SD) has attracted attention as a simple yet powerful approach in machine learning. Despite its widespread use, the mechanisms underlying its effectiveness remain unclear.
arXiv Detail & Related papers (2025-01-27T17:20:48Z) - UNEM: UNrolled Generalized EM for Transductive Few-Shot Learning [35.62208317531141]
We advocate and introduce the unrolling paradigm, also referred to as "learning to optimize". Our unrolling approach covers various statistical feature distributions and pre-training paradigms. We report comprehensive experiments, which cover a breadth of fine-grained downstream image classification tasks.
arXiv Detail & Related papers (2024-12-21T19:01:57Z) - USDRL: Unified Skeleton-Based Dense Representation Learning with Multi-Grained Feature Decorrelation [24.90512145836643]
We introduce a Unified Skeleton-based Dense Representation Learning framework based on feature decorrelation. We show that our approach significantly outperforms the current state-of-the-art (SOTA) approaches.
arXiv Detail & Related papers (2024-12-12T12:20:27Z) - An Empirical Study on Distribution Shift Robustness From the Perspective of Pre-Training and Data Augmentation [91.62129090006745]
This paper studies the distribution shift problem from the perspective of pre-training and data augmentation.
We provide the first comprehensive empirical study focusing on pre-training and data augmentation.
arXiv Detail & Related papers (2022-05-25T13:04:53Z)