Speculative Decoding with a Speculative Vocabulary
- URL: http://arxiv.org/abs/2602.13836v1
- Date: Sat, 14 Feb 2026 16:10:00 GMT
- Title: Speculative Decoding with a Speculative Vocabulary
- Authors: Miles Williams, Young D. Kwon, Rui Li, Alexandros Kouris, Stylianos I. Venieris,
- Abstract summary: Speculative decoding is a leading approach for accelerating language model (LM) inference. Recent work has sought to address this output distribution bottleneck by reducing the vocabulary of the draft model. We propose SpecVocab, an efficient and effective method that selects a vocabulary subset per decoding step.
- Score: 44.656073829954636
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speculative decoding has rapidly emerged as a leading approach for accelerating language model (LM) inference, as it offers substantial speedups while yielding identical outputs. This relies upon a small draft model, tasked with predicting the outputs of the target model. State-of-the-art speculative decoding methods use a draft model consisting of a single decoder layer and output embedding matrix, with the latter dominating drafting time for the latest LMs. Recent work has sought to address this output distribution bottleneck by reducing the vocabulary of the draft model. Although this can improve throughput, it compromises speculation effectiveness when the target token is out-of-vocabulary. In this paper, we argue for vocabulary speculation as an alternative to a reduced vocabulary. We propose SpecVocab, an efficient and effective method that selects a vocabulary subset per decoding step. Across a variety of tasks, we demonstrate that SpecVocab can achieve a higher acceptance length than the state-of-the-art speculative decoding approach, EAGLE-3. Notably, this yields up to an 8.1% increase in average throughput over EAGLE-3.
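The abstract describes the mechanism only at a high level: a single-layer draft model whose output projection is restricted to a per-step vocabulary subset. The PyTorch sketch below illustrates that idea. The subset heuristic (a static frequent-token pool plus tokens from the recent context) and all names (`select_vocab_subset`, `draft_logits`, the sizes) are illustrative assumptions, not the paper's actual SpecVocab selection method.
```python
# Minimal sketch of per-step vocabulary speculation for a draft model.
# The subset-selection heuristic here is an assumption for illustration;
# the paper's abstract only states that a subset is chosen per step.
import torch

VOCAB_SIZE, HIDDEN, SUBSET = 32_000, 256, 2_048
emb_out = torch.randn(VOCAB_SIZE, HIDDEN)   # full output embedding matrix
frequent_pool = torch.arange(SUBSET // 2)   # stand-in for corpus-frequent tokens

def select_vocab_subset(context_ids: torch.Tensor) -> torch.Tensor:
    """Pick the token ids the draft model may emit at this step."""
    candidates = torch.cat([frequent_pool, context_ids.unique()])
    return candidates.unique()[:SUBSET]     # cap the subset size

def draft_logits(hidden: torch.Tensor, subset: torch.Tensor) -> torch.Tensor:
    """Project only onto the selected rows of the output embedding,
    so drafting cost scales with |subset| rather than |vocab|."""
    return hidden @ emb_out[subset].T       # (len(subset),) logits

context = torch.randint(0, VOCAB_SIZE, (64,))
hidden = torch.randn(HIDDEN)                # stand-in draft hidden state
subset = select_vocab_subset(context)
logits = draft_logits(hidden, subset)
draft_token = subset[logits.argmax()]       # map back to a full-vocab id
print(int(draft_token))
```
The point of the sketch is the cost model: projecting the hidden state onto `SUBSET` rows instead of all `VOCAB_SIZE` rows shrinks the matrix multiply that dominates drafting time, while verification by the target model over the full vocabulary keeps outputs lossless.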
Related papers
- Balancing Coverage and Draft Latency in Vocabulary Trimming for Faster Speculative Decoding [1.14219428942199]
We improve speculative decoding throughput while reducing draft vocabularies by up to 97% with high coverage. On domain-specific tasks, we achieve up to 16% latency reduction and 20% throughput improvement, and up to 6.7% throughput gains on diverse out-of-distribution tasks.
arXiv Detail & Related papers (2026-03-05T14:20:22Z) - VOCABTRIM: Vocabulary Pruning for Efficient Speculative Decoding in LLMs [15.508475101753715]
We introduce a training-free technique to improve the performance of drafter-based speculative decoding (SpD) methods. Drafter-based speculative decoding leverages one or more smaller language models, a.k.a. drafters or draft models, to sample a draft sequence or tree consisting of multiple tokens. We show that our method can boost the memory-bound speed-up for Llama-3 models on Spec-Bench, specifically by 16% for Llama-3.2-3B-Instruct.
arXiv Detail & Related papers (2025-06-28T00:26:40Z) - Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding [53.82301522384719]
We propose Dimple, the first Discrete Diffusion Multimodal Large Language Model (DMLLM). We design a novel training paradigm that combines an initial autoregressive phase with a subsequent diffusion phase. Dimple-7B surpasses LLaVA-NEXT in performance by 3.9%, demonstrating that DMLLM can achieve performance comparable to that of autoregressive models.
arXiv Detail & Related papers (2025-05-22T17:55:04Z) - FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling [59.8051705468084]
Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation process of large language models. We present FR-Spec, a frequency-ranked speculative sampling framework that optimizes draft candidate selection through vocabulary space compression. (A minimal sketch of this frequency-ranked trimming idea appears after this list.)
arXiv Detail & Related papers (2025-02-20T18:58:10Z) - FastDraft: How to Train Your Draft [1.187801290987671]
We introduce FastDraft, a novel and efficient approach for pre-training and aligning a draft model to any large language model. We demonstrate FastDraft by training two highly parameter-efficient drafts for the popular Phi-3-mini and Llama-3.1-8B models. Using FastDraft, we were able to produce a draft model trained on approximately 10 billion tokens on a single server with 8 Intel® Gaudi® 2 accelerators in under 24 hours.
arXiv Detail & Related papers (2024-11-17T12:32:44Z) - Graph-Structured Speculative Decoding [52.94367724136063]
Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models.
We introduce an innovative approach utilizing a directed acyclic graph (DAG) to manage the drafted hypotheses.
We observe a remarkable speedup of 1.73× to 1.96×, significantly surpassing standard speculative decoding.
arXiv Detail & Related papers (2024-07-23T06:21:24Z) - Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens [15.566726645722657]
We propose a novel framework specifically designed for speculative sampling.
Within this framework, we introduce a lightweight draft model that effectively utilizes previously generated tokens to predict subsequent words.
We demonstrate impressive results, achieving an average latency speedup ratio of 2.7x compared to the vanilla auto-regressive decoding approach.
arXiv Detail & Related papers (2024-02-24T08:10:39Z) - Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding [65.94521678103237]
Speculative decoding is a widely used method that accelerates the generation process of large language models.
We introduce Ouroboros, which can generate draft phrases to parallelize the drafting process.
Ouroboros can achieve speedups of up to 2.8× over speculative decoding and 3.9× over vanilla decoding.
arXiv Detail & Related papers (2024-02-21T11:31:28Z) - Multi-Candidate Speculative Decoding [82.05519287513444]
Large language models have shown impressive capabilities across a variety of NLP tasks, yet generating text autoregressively is time-consuming.
One way to speed them up is speculative decoding, which generates candidate segments from a fast draft model that is then verified in parallel by the target model.
This paper proposes sampling multiple candidates from a draft model and then organising them in batches for verification.
We design algorithms for efficient multi-candidate verification while maintaining the distribution of the target model.
arXiv Detail & Related papers (2024-01-12T17:15:23Z)
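Two of the entries above (VOCABTRIM and FR-Spec) share the idea of statically trimming the draft vocabulary by token frequency. Below is a minimal sketch of that shared idea under assumed names and toy frequency counts; nothing here comes from either paper's code.
```python
# Minimal sketch of frequency-ranked vocabulary trimming: keep only the
# rows of the draft model's LM head for the most frequent tokens.
# The names, toy counts, and keep size are illustrative assumptions.
from collections import Counter
import torch

def trim_lm_head(lm_head: torch.Tensor, token_counts: Counter, keep: int):
    """Return the trimmed head and the kept-token id mapping."""
    kept_ids = torch.tensor([t for t, _ in token_counts.most_common(keep)])
    return lm_head[kept_ids], kept_ids    # (keep, hidden), (keep,)

VOCAB, HIDDEN = 32_000, 256
lm_head = torch.randn(VOCAB, HIDDEN)
counts = Counter({t: VOCAB - t for t in range(VOCAB)})  # toy: id 0 most frequent
small_head, kept_ids = trim_lm_head(lm_head, counts, keep=8_000)

hidden = torch.randn(HIDDEN)              # stand-in draft hidden state
logits = hidden @ small_head.T            # drafting over 25% of the vocab
draft_token = kept_ids[logits.argmax()]   # map back to the full-vocab id
print(int(draft_token))
```
The trade-off that the SpecVocab abstract targets is visible here: any target token outside `kept_ids` can never be drafted, so a static trim buys drafting speed at the cost of coverage, which per-step vocabulary speculation tries to recover.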