DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models
- URL: http://arxiv.org/abs/2510.13847v2
- Date: Mon, 10 Nov 2025 12:41:15 GMT
- Title: DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models
- Authors: Jinbin Zhang, Nasib Ullah, Erik Schultheis, Rohit Babbar,
- Abstract summary: DynaSpec is a dynamic shortlisting mechanism that is robust, speeds up drafting, and generalizes across diverse tasks.<n>It delivers consistent improvements in mean accepted length, for Llama-3-8B, reaching upto 98.2% of full-vocabulary performance.<n>By leveraging context-dependent selection, DynaSpec achieves up to a 2.18 times increase in generated tokens compared to 1.91 times for fixed-vocabulary approaches.
- Score: 13.242009624334996
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speculative decoding has become a standard way to accelerate LLM inference: a small drafter proposes multiple tokens and a large target model verifies them once per speculation length. Recently, scaling of the LLM vocabulary has pushed the number of tokens to grow substantially. While verification over the full vocabulary leaves the target model largely unaffected, the O(|V|d) parameters in the drafter's output head become a latency bottleneck, slowing the entire pipeline. Contemporary methods (e.g., FR-Spec, VocabTrim) restrict the drafter's vocabulary to a fixed top frequent subset of the target model's vocabulary. Although this reduces draft-time compute, it is brittle, since: (i) frequency lists are corpus-dependent and require retuning to generalize, and (ii) static shortlists suppress rare or domain-specific tokens, lowering the expected number of tokens per verification step. We propose DynaSpec, a context-dependent dynamic shortlisting mechanism that is robust, speeds up drafting, and generalizes across diverse tasks. Concretely, we introduce lightweight, coarse-grained meta-classifiers that route contexts to a small number of token clusters; the union of the top-k selected clusters forms the drafter's shortlist, while verification retains the full vocabulary and exactness. The meta-classifier finishes its computation earlier than the drafter's hidden state generation by exploiting parallel execution of draft encoding and meta shortlisting on separate streams. Across standard speculative decoding benchmarks, DynaSpec delivers consistent improvements in mean accepted length, for Llama-3-8B, reaching upto 98.2% of full-vocabulary performance, while fixed-shortlist baselines attain only 84.4%. By leveraging context-dependent selection, DynaSpec achieves up to a 2.18 times increase in generated tokens compared to 1.91 times for fixed-vocabulary approaches.
Related papers
- Speculative Decoding with a Speculative Vocabulary [44.656073829954636]
Speculative decoding is a leading approach for accelerating language model (LM) inference.<n>Recent work has sought to address this output distribution bottleneck by reducing the vocabulary of the draft model.<n>We propose SpecVocab, an efficient and effective method that selects a vocabulary subset per decoding step.
arXiv Detail & Related papers (2026-02-14T16:10:00Z) - Fast Inference via Hierarchical Speculative Decoding [65.40448210801763]
We introduce Hierarchical Speculative Decoding (HSD), an algorithm that stacks draft models into a hierarchy, where each model proposes tokens, and the next larger model verifies them in a single forward pass.<n>HSD gives up to 1.2x speed-up over the best single-draft baseline.
arXiv Detail & Related papers (2025-10-22T15:56:19Z) - DiffuSpec: Unlocking Diffusion Language Models for Speculative Decoding [66.40658898418316]
We present DiffuSpec, a training-free drop-in framework that uses a pretrained diffusion language model (DLM) to produce multi-token drafts in a single forward pass.<n>Across benchmarks, DiffuSpec yields up to 3x wall-clock speedup, establishing diffusion-based drafting as a robust alternative to autoregressive drafters for speculative decoding.
arXiv Detail & Related papers (2025-09-28T07:00:15Z) - VOCABTRIM: Vocabulary Pruning for Efficient Speculative Decoding in LLMs [15.508475101753715]
We introduce a training-free technique to improve the performance of drafter-based speculative decoding (SpD) methods.<n>A drafter-based speculative decoding leverages one or more smaller language models, a.k.a. drafters or draft models, to sample a draft sequence or tree consisting of multiple tokens.<n>We show that our method can boost the memory-bound speed-up for Llama-3 models on Spec-Bench, specifically by 16% for Llama-3.2-3B-Instruct.
arXiv Detail & Related papers (2025-06-28T00:26:40Z) - Think Before You Accept: Semantic Reflective Verification for Faster Speculative Decoding [48.52389201779425]
Speculative decoding accelerates inference by generating multiple draft tokens using a lightweight model and verifying them in parallel.<n>Existing verification methods rely heavily on distributional consistency while overlooking semantic correctness.<n>We propose Reflective Verification, a training-free and semantics-aware approach that achieves a better trade-off between correctness and efficiency.
arXiv Detail & Related papers (2025-05-24T10:26:27Z) - FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling [59.8051705468084]
Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation process of large language models.<n>We present FR-Spec, a frequency-ranked speculative sampling framework that optimize draft candidate selection through vocabulary space compression.
arXiv Detail & Related papers (2025-02-20T18:58:10Z) - ParallelSpec: Parallel Drafter for Efficient Speculative Decoding [62.68430939686566]
We present ParallelSpec, an alternative to auto-regressive drafting strategies in state-of-the-art speculative decoding approaches.
In contrast to auto-regressive drafting in the speculative stage, we train a parallel drafter to serve as an efficient speculative model.
arXiv Detail & Related papers (2024-10-08T01:05:08Z) - PEARL: Parallel Speculative Decoding with Adaptive Draft Length [12.166703341906242]
We propose a conceptually simple, flexible, and general framework to boost speculative decoding, namely Parallel spEculative decoding with Adaptive dRaft Length (PEARL)<n>PEARL proposes pre-verify to verify the first draft token in advance during the drafting phase, and post-verify to generate more draft tokens during the verification phase.<n> Experiments on various text generation benchmarks demonstrate the effectiveness of our PEARL, leading to a superior speed up performance up to 4.43$times$ and 1.50$times$, compared to auto-regressive decoding and vanilla speculative decoding, respectively.
arXiv Detail & Related papers (2024-08-13T08:32:06Z) - Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding [65.94521678103237]
Speculative decoding is a widely used method that accelerates the generation process of large language models.
We introduce Ouroboros, which can generate draft phrases to parallelize the drafting process.
Ouroboros can achieve speedups of up to $2.8times$ over speculative decoding and $3.9times$ over vanilla decoding.
arXiv Detail & Related papers (2024-02-21T11:31:28Z) - Multi-Candidate Speculative Decoding [82.05519287513444]
Large language models have shown impressive capabilities across a variety of NLP tasks, yet their generating text autoregressively is time-consuming.
One way to speed them up is speculative decoding, which generates candidate segments from a fast draft model that is then verified in parallel by the target model.
This paper proposes sampling multiple candidates from a draft model and then organising them in batches for verification.
We design algorithms for efficient multi-candidate verification while maintaining the distribution of the target model.
arXiv Detail & Related papers (2024-01-12T17:15:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.