Training-Free Loosely Speculative Decoding: Accepting Semantically Correct Drafts Beyond Exact Match
- URL: http://arxiv.org/abs/2511.22972v2
- Date: Fri, 05 Dec 2025 03:21:37 GMT
- Title: Training-Free Loosely Speculative Decoding: Accepting Semantically Correct Drafts Beyond Exact Match
- Authors: Jinze Li, Yixing Xu, Guanchen Li, Shuo Yang, Jinfeng Xu, Xuanwu Yin, Dong Li, Edith C. H. Ngai, Emad Barsoum
- Abstract summary: Training-Free Loosely Speculative Decoding (FLy) is a novel method that loosens the rigid verification criterion. We show that FLy preserves more than 99% of the target model's accuracy while achieving an average 2.81x speedup.
- Score: 21.810129153556044
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) achieve strong performance across diverse tasks but suffer from high inference latency due to their autoregressive generation. Speculative Decoding (SPD) mitigates this issue by verifying candidate tokens in parallel from a smaller draft model, yet its strict exact-match verification discards many semantically valid continuations. Moreover, existing training-based SPD methods often suffer from performance degradation on out-of-distribution (OOD) tasks. To this end, we propose Training-Free Loosely Speculative Decoding (FLy), a novel method that loosens the rigid verification criterion by leveraging the target model's self-corrective behavior to judge whether a draft-target mismatch remains semantically valid. FLy introduces a two-tier mechanism: an entropy-level gate that identifies whether the current token allows multiple plausible alternatives or is nearly deterministic, and a token-level deferred window that distinguishes genuine errors from differently worded yet semantically correct variants. To further reduce latency, we design a multi-level acceleration strategy that accelerates not only the target model but also the drafter itself. Owing to its training-free design, FLy composes seamlessly with arbitrary draft-target pairs and generalizes across models and domains without hyperparameter re-tuning. Experiments show that FLy preserves more than 99% of the target model's accuracy while achieving an average 2.81x speedup on Llama-3.1-70B-Instruct and 5.07x speedup on the 405B variant. Notably, on out-of-domain datasets, our method remains highly effective and outperforms the training-based method EAGLE-3 by 1.62x.
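The two-tier mechanism described in the abstract can be sketched roughly as follows. This is a hypothetical illustration, not the paper's actual algorithm: the entropy gate is approximated by a threshold `tau` on the target distribution's entropy, the token-level deferred window is approximated by a simple top-k membership test, and both parameter values are invented for the example.

```python
import math

def entropy(probs):
    # Shannon entropy (in nats) of a next-token distribution
    return -sum(p * math.log(p) for p in probs if p > 0)

def loose_accept(draft_id, target_probs, tau=0.5, top_k=3):
    """Hypothetical two-tier loose verification:
    - tier 1 (entropy gate): when the target distribution is nearly
      deterministic (entropy < tau), fall back to strict exact-match
      against the target argmax, as in standard speculative decoding;
    - tier 2 (loose check): otherwise treat the position as admitting
      multiple plausible continuations and accept any draft token in
      the target's top-k (a stand-in for FLy's token-level deferred
      window, whose exact rule the abstract does not specify).
    """
    argmax_id = max(range(len(target_probs)), key=target_probs.__getitem__)
    if entropy(target_probs) < tau:
        return draft_id == argmax_id           # tier 1: strict match
    ranked = sorted(range(len(target_probs)),
                    key=target_probs.__getitem__, reverse=True)
    return draft_id in ranked[:top_k]          # tier 2: loose match
```

On a peaked distribution the check degenerates to exact match, so the lossless behavior of standard verification is preserved exactly where the target model is confident; only high-entropy positions are relaxed.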
Related papers
- KnapSpec: Self-Speculative Decoding via Adaptive Layer Selection as a Knapsack Problem [12.668341559890605]
We propose KnapSpec, a training-free framework that reformulates draft model selection as a knapsack problem to maximize tokens-per-time throughput. We provide the first rigorous theoretical analysis establishing cosine similarity between hidden states as a mathematically sound proxy for the token acceptance rate. Our experiments on Qwen3 and Llama3 demonstrate that KnapSpec consistently outperforms state-of-the-art baselines.
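The knapsack framing above can be made concrete with a minimal 0/1 knapsack sketch: candidate layers carry a value (e.g. a cosine-similarity-based acceptance proxy, per the summary) and an integer latency cost, and we select a subset under a latency budget. The dynamic program and all names here are illustrative assumptions, not KnapSpec's implementation.

```python
def knapsack_select(values, costs, budget):
    """0/1 knapsack over candidate layers.
    values: per-layer proxy scores (floats); costs: integer latencies;
    budget: total latency allowance. Returns (best_value, chosen_indices).
    """
    n = len(values)
    dp = [0.0] * (budget + 1)                      # dp[b] = best value at cost b
    keep = [[False] * (budget + 1) for _ in range(n)]
    for i in range(n):
        for b in range(budget, costs[i] - 1, -1):  # iterate budgets downward
            if dp[b - costs[i]] + values[i] > dp[b]:
                dp[b] = dp[b - costs[i]] + values[i]
                keep[i][b] = True
    chosen, b = [], budget                          # backtrack chosen layers
    for i in range(n - 1, -1, -1):
        if keep[i][b]:
            chosen.append(i)
            b -= costs[i]
    return dp[budget], sorted(chosen)
```

For example, with proxy values `[0.6, 0.5, 0.4]`, costs `[3, 2, 2]`, and budget 4, the best subset is layers 1 and 2 (total value 0.9), not the single highest-value layer.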
arXiv Detail & Related papers (2026-02-23T08:13:03Z) - PACER: Blockwise Pre-verification for Speculative Decoding with Adaptive Length [21.738896310075678]
Speculative decoding (SD) is a powerful technique for accelerating the inference process of large language models (LLMs). We propose PACER, a novel approach that dynamically controls draft length using a lightweight, trainable pre-verification layer. Our results demonstrate that PACER achieves up to 2.66x speedup over autoregressive decoding and consistently outperforms standard speculative decoding.
arXiv Detail & Related papers (2026-02-01T15:12:38Z) - PROMISE: Process Reward Models Unlock Test-Time Scaling Laws in Generative Recommendations [52.67948063133533]
Generative Recommendation has emerged as a promising paradigm, reformulating recommendation as a sequence-to-sequence generation task over hierarchical Semantic IDs. Existing methods suffer from a critical issue we term Semantic Drift, where errors in early, high-level tokens irreversibly divert the generation trajectory into irrelevant semantic subspaces. We propose PROMISE, a novel framework that integrates dense, step-by-step verification into generative models.
arXiv Detail & Related papers (2026-01-08T07:38:46Z) - AdaSD: Adaptive Speculative Decoding for Efficient Language Model Inference [1.1852406625172216]
We propose Adaptive Speculative Decoding (AdaSD) for large language models (LLMs). AdaSD dynamically adjusts generation length and acceptance criteria during inference. Experiments on benchmark datasets demonstrate that AdaSD achieves up to 49% speedup over standard speculative decoding.
arXiv Detail & Related papers (2025-12-12T04:56:08Z) - Arbitrage: Efficient Reasoning via Advantage-Aware Speculation [71.45710345765528]
Speculative Decoding accelerates inference by employing a fast but inaccurate draft model to autoregressively propose tokens. However, traditional token-level Speculative Decoding struggles on reasoning tasks due to unnecessary rejections caused by token mismatches in semantically equivalent steps. We propose Arbitrage, a novel step-level speculative generation framework that routes generation dynamically based on the relative advantage between draft and target models.
arXiv Detail & Related papers (2025-12-04T17:50:53Z) - SpecDiff-2: Scaling Diffusion Drafter Alignment For Faster Speculative Decoding [48.96349422252313]
Speculative decoding has become the standard approach for accelerating Large Language Model (LLM) inference. It exploits a lossless draft-then-verify procedure to circumvent the latency of autoregressive decoding, achieving impressive speed-ups. This paper proposes SpecDiff-2, a novel framework to jointly address these two bottlenecks.
arXiv Detail & Related papers (2025-11-01T16:12:56Z) - Fast Inference via Hierarchical Speculative Decoding [65.40448210801763]
We introduce Hierarchical Speculative Decoding (HSD), an algorithm that stacks draft models into a hierarchy, where each model proposes tokens, and the next larger model verifies them in a single forward pass. HSD gives up to 1.2x speed-up over the best single-draft baseline.
arXiv Detail & Related papers (2025-10-22T15:56:19Z) - DiffuSpec: Unlocking Diffusion Language Models for Speculative Decoding [66.40658898418316]
We present DiffuSpec, a training-free drop-in framework that uses a pretrained diffusion language model (DLM) to produce multi-token drafts in a single forward pass. Across benchmarks, DiffuSpec yields up to 3x wall-clock speedup, establishing diffusion-based drafting as a robust alternative to autoregressive drafters for speculative decoding.
arXiv Detail & Related papers (2025-09-28T07:00:15Z) - CARD: A Cache-Assisted Parallel Speculative Decoding Framework via Query-and-Correct Paradigm for Accelerating LLM Inference [14.527697328189362]
We propose a speculative decoding framework called CARD, which employs a novel query-and-correct paradigm. Our approach decouples drafting from verification, effectively leveraging the draft model's efficiency without additional fine-tuning. CARD significantly outperforms existing state-of-the-art methods, achieving up to a 4.83x acceleration over vanilla autoregressive decoding.
arXiv Detail & Related papers (2025-08-06T14:02:10Z) - Think Before You Accept: Semantic Reflective Verification for Faster Speculative Decoding [48.52389201779425]
Speculative decoding accelerates inference by generating multiple draft tokens using a lightweight model and verifying them in parallel. Existing verification methods rely heavily on distributional consistency while overlooking semantic correctness. We propose Reflective Verification, a training-free and semantics-aware approach that achieves a better trade-off between correctness and efficiency.
arXiv Detail & Related papers (2025-05-24T10:26:27Z) - Automatic Task Detection and Heterogeneous LLM Speculative Decoding [1.0485739694839669]
We propose a speculative decoding algorithm tailored for downstream task optimization. It includes an automatic task partitioning and assigning method, which automatically categorizes downstream tasks into different sub-tasks. Experimental results demonstrate that the proposed method improves draft accuracy by 6% to 50% over vanilla speculative decoding.
arXiv Detail & Related papers (2025-05-13T14:16:12Z) - GRIFFIN: Effective Token Alignment for Faster Speculative Decoding [33.26750782762635]
GRIFFIN is a novel framework that incorporates a token-alignable training strategy and a token-alignable draft model. Experiments on LLaMA, Vicuna, Qwen and Mixtral models demonstrate that GRIFFIN achieves an average acceptance length improvement of over 8% and a speedup ratio exceeding 7%.
arXiv Detail & Related papers (2025-02-16T07:06:00Z) - Judge Decoding: Faster Speculative Sampling Requires Going Beyond Model Alignment [25.988070517700848]
Speculative decoding has been proposed as a technique to accelerate autoregressive generation. We show that even powerful draft models such as GPT-4o, as well as human-written text, cannot achieve high acceptance rates. We ask the following question: can we adapt verification to recognize correct but non-aligned replies?
arXiv Detail & Related papers (2025-01-31T17:09:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and accepts no responsibility for any consequences arising from its use.