Double: Breaking the Acceleration Limit via Double Retrieval Speculative Parallelism
- URL: http://arxiv.org/abs/2601.05524v1
- Date: Fri, 09 Jan 2026 04:35:21 GMT
- Title: Double: Breaking the Acceleration Limit via Double Retrieval Speculative Parallelism
- Authors: Yuhao Shen, Tianyu Liu, Junyi Shen, Jinyang Wu, Quan Kong, Li Huan, Cong Wang,
- Abstract summary: We introduce \textsc{Double} (Double Retrieval Speculative Parallelism). We enable the draft model to execute iterative retrieval speculations to break the theoretical speedup limits. Experiments demonstrate state-of-the-art speedups of $5.3\times$ on LLaMA3.3-70B and $2.8\times$ on Qwen3-32B.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Parallel Speculative Decoding (PSD) accelerates traditional Speculative Decoding (SD) by overlapping draft generation with verification. However, it remains hampered by two fundamental challenges: (1) a theoretical speedup ceiling dictated by the speed ratio between the draft and target models, and (2) high computational waste and pipeline stall due to mid-sequence token rejections of early errors. To address these limitations, we introduce \textsc{Double} (Double Retrieval Speculative Parallelism). By bridging the gap between SD and PSD, our framework resolves the Retrieval \emph{Precision-Efficiency Dilemma} through a novel synchronous mechanism. Specifically, we enable the draft model to execute iterative retrieval speculations to break the theoretical speedup limits; to alleviate rejections without rollback, the target model performs authoritative retrieval to generate multi-token guidance. \textsc{Double} is entirely training-free and lossless. Extensive experiments demonstrate state-of-the-art speedup of $\textbf{5.3}\times$ on LLaMA3.3-70B and $\textbf{2.8}\times$ on Qwen3-32B, significantly outperforming the advanced method EAGLE-3 that requires extensive model training.
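The "theoretical speedup ceiling" the abstract refers to can be made concrete with a small illustrative sketch (not from the paper). Under the standard speculative-decoding analysis, with a hypothetical per-token acceptance rate `alpha`, draft/target cost ratio `c`, and draft length `k`, sequential SD's expected speedup follows the usual draft-then-verify formula, while in PSD, where drafting overlaps with verification, the achievable speedup is bounded by the target/draft speed ratio `1/c`:

```python
# Illustrative sketch of the speedup analysis the abstract alludes to.
# alpha, c, and k below are hypothetical example values, not figures from the paper.

def sd_speedup(alpha: float, c: float, k: int) -> float:
    """Expected wall-clock speedup of sequential draft-then-verify decoding.

    alpha: per-token acceptance rate of the draft model
    c:     draft-model cost relative to the target model (0 < c < 1)
    k:     number of draft tokens proposed per verification step
    """
    # Expected tokens accepted per verification step (geometric series).
    expected_tokens = (1 - alpha ** (k + 1)) / (1 - alpha)
    # Cost of one step: k draft forward passes plus one target pass.
    cost_per_step = k * c + 1
    return expected_tokens / cost_per_step

def psd_ceiling(c: float) -> float:
    """In parallel SD, drafting overlaps verification, so the speedup
    is bounded above by the target/draft speed ratio 1/c."""
    return 1.0 / c

print(sd_speedup(alpha=0.8, c=0.2, k=4))  # ≈ 1.87
print(psd_ceiling(c=0.2))                 # 5.0
```

With these example numbers, sequential SD yields roughly a 1.9x speedup while the PSD ceiling is 5x, which is the kind of gap the paper's iterative retrieval speculation aims to close.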
Related papers
- TriSpec: Ternary Speculative Decoding via Lightweight Proxy Verification [63.65902785448346]
Speculative decoding offers significant speed-ups through its lightweight drafting and parallel verification mechanism. We propose TriSpec, a novel ternary SD framework that introduces a lightweight proxy to significantly reduce computational cost. Experiments on the Qwen3 and DeepSeek-R1-Distill-Qwen/LLaMA families show that TriSpec achieves up to 35% speedup over standard SD.
arXiv Detail & Related papers (2026-01-30T17:04:18Z) - DART: Diffusion-Inspired Speculative Decoding for Fast LLM Inference [27.204773545145326]
DART is a diffusion-inspired speculative decoding framework for large language models. It leverages parallel generation to reduce drafting latency. It achieves a 2.03x-3.44x wall-clock time speedup across multiple datasets.
arXiv Detail & Related papers (2026-01-27T07:04:24Z) - VVS: Accelerating Speculative Decoding for Visual Autoregressive Generation via Partial Verification Skipping [52.58270801983525]
Speculative decoding (SD) has been proven effective for accelerating visual AR models. We propose VVS, a novel framework to accelerate visual AR generation via partial verification skipping.
arXiv Detail & Related papers (2025-11-17T16:50:58Z) - SpecDiff-2: Scaling Diffusion Drafter Alignment For Faster Speculative Decoding [48.96349422252313]
Speculative decoding has become the standard approach for accelerating Large Language Model (LLM) inference. It exploits a lossless draft-then-verify procedure to circumvent the latency of autoregressive decoding, achieving impressive speed-ups. This paper proposes SpecDiff-2, a novel framework to jointly address these two bottlenecks.
arXiv Detail & Related papers (2025-11-01T16:12:56Z) - Mirror Speculative Decoding: Breaking the Serial Barrier in LLM Inference [11.957170239588535]
Speculative decoding accelerates inference by using a draft model to look ahead. Prior methods partially reduce draft cost but either degrade acceptance or introduce overheads that limit scaling. We present Mirror Speculative Decoding (Mirror-SD), an inference algorithm that breaks the latency-acceptance tradeoff.
arXiv Detail & Related papers (2025-10-15T05:22:57Z) - DiffuSpec: Unlocking Diffusion Language Models for Speculative Decoding [66.40658898418316]
We present DiffuSpec, a training-free drop-in framework that uses a pretrained diffusion language model (DLM) to produce multi-token drafts in a single forward pass. Across benchmarks, DiffuSpec yields up to 3x wall-clock speedup, establishing diffusion-based drafting as a robust alternative to autoregressive drafters for speculative decoding.
arXiv Detail & Related papers (2025-09-28T07:00:15Z) - FastEagle: Cascaded Drafting for Accelerating Speculative Decoding [6.482154864678126]
We present FastEagle, a non-autoregressive cascaded drafter that emits an entire draft in a single forward pass. FastEagle delivers substantial wall-clock speedups over strong autoregressive drafters while maintaining competitive acceptance behavior.
arXiv Detail & Related papers (2025-09-24T09:38:32Z) - Pipeline Parallelism is All You Need for Optimized Early-Exit Based Self-Speculative Decoding [73.67253077506672]
Large language models (LLMs) deliver impressive generation quality, but incur very high inference cost. Early-exit based self-speculative decoding (EESD) has emerged to mitigate this cost. We propose Pipeline-Parallel Self-Speculative Decoding (PPSD) that fully pipelines the draft and verification work.
arXiv Detail & Related papers (2025-09-19T04:51:41Z) - MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation [74.34220141721231]
We present MPQ-DMv2, an improved Mixed Precision Quantization framework for extremely low-bit Diffusion Models.
arXiv Detail & Related papers (2025-07-06T08:16:50Z) - Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies [7.14946066475415]
Speculative decoding (SD) methods offer substantial efficiency gains by generating multiple tokens using a single target forward pass. Existing SD approaches require the drafter and target models to share the same vocabulary, thus limiting the pool of possible drafters. We present three new SD methods that remove this shared-vocabulary constraint. Our algorithms demonstrate significant speedups of up to 2.8x over standard autoregressive decoding.
arXiv Detail & Related papers (2025-01-31T19:13:58Z) - FastLR: Non-Autoregressive Lipreading Model with Integrate-and-Fire [74.04394069262108]
We propose FastLR, a non-autoregressive (NAR) lipreading model which generates all target tokens simultaneously.
FastLR achieves a speedup of up to 10.97$\times$ compared with the state-of-the-art lipreading model.
arXiv Detail & Related papers (2020-08-06T08:28:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.