DEER: Draft with Diffusion, Verify with Autoregressive Models
- URL: http://arxiv.org/abs/2512.15176v1
- Date: Wed, 17 Dec 2025 08:19:04 GMT
- Title: DEER: Draft with Diffusion, Verify with Autoregressive Models
- Authors: Zicong Cheng, Guo-Wei Yang, Jia Li, Zhijie Deng, Meng-Hao Guo, Shi-Min Hu,
- Abstract summary: Speculative decoding mitigates the inherent latency of autoregressive decoding.<n>We introduce DEER, an efficient speculative decoding framework.<n>Experiments show DEER reaches draft acceptance lengths of up to 32 tokens.
- Score: 33.19684425811274
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Efficiency, as a critical practical challenge for LLM-driven agentic and reasoning systems, is increasingly constrained by the inherent latency of autoregressive (AR) decoding. Speculative decoding mitigates this cost through a draft-verify scheme, yet existing approaches rely on AR draft models (a.k.a., drafters), which introduce two fundamental issues: (1) step-wise uncertainty accumulation leads to a progressive collapse of trust between the target model and the drafter, and (2) inherently sequential decoding of AR drafters. Together, these factors cause limited speedups. In this paper, we show that a diffusion large language model (dLLM) drafters can naturally overcome these issues through its fundamentally different probabilistic modeling and efficient parallel decoding strategy. Building on this insight, we introduce DEER, an efficient speculative decoding framework that drafts with diffusion and verifies with AR models. To enable high-quality drafting, DEER employs a two-stage training pipeline to align the dLLM-based drafters with the target AR model, and further adopts single-step decoding to generate long draft segments. Experiments show DEER reaches draft acceptance lengths of up to 32 tokens, far surpassing the 10 tokens achieved by EAGLE-3. Moreover, on HumanEval with Qwen3-30B-A3B, DEER attains a 5.54x speedup, while EAGLE-3 achieves only 2.41x. Code, model, demo, etc, will be available at https://czc726.github.io/DEER/
Related papers
- DART: Diffusion-Inspired Speculative Decoding for Fast LLM Inference [27.204773545145326]
DART is a speculative decoding framework for large language models (dLLMs)<n>It leverages parallel generation to reduce drafting latency.<n>It achieves a 2.03x--3.44x wall-clock time speedup across multiple datasets.
arXiv Detail & Related papers (2026-01-27T07:04:24Z) - TiDAR: Think in Diffusion, Talk in Autoregression [59.94106070312094]
TiDAR is a sequence-level hybrid architecture that drafts tokens (Thinking) in Diffusion and samples final outputs (Talking) AutoRegressively.<n> TiDAR is the first architecture to close the quality gap with AR models while delivering 4.71x to 5.91x more tokens per second.
arXiv Detail & Related papers (2025-11-12T02:59:33Z) - Fast Inference via Hierarchical Speculative Decoding [65.40448210801763]
We introduce Hierarchical Speculative Decoding (HSD), an algorithm that stacks draft models into a hierarchy, where each model proposes tokens, and the next larger model verifies them in a single forward pass.<n>HSD gives up to 1.2x speed-up over the best single-draft baseline.
arXiv Detail & Related papers (2025-10-22T15:56:19Z) - Every Step Counts: Decoding Trajectories as Authorship Fingerprints of dLLMs [63.82840470917859]
We show that the decoding mechanism of dLLMs can be used as a powerful tool for model attribution.<n>We propose a novel information extraction scheme called the Directed Decoding Map (DDM), which captures structural relationships between decoding steps and better reveals model-specific behaviors.
arXiv Detail & Related papers (2025-10-02T06:25:10Z) - DiffuSpec: Unlocking Diffusion Language Models for Speculative Decoding [66.40658898418316]
We present DiffuSpec, a training-free drop-in framework that uses a pretrained diffusion language model (DLM) to produce multi-token drafts in a single forward pass.<n>Across benchmarks, DiffuSpec yields up to 3x wall-clock speedup, establishing diffusion-based drafting as a robust alternative to autoregressive drafters for speculative decoding.
arXiv Detail & Related papers (2025-09-28T07:00:15Z) - PEARL: Parallel Speculative Decoding with Adaptive Draft Length [12.166703341906242]
We propose a conceptually simple, flexible, and general framework to boost speculative decoding, namely Parallel spEculative decoding with Adaptive dRaft Length (PEARL)<n>PEARL proposes pre-verify to verify the first draft token in advance during the drafting phase, and post-verify to generate more draft tokens during the verification phase.<n> Experiments on various text generation benchmarks demonstrate the effectiveness of our PEARL, leading to a superior speed up performance up to 4.43$times$ and 1.50$times$, compared to auto-regressive decoding and vanilla speculative decoding, respectively.
arXiv Detail & Related papers (2024-08-13T08:32:06Z) - EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees [25.703729145091483]
In this paper, we propose a new technique of context-aware dynamic draft tree into drafting modeling.
We conducted extensive evaluations on three series of Large Language Models (LLMs) and six tasks.
arXiv Detail & Related papers (2024-06-24T17:59:11Z) - DistillSpec: Improving Speculative Decoding via Knowledge Distillation [70.61777015900272]
Speculative decoding (SD) accelerates large language model inference by employing a faster draft model for generating multiple tokens.
We propose DistillSpec that uses knowledge distillation to better align the draft model with the target model, before applying SD.
We show that DistillSpec yields impressive 10 - 45% speedups over standard SD on a range of standard benchmarks.
arXiv Detail & Related papers (2023-10-12T16:21:04Z) - Online Speculative Decoding [34.987825705622555]
We introduce online speculative decoding to accelerate the inference of large language models.
The main idea is to continuously update the (multiple) draft model(s) on observed user query data.
We develop a prototype of online speculative decoding based on knowledge distillation and evaluate it using both synthetic and real query data.
arXiv Detail & Related papers (2023-10-11T04:03:42Z) - FastLR: Non-Autoregressive Lipreading Model with Integrate-and-Fire [74.04394069262108]
We propose FastLR, a non-autoregressive (NAR) lipreading model which generates all target tokens simultaneously.
FastLR achieves the speedup up to 10.97$times$ compared with state-of-the-art lipreading model.
arXiv Detail & Related papers (2020-08-06T08:28:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.