FastEagle: Cascaded Drafting for Accelerating Speculative Decoding
- URL: http://arxiv.org/abs/2509.20416v1
- Date: Wed, 24 Sep 2025 09:38:32 GMT
- Authors: Haiduo Huang, Jiangcheng Song, Wenzhe Zhao, Pengju Ren
- Abstract summary: We present FastEagle, a non-autoregressive cascaded drafter that emits an entire draft in a single forward pass. FastEagle delivers substantial wall-clock speedups over strong autoregressive drafters while maintaining competitive acceptance behavior.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speculative decoding accelerates generation by drafting candidates and verifying them in parallel, yet state-of-the-art drafters (e.g., EAGLE) still require N sequential passes to propose N tokens. We present FastEagle, a non-autoregressive cascaded drafter that emits an entire draft in a single forward pass. FastEagle replaces temporal steps with a lightweight layer cascade and trains with layer-wise supervision to mitigate error accumulation. Coupled with a constrained draft tree that preserves lossless verification cost, FastEagle delivers substantial wall-clock speedups over strong autoregressive drafters while maintaining competitive acceptance behavior. Across multiple LLMs (Vicuna-13B, LLaMA-Instruct 3.x, and DeepSeek-R1-Distill-LLaMA) and tasks (MT-Bench, HumanEval, GSM8K, CNN/DM, Alpaca), FastEagle consistently outperforms EAGLE-3 in speedup under both greedy and stochastic decoding, with comparable average acceptance lengths. These results indicate that removing sequential dependencies in drafting is a practical path toward lossless LLM inference acceleration.
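The draft-then-verify loop that FastEagle accelerates can be illustrated with a minimal greedy sketch. This is a hedged toy example, not the paper's implementation: deterministic next-token functions stand in for the draft and target models, and the target's checks run in a Python loop, whereas real speculative decoding verifies all draft tokens in one batched forward pass.

```python
def verify_draft(target_next, prefix, draft_tokens):
    """Greedy speculative verification: accept the longest prefix of the
    draft that the target model would itself have produced, then append
    one token from the target (a correction on mismatch, a bonus token
    otherwise), so every round emits at least one target-quality token."""
    accepted = []
    ctx = list(prefix)
    for tok in draft_tokens:
        t = target_next(ctx)  # target's greedy choice given the context
        if t == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(t)  # mismatch: keep the target's correction
            return accepted
    accepted.append(target_next(ctx))  # all accepted: free bonus token
    return accepted
```

Because only tokens the target agrees with are kept, the output distribution matches plain greedy decoding from the target; the drafter (sequential in EAGLE, single-pass in FastEagle) only changes how many tokens are verified per round.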
Related papers
- TriSpec: Ternary Speculative Decoding via Lightweight Proxy Verification [63.65902785448346]
Speculative decoding offers significant speed-ups through its lightweight drafting and parallel verification mechanism. We propose TriSpec, a novel ternary SD framework that introduces a lightweight proxy to significantly reduce computational cost. Experiments on the Qwen3 and DeepSeek-R1-Distill-Qwen/LLaMA families show that TriSpec achieves up to 35% speedup over standard SD.
arXiv Detail & Related papers (2026-01-30T17:04:18Z) - DART: Diffusion-Inspired Speculative Decoding for Fast LLM Inference [27.204773545145326]
DART is a diffusion-inspired speculative decoding framework for large language models. It leverages parallel generation to reduce drafting latency, achieving a 2.03x--3.44x wall-clock speedup across multiple datasets.
arXiv Detail & Related papers (2026-01-27T07:04:24Z) - Double: Breaking the Acceleration Limit via Double Retrieval Speculative Parallelism [19.7914286780195]
We introduce Double (Double Retrieval Speculative Parallelism). We enable the draft model to execute iterative retrieval speculations to break the theoretical speedup limits. Experiments demonstrate state-of-the-art speedups of 5.3x on LLaMA3.3-70B and 2.8x on Qwen3-32B.
arXiv Detail & Related papers (2026-01-09T04:35:21Z) - Fail Fast, Win Big: Rethinking the Drafting Strategy in Speculative Decoding via Diffusion LLMs [8.881949061263784]
We show that a dLLM's speed from parallel decoding drastically lowers the risk of costly rejections. We present FailFast, a dLLM-based speculative decoding framework that realizes this approach by dynamically adapting its speculation length.
arXiv Detail & Related papers (2025-12-23T18:16:58Z) - Fast Inference via Hierarchical Speculative Decoding [65.40448210801763]
We introduce Hierarchical Speculative Decoding (HSD), an algorithm that stacks draft models into a hierarchy, where each model proposes tokens and the next larger model verifies them in a single forward pass. HSD gives up to a 1.2x speed-up over the best single-draft baseline.
arXiv Detail & Related papers (2025-10-22T15:56:19Z) - Mirror Speculative Decoding: Breaking the Serial Barrier in LLM Inference [11.957170239588535]
Speculative decoding accelerates inference by using a draft model to look ahead. Prior methods partially reduce draft cost but either degrade acceptance or introduce overheads that limit scaling. We present Mirror Speculative Decoding (Mirror-SD), an inference algorithm that breaks the latency-acceptance tradeoff.
arXiv Detail & Related papers (2025-10-15T05:22:57Z) - DiffuSpec: Unlocking Diffusion Language Models for Speculative Decoding [66.40658898418316]
We present DiffuSpec, a training-free drop-in framework that uses a pretrained diffusion language model (DLM) to produce multi-token drafts in a single forward pass. Across benchmarks, DiffuSpec yields up to 3x wall-clock speedup, establishing diffusion-based drafting as a robust alternative to autoregressive drafters for speculative decoding.
arXiv Detail & Related papers (2025-09-28T07:00:15Z) - Pipeline Parallelism is All You Need for Optimized Early-Exit Based Self-Speculative Decoding [73.67253077506672]
Large language models (LLMs) deliver impressive generation quality but incur very high inference cost. Early-exit based self-speculative decoding (EESD) has emerged to mitigate this cost. We propose Pipeline-Parallel Self-Speculative Decoding (PPSD), which fully pipelines the draft and verification work.
arXiv Detail & Related papers (2025-09-19T04:51:41Z) - Diffusion Language Models Know the Answer Before Decoding [56.96815863705218]
Diffusion language models (DLMs) have emerged as an alternative to autoregressive approaches. Our work highlights and leverages an overlooked property of DLMs: early answer convergence. We introduce Prophet, a training-free fast decoding paradigm that enables early-commit decoding.
arXiv Detail & Related papers (2025-08-27T15:40:25Z) - R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning [80.104336426172]
Chain-of-thought (CoT) reasoning enhances the problem-solving ability of large language models, but it incurs substantial inference cost due to long autoregressive trajectories. We introduce R-Stitch, a training-free hybrid decoding framework.
arXiv Detail & Related papers (2025-07-23T08:14:36Z) - Falcon: Faster and Parallel Inference of Large Language Models through Enhanced Semi-Autoregressive Drafting and Custom-Designed Decoding Tree [7.438117410146904]
Falcon is an innovative speculative decoding framework fashioned to augment both the drafter's parallelism and output quality. Falcon incorporates the Coupled Sequential Glancing Distillation technique, which fortifies inter-token dependencies within the same block, leading to increased speculation accuracy.
arXiv Detail & Related papers (2024-12-17T08:02:08Z) - Efficient Inference for Large Language Model-based Generative Recommendation [78.38878421030522]
Large Language Model (LLM)-based generative recommendation has achieved notable success, yet its practical deployment is costly. Applying Speculative Decoding (SD) to generative recommendation presents unique challenges due to the requirement of generating top-K items. We propose an alignment framework named AtSpeed, which presents the AtSpeed-S optimization objective for top-K alignment under strict top-K verification.
arXiv Detail & Related papers (2024-10-07T16:23:36Z) - Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding [65.94521678103237]
Speculative decoding is a widely used method that accelerates the generation process of large language models.
We introduce Ouroboros, which can generate draft phrases to parallelize the drafting process.
Ouroboros can achieve speedups of up to 2.8x over speculative decoding and 3.9x over vanilla decoding.
arXiv Detail & Related papers (2024-02-21T11:31:28Z) - EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty [25.703729145091483]
Autoregression at the feature (second-to-top-layer) level is more straightforward than at the token level. The inherent uncertainty in feature-level autoregression constrains its performance.
arXiv Detail & Related papers (2024-01-26T18:59:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.