Fast Inference via Hierarchical Speculative Decoding
- URL: http://arxiv.org/abs/2510.19705v2
- Date: Thu, 23 Oct 2025 14:15:48 GMT
- Title: Fast Inference via Hierarchical Speculative Decoding
- Authors: Clara Mohri, Haim Kaplan, Tal Schuster, Yishay Mansour, Amir Globerson
- Abstract summary: We introduce Hierarchical Speculative Decoding (HSD), an algorithm that stacks draft models into a hierarchy, where each model proposes tokens, and the next larger model verifies them in a single forward pass. HSD gives up to 1.2x speed-up over the best single-draft baseline.
- Score: 65.40448210801763
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer language models generate text autoregressively, making inference latency proportional to the number of tokens generated. Speculative decoding reduces this latency without sacrificing output quality, by leveraging a small draft model to propose tokens that the larger target model verifies in parallel. In practice, however, there may exist a set of potential draft models, ranging from faster but less accurate to slower yet more reliable. We introduce Hierarchical Speculative Decoding (HSD), an algorithm that stacks these draft models into a hierarchy, where each model proposes tokens, and the next larger model verifies them in a single forward pass, until finally the target model verifies tokens. We derive an expression for the expected latency of any such hierarchy and show that selecting the latency-optimal hierarchy can be done in polynomial time. Empirically, HSD gives up to 1.2x speed-up over the best single-draft baseline, demonstrating the practicality of our algorithm in reducing generation latency beyond previous techniques.
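The draft-verify recursion is easy to see in code. Below is a minimal Python sketch under strong simplifying assumptions: each "model" is a fixed next-token distribution over a toy vocabulary rather than a conditional language model, each verification is simulated token by token rather than as a single batched forward pass, and all names (`verify`, `hsd_step`, `k`, `VOCAB`) are illustrative choices, not taken from the paper. The paper's expected-latency expression and polynomial-time hierarchy selection are not reproduced here.

```python
# Illustrative sketch of hierarchical speculative decoding (HSD).
# "Models" are fixed next-token distributions over a toy vocabulary;
# the accept/resample rule is standard speculative sampling, applied
# at every level of the hierarchy.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary size

def verify(draft_tokens, q, p):
    """Speculative verification of tokens drawn from q against p.

    Accepts a prefix of draft_tokens; on the first rejection, resamples
    one token from the residual distribution norm(max(p - q, 0)).
    """
    out = []
    for t in draft_tokens:
        if rng.random() < min(1.0, p[t] / q[t]):  # accept w.p. min(1, p/q)
            out.append(t)
        else:
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            out.append(rng.choice(VOCAB, p=residual))
            break
    return out

def hsd_step(models, k=6):
    """One HSD round: the smallest model drafts k tokens, then each larger
    model in turn verifies the surviving block, until the target model
    (last in `models`) has verified everything."""
    tokens = list(rng.choice(VOCAB, size=k, p=models[0]))  # smallest model drafts
    for q, p in zip(models, models[1:]):
        tokens = verify(tokens, q, p)  # next larger model verifies
    return tokens

# Three-level hierarchy: small drafter -> mid-size drafter -> target.
models = [rng.dirichlet(np.ones(VOCAB)) for _ in range(3)]
print(hsd_step(models))
```

Because the accept/resample rule leaves surviving tokens distributed exactly according to the verifying model, each block is a valid draft for the next level up; this is what makes stacking drafters lossless, with the final output distributed exactly as the target model's samples.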
Related papers
- TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs [12.056664630923896]
Speculative decoding substantially improves inference efficiency. It is limited by a fundamental constraint: the draft and target models must share the same vocabulary. We propose the algorithm TokenTiming for universal speculative decoding.
arXiv Detail & Related papers (2025-10-17T11:25:36Z)
- CARD: A Cache-Assisted Parallel Speculative Decoding Framework via Query-and-Correct Paradigm for Accelerating LLM Inference [12.056664630923896]
We propose a speculative decoding framework called CARD, which employs a novel query-and-correct paradigm. Our approach decouples drafting from verification, effectively leveraging the draft model's efficiency without additional fine-tuning. CARD significantly outperforms existing state-of-the-art methods, achieving up to a 4.83x acceleration over vanilla autoregressive decoding.
arXiv Detail & Related papers (2025-08-06T14:02:10Z)
- AdaDecode: Accelerating LLM Decoding with Adaptive Layer Parallelism [14.527697328189362]
Large language models (LLMs) are increasingly used for long-content generation. We propose AdaDecode, which accelerates decoding without requiring auxiliary models or changes to the original model parameters. AdaDecode consistently achieves superior decoding throughput with up to 1.73x speedup.
arXiv Detail & Related papers (2025-06-04T08:32:30Z)
- Accelerating Diffusion LLMs via Adaptive Parallel Decoding [17.858104076062897]
We introduce adaptive parallel decoding (APD), a novel method that dynamically adjusts the number of tokens sampled in parallel. APD provides markedly higher throughput with minimal quality degradation on downstream benchmarks.
arXiv Detail & Related papers (2025-05-31T06:10:10Z)
- DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting [60.407727995313074]
Speculative decoding presents a draft-then-verify framework that reduces generation latency while maintaining output distribution fidelity. We propose DuoDecoding, a novel approach that strategically deploys the draft and target models on the CPU and GPU respectively. Our method incorporates a hardware-aware optimal draft budget to minimize idle times and employs dynamic multi-sequence drafting to enhance draft quality. (A toy simulation of the CPU/GPU overlap follows this entry.)
arXiv Detail & Related papers (2025-03-02T08:27:48Z)
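As a hedged illustration of the overlap that motivates this design, the toy Python simulation below uses a worker thread to stand in for the CPU-side drafter while the main thread stands in for the GPU-side verifier. The sleep durations are invented stand-ins for model latencies, and the real dependency of each draft on previously verified tokens (which DuoDecoding's draft budget manages) is deliberately ignored to keep the pipelining visible.

```python
# Toy simulation of overlapping CPU-side drafting with GPU-side
# verification; durations and names are illustrative, not DuoDecoding's.
import time
from concurrent.futures import ThreadPoolExecutor

def draft_block(block_id):          # "CPU": cheap drafting
    time.sleep(0.05)
    return [f"draft{block_id}.{i}" for i in range(4)]

def verify_block(tokens):           # "GPU": expensive verification
    time.sleep(0.10)
    return tokens                   # pretend every token was accepted

with ThreadPoolExecutor(max_workers=1) as cpu:
    pending = cpu.submit(draft_block, 0)             # start drafting block 0
    for block_id in range(1, 4):
        drafted = pending.result()                   # collect finished draft
        pending = cpu.submit(draft_block, block_id)  # draft next block on CPU...
        print(verify_block(drafted))                 # ...while GPU verifies this one
```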
- Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding [59.57151419673759]
Accelerating inference in Large Language Models (LLMs) is critical for real-time interactions. Speculative decoding has gained attention for improving inference speed by drafting and verifying tokens, thereby generating multiple tokens in a single forward pass. We propose Hierarchy Drafting (HD), a novel lossless drafting approach that organizes various token sources into multiple databases in a hierarchical framework based on temporal locality. Our experiments on Spec-Bench using LLMs with 7B and 13B parameters demonstrate that HD outperforms existing database drafting methods, achieving robust inference speedups across model sizes, tasks, and temperatures. (A small lookup-drafting sketch follows this entry.)
arXiv Detail & Related papers (2025-02-08T15:32:53Z)
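To illustrate what retrieval-based drafting from prioritized databases can look like, here is a small Python sketch: candidate continuations are looked up in a chain of n-gram tables, consulted from most to least local. The 2-gram keys, database contents, and function names are assumptions for the example, not HD's actual design; drafted tokens would still be verified by the target model, which is what keeps such drafting lossless.

```python
# Illustrative retrieval-based drafting: chained n-gram lookups across
# databases ordered by temporal locality (current context first).
def build_ngram_db(tokens, key_len=2):
    """Map each key_len-gram to the token that most recently followed it."""
    db = {}
    for i in range(len(tokens) - key_len):
        db[tuple(tokens[i:i + key_len])] = tokens[i + key_len]
    return db

def draft(prefix, databases, max_draft=5, key_len=2):
    """Propose up to max_draft tokens, trying the highest-priority
    (most local) database first at every step."""
    drafted, ctx = [], list(prefix)
    for _ in range(max_draft):
        key = tuple(ctx[-key_len:])
        nxt = next((db[key] for db in databases if key in db), None)
        if nxt is None:          # no database can extend the draft
            break
        drafted.append(nxt)
        ctx.append(nxt)
    return drafted

context = "the cat sat on the mat and the cat sat".split()
dbs = [build_ngram_db(context),                       # most local: ongoing context
       build_ngram_db("a cat sat on a log".split())]  # e.g. older history
print(draft(context, dbs))  # -> ['on', 'the', 'mat', 'and', 'the']
```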
- COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [11.167833073080612]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks. We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome these challenges. Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z)
- Improving Multi-candidate Speculative Decoding [80.18490952057125]
Speculative Decoding (SD) is a technique to accelerate the inference of Large Language Models (LLMs). In this work, we introduce a new version of MCSD that includes target model multi-candidate generation. We also evaluate the effects of using the target model multi-candidate process with different draft models on output quality.
arXiv Detail & Related papers (2024-09-16T18:20:38Z)
- Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion [1.6291177798903276]
Speculative decoding has emerged as a widely adopted method to accelerate large language model inference. This paper proposes an adaptation of speculative decoding which uses discrete diffusion models to generate draft sequences.
arXiv Detail & Related papers (2024-08-10T21:24:25Z)
- Multi-Candidate Speculative Decoding [55.0194604505437]
Large language models have shown impressive capabilities across a variety of NLP tasks, yet generating text autoregressively is time-consuming.
One way to speed them up is speculative decoding, which generates candidate segments from a fast draft model that are then verified in parallel by the target model.
This paper proposes sampling multiple candidates from a draft model and then organising them in batches for verification.
We design algorithms for efficient multi-candidate verification while maintaining the distribution of the target model. (A sketch of this verification idea follows this entry.)
arXiv Detail & Related papers (2024-01-12T17:15:23Z)
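As a hedged illustration of why multiple draft candidates can be verified while preserving the target distribution, the sketch below implements recursive rejection sampling for a single position: each of k i.i.d. draft candidates is tried in turn, and after every rejection the target distribution is replaced by its residual, so the returned token is distributed exactly according to the target. The paper's batched, without-replacement algorithms are richer than this single-position toy.

```python
# Recursive rejection sampling over k i.i.d. draft candidates; the output
# is distributed exactly as p. The k, p, and q values are illustrative.
import numpy as np

rng = np.random.default_rng(1)

def multi_candidate_sample(p, q, k=4):
    """Return one token distributed exactly as p, using up to k draft tries."""
    candidates = rng.choice(len(q), size=k, p=q)  # i.i.d. draft proposals
    for x in candidates:
        if rng.random() < min(1.0, p[x] / q[x]):
            return int(x)                 # candidate accepted
        p = np.maximum(p - q, 0.0)        # residual target after a rejection
        p /= p.sum()
    return int(rng.choice(len(p), p=p))   # all k rejected: sample the residual

p = np.array([0.5, 0.3, 0.15, 0.05])     # toy target distribution
q = np.array([0.25, 0.25, 0.25, 0.25])   # toy draft distribution
print(multi_candidate_sample(p, q))
```

Averaging many calls recovers `p` empirically, which is the "maintaining the distribution of the target model" property the entry refers to.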
- Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding [82.05519287513444]
We propose a Fast and Robust Early-Exiting framework, which incorporates a shallow-deep module and synchronized parallel decoding.
Our framework enables faster inference by synchronizing the decoding process of the current token with previously stacked early-exited tokens.
As parallel decoding allows us to observe predictions from both shallow and deep models, we present a novel adaptive threshold estimator. (A minimal early-exit sketch follows this entry.)
arXiv Detail & Related papers (2023-10-09T05:53:05Z)
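As a minimal sketch of the confidence-thresholded early exit underlying such frameworks: a token is emitted from the shallow model whenever its top softmax probability clears a threshold, and falls through to the deep model otherwise. The fixed threshold and toy distributions are placeholders; the paper's adaptive threshold estimator and synchronized parallel verification are not modeled here.

```python
# Confidence-based early exit with a fixed (assumed) threshold.
import numpy as np

def decode_token(shallow_probs, deep_probs, threshold=0.9):
    """Early-exit rule: trust the shallow model only when it is confident."""
    if shallow_probs.max() >= threshold:
        return int(shallow_probs.argmax()), "shallow"  # early exit
    return int(deep_probs.argmax()), "deep"            # full forward pass

shallow = np.array([0.95, 0.03, 0.02])  # confident: exits early
deep = np.array([0.40, 0.35, 0.25])
print(decode_token(shallow, deep))      # -> (0, 'shallow')
```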