Related papers: DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting

DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting

URL: http://arxiv.org/abs/2503.00784v1
Date: Sun, 02 Mar 2025 08:27:48 GMT
Title: DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting
Authors: Kai Lv, Honglin Guo, Qipeng Guo, Xipeng Qiu,
Abstract summary: Speculative decoding presents a draft-then-verify framework that reduces generation latency while maintaining output distribution fidelity.<n>We propose DuoDecoding, a novel approach that strategically deploys the draft and target models on the CPU and GPU respectively.<n>Our method incorporates a hardware-aware optimal draft budget to minimize idle times and employs dynamic multi-sequence drafting to enhance draft quality.
Score: 59.57151419673759
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) exhibit exceptional performance across a wide range of tasks; however, their token-by-token autoregressive generation process significantly hinders inference speed. Speculative decoding presents a promising draft-then-verify framework that reduces generation latency while maintaining output distribution fidelity. Nevertheless, the draft model introduces additional computational overhead, becoming a performance bottleneck and increasing the time to first token (TTFT). Previous approaches to mitigate draft model overhead have primarily relied on heuristics and generally failed to match the quality of the draft language models. To address these challenges, we propose DuoDecoding, a novel approach that strategically deploys the draft and target models on the CPU and GPU respectively, enabling parallel decoding while preserving draft quality. Our method incorporates a hardware-aware optimal draft budget to minimize idle times and employs dynamic multi-sequence drafting to enhance draft quality. Extensive experiments across seven tasks show that DuoDecoding achieves up to 2.61x speedup in generation latency, while reducing TTFT to 83% of that in conventional speculative decoding. The Code is available at https://github.com/KaiLv69/DuoDecoding.

Related papers

Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios [76.85739138203014]
We present SpecFormer, a novel architecture that accelerates unidirectional and attention mechanisms.<n>We demonstrate that SpecFormer achieves lower training demands and reduced computational costs.
arXiv Detail & Related papers (2025-11-25T14:20:08Z)
SpecDiff-2: Scaling Diffusion Drafter Alignment For Faster Speculative Decoding [48.96349422252313]
Speculative decoding has become the standard approach for accelerating Large Language Model (LLM) inference.<n>It exploits a lossless draft-then-verify procedure to circumvent the latency of autoregressive decoding, achieving impressive speed-ups.<n>This paper proposes SpecDiff-2, a novel framework to jointly address these two bottlenecks.
arXiv Detail & Related papers (2025-11-01T16:12:56Z)
Hawk: Leveraging Spatial Context for Faster Autoregressive Text-to-Image Generation [87.00172597953228]
Speculative decoding has shown promise in accelerating text generation without compromising quality.<n>We introduce Hawk, a new approach that harnesses the spatial structure of images to guide the speculative model toward more accurate and efficient predictions.<n> Experimental results on multiple text-to-image benchmarks demonstrate a 1.71x speedup over standard AR models.
arXiv Detail & Related papers (2025-10-29T17:43:31Z)
Fast Inference via Hierarchical Speculative Decoding [65.40448210801763]
We introduce Hierarchical Speculative Decoding (HSD), an algorithm that stacks draft models into a hierarchy, where each model proposes tokens, and the next larger model verifies them in a single forward pass.<n>HSD gives up to 1.2x speed-up over the best single-draft baseline.
arXiv Detail & Related papers (2025-10-22T15:56:19Z)
TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs [12.056664630923896]
Speculative decoding substantially improves inference efficiency.<n>It is limited by a fundamental constraint: the draft and target models must share the same vocabulary.<n>We propose the algorithm TokenTiming for universal speculative decoding.
arXiv Detail & Related papers (2025-10-17T11:25:36Z)
Free Draft-and-Verification: Toward Lossless Parallel Decoding for Diffusion Large Language Models [8.407364705777587]
We introduce Free Draft-and-Verification (FreeDave), a novel fast decoding algorithm tailored forDLLMs.<n>FreeDave is proven to boost the inference throughput up to $3.78times$ without performance degradation.
arXiv Detail & Related papers (2025-09-30T21:28:04Z)
OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding [8.589209709453026]
We propose OmniDraft, a unified framework that enables a single draft model to operate with any target model.<n>We introduce an online n-gram cache with hybrid distillation fine-tuning to address the cross-vocabulary mismatch across draft and target models.<n>We showcase the proficiency of the framework by performing online learning on math reasoning, coding and text generation tasks.
arXiv Detail & Related papers (2025-07-03T14:20:41Z)
SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences [4.268504966623081]
We introduce SpecExtend, a drop-in enhancement that improves the performance of speculative decoding on long sequences.<n>SpecExtend integrates efficient attention mechanisms such as FlashAttention and Hybrid Tree Attention into both the draft and target models.<n>To improve draft accuracy and speed, we propose Cross-model Retrieval, a novel KV cache update strategy.
arXiv Detail & Related papers (2025-05-27T06:30:00Z)
FLASH: Latent-Aware Semi-Autoregressive Speculative Decoding for Multimodal Tasks [41.04727840852988]
Large language and multimodal models (LLMs and LMMs) exhibit strong inference capabilities but are often limited by slow decoding speeds.<n>This challenge is especially acute in LMMs, where visual inputs typically comprise more tokens with lower information density than text.<n>We propose textbfFLASH (Fast Latent-Aware Semi-Autoregressive Heuristics), a speculative decoding framework designed specifically for LMMs.
arXiv Detail & Related papers (2025-05-19T05:35:30Z)
ParallelSpec: Parallel Drafter for Efficient Speculative Decoding [62.68430939686566]
We present ParallelSpec, an alternative to auto-regressive drafting strategies in state-of-the-art speculative decoding approaches. In contrast to auto-regressive drafting in the speculative stage, we train a parallel drafter to serve as an efficient speculative model.
arXiv Detail & Related papers (2024-10-08T01:05:08Z)
PEARL: Parallel Speculative Decoding with Adaptive Draft Length [12.166703341906242]
We propose a conceptually simple, flexible, and general framework to boost speculative decoding, namely Parallel spEculative decoding with Adaptive dRaft Length (PEARL) PEARL proposes pre-verify to verify the first draft token in advance during the drafting phase, and post-verify to generate more draft tokens during the verification phase. Experiments on various text generation benchmarks demonstrate the effectiveness of our PEARL, leading to a superior speed up performance up to 4.43$times$ and 1.50$times$, compared to auto-regressive decoding and vanilla speculative decoding, respectively.
arXiv Detail & Related papers (2024-08-13T08:32:06Z)
Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion [55.0194604505437]
Speculative decoding has emerged as a widely adopted method to accelerate large language model inference.<n>This paper proposes an adaptation of speculative decoding which uses discrete diffusion models to generate draft sequences.
arXiv Detail & Related papers (2024-08-10T21:24:25Z)
Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens [15.566726645722657]
We propose a novel framework specifically designed for speculative sampling. Within this framework, we introduce a lightweight draft model that effectively utilizes previously generated tokens to predict subsequent words. We demonstrate impressive results, achieving an average latency speedup ratio of 2.7x compared to the vanilla auto-regressive decoding approach.
arXiv Detail & Related papers (2024-02-24T08:10:39Z)
Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding [65.94521678103237]
Speculative decoding is a widely used method that accelerates the generation process of large language models. We introduce Ouroboros, which can generate draft phrases to parallelize the drafting process. Ouroboros can achieve speedups of up to $2.8times$ over speculative decoding and $3.9times$ over vanilla decoding.
arXiv Detail & Related papers (2024-02-21T11:31:28Z)
Speculative Streaming: Fast LLM Inference without Auxiliary Models [21.454206732725563]
Speculative Streaming is a single-model speculative decoding method. It fuses drafting into the target model by changing the fine-tuning objective from next token prediction to future n-gram prediction. It speeds up decoding by 1.8 - 3.1X in a diverse set of tasks.
arXiv Detail & Related papers (2024-02-16T23:36:43Z)
Cascade Speculative Drafting for Even Faster LLM Inference [25.642604897018852]
Speculative decoding improves the efficiency of large language model (LLM) inference. We introduce Cascade Speculative Drafting (CS Drafting), a speculative execution algorithm that incorporates two types of cascades. CS Drafting achieves up to an 81 percent additional speedup over speculative decoding in our experiments.
arXiv Detail & Related papers (2023-12-18T18:59:46Z)
DistillSpec: Improving Speculative Decoding via Knowledge Distillation [70.61777015900272]
Speculative decoding (SD) accelerates large language model inference by employing a faster draft model for generating multiple tokens. We propose DistillSpec that uses knowledge distillation to better align the draft model with the target model, before applying SD. We show that DistillSpec yields impressive 10 - 45% speedups over standard SD on a range of standard benchmarks.
arXiv Detail & Related papers (2023-10-12T16:21:04Z)
Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation [80.2267931231335]
We propose Speculative Decoding (SpecDec) to study exploiting the idea of speculative execution to accelerate autoregressive (AR) decoding. SpecDec has two innovations: Spec-Drafter -- an independent model specially optimized for efficient drafting, and Spec-Verification -- a reliable method for verifying the drafted tokens efficiently.
arXiv Detail & Related papers (2022-03-30T17:27:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.