SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences
- URL: http://arxiv.org/abs/2505.20776v3
- Date: Mon, 29 Sep 2025 12:34:50 GMT
- Title: SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences
- Authors: Jungyoub Cha, Hyunjong Kim, Sungzoon Cho
- Abstract summary: SpecExtend improves speculative decoding on long sequences without additional training. To improve both draft accuracy and speed on long inputs without retraining, we propose Cross-model Retrieval. SpecExtend accelerates speculative decoding by up to 2.84x on 16K-token long summarization and up to 3.86x on long reasoning.
- Score: 11.225649178057695
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speculative decoding is a widely used technique for accelerating inference in large language models (LLMs), but its performance degrades as input length grows, with significant drops even at moderate lengths. Yet, this early degradation has remained largely underexplored. We introduce SpecExtend, a drop-in enhancement that improves speculative decoding on long sequences without additional training. SpecExtend integrates efficient attention mechanisms such as FlashAttention and Hybrid Tree Attention to accelerate prefill and verification steps. To improve both draft accuracy and speed on long inputs without retraining, we propose Cross-model Retrieval, a novel KV cache eviction strategy that leverages the target model's attention scores to dynamically select relevant context for the smaller draft model. Extensive evaluations show that SpecExtend accelerates speculative decoding by up to 2.84x on 16K-token long summarization and up to 3.86x on long reasoning, while preserving the short-input performance of state-of-the-art frameworks. Our code is available at https://github.com/jycha98/SpecExtend .
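The Cross-model Retrieval idea in the abstract (using the target model's attention scores to choose which context the draft model keeps) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_draft_context`, the fixed-size chunking, the mean-score ranking, and the always-kept recent window are all assumptions made for illustration, and the attention scores are passed in as a plain list rather than read from a real model.

```python
from typing import List


def select_draft_context(
    attn_scores: List[float],
    chunk_size: int,
    top_k: int,
    keep_recent: int = 0,
) -> List[int]:
    """Return sorted token positions to retain in the draft model's KV cache.

    attn_scores: one score per context token, assumed to come from the
                 target model's attention (hypothetical input format).
    chunk_size:  number of tokens per chunk (assumed granularity).
    top_k:       number of highest-scoring chunks to keep.
    keep_recent: number of most recent tokens that are always kept (assumption).
    """
    n = len(attn_scores)

    # Score each fixed-size chunk by the mean attention it receives.
    chunks = []
    for start in range(0, n, chunk_size):
        end = min(start + chunk_size, n)
        mean_score = sum(attn_scores[start:end]) / (end - start)
        chunks.append((mean_score, start, end))

    # Keep the top_k chunks with the highest mean score.
    best = sorted(chunks, key=lambda c: c[0], reverse=True)[:top_k]

    keep = set()
    for _, start, end in best:
        keep.update(range(start, end))

    # Always keep the most recent tokens, regardless of their score.
    keep.update(range(max(0, n - keep_recent), n))
    return sorted(keep)


if __name__ == "__main__":
    scores = [0.01, 0.02, 0.5, 0.4, 0.01, 0.01, 0.03, 0.02]
    print(select_draft_context(scores, chunk_size=2, top_k=1, keep_recent=2))
```

In this toy setting the draft model would attend only to the returned positions, so its cache stays small while the target model still verifies against the full context.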
Related papers
- Length-Adaptive Interest Network for Balancing Long and Short Sequence Modeling in CTR Prediction [50.094751096858204]
LAIN is a plug-and-play framework that incorporates sequence length as a conditioning signal to balance long- and short-sequence modeling. Our work offers a general, efficient, and deployable solution to mitigate length-induced bias in sequential recommendation.
arXiv Detail & Related papers (2026-01-27T03:14:20Z)
- Training-free Context-adaptive Attention for Efficient Long Context Modeling [57.703159205740185]
Training-free Context-adaptive Attention (TCA-Attention) is a training-free sparse attention mechanism that selectively attends to only the informative tokens for efficient long-context inference. TCA-Attention achieves a 2.8x speedup and reduces KV cache by 61% at 128K context length while maintaining performance comparable to full attention.
arXiv Detail & Related papers (2025-12-10T01:54:57Z)
- SpecPV: Improving Self-Speculative Decoding for Long-Context Generation via Partial Verification [11.366541829206199]
Speculative decoding is one of the most direct and effective approaches for accelerating generation. We introduce SpecPV, a self-speculative decoding approach that performs fast verification using partial key-value states. We validate SpecPV across multiple long-context benchmarks and models, including LLaMA-3.1-8B-Instruct and Qwen3-series.
arXiv Detail & Related papers (2025-12-02T02:15:33Z)
- InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation [56.694702609077495]
Long-sequence processing is a critical capability for modern large language models. InfLLM-V2 is a trainable sparse attention framework that seamlessly adapts models from short to long sequences. In experiments, InfLLM-V2 is 4x faster than dense attention while retaining 98.1% and 99.7% of the performance.
arXiv Detail & Related papers (2025-09-29T12:08:33Z)
- SpecExit: Accelerating Large Reasoning Model via Speculative Exit [10.522333173441453]
We propose SpecExit, a framework that predicts both future tokens and an early-exit signal directly from a draft model without probing overhead. Our method offers significant improvements, reducing average generation length by 66% and achieving a 2.5x speedup in end-to-end latency.
arXiv Detail & Related papers (2025-09-29T03:39:32Z)
- Rectified Sparse Attention [61.7702154360081]
Efficient long-sequence generation is a critical challenge for Large Language Models. We propose Rectified Sparse Attention (ReSA), a simple yet effective method that combines block-sparse attention with periodic dense rectification. Experiments across math reasoning, language modeling, and retrieval tasks demonstrate that ReSA achieves near-lossless generation quality.
arXiv Detail & Related papers (2025-06-04T16:01:48Z)
- DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting [59.57151419673759]
Speculative decoding presents a draft-then-verify framework that reduces generation latency while maintaining output distribution fidelity. We propose DuoDecoding, a novel approach that strategically deploys the draft and target models on the CPU and GPU respectively. Our method incorporates a hardware-aware optimal draft budget to minimize idle times and employs dynamic multi-sequence drafting to enhance draft quality.
arXiv Detail & Related papers (2025-03-02T08:27:48Z)
- LongSpec: Long-Context Speculative Decoding with Efficient Drafting and Verification [42.54363549922909]
Speculative decoding has become a promising technique to mitigate the high inference latency of autoregressive decoding in Large Language Models. Despite its promise, the effective application of speculative decoding in LLMs still confronts three key challenges. We enhance the performance of speculative decoding in long-context settings by addressing these challenges.
arXiv Detail & Related papers (2025-02-24T18:53:31Z)
- QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache [67.84112700032007]
Large Language Models (LLMs) are increasingly being deployed on edge devices for long-context settings. In these scenarios, the Key-Value (KV) cache is the primary bottleneck in terms of both GPU memory and latency. We propose a novel self-speculative decoding framework, QuantSpec, where the draft model shares the architecture of the target model but employs a hierarchical 4-bit quantized KV cache and 4-bit quantized weights for acceleration.
arXiv Detail & Related papers (2025-02-05T20:43:48Z)
- Falcon: Faster and Parallel Inference of Large Language Models through Enhanced Semi-Autoregressive Drafting and Custom-Designed Decoding Tree [7.438117410146904]
Falcon is an innovative speculative decoding framework fashioned to augment both the drafter's parallelism and output quality. Falcon incorporates the Coupled Sequential Glancing Distillation technique, which fortifies inter-token dependencies within the same block, leading to increased speculation accuracy.
arXiv Detail & Related papers (2024-12-17T08:02:08Z)
- RefreshKV: Updating Small KV Cache During Long-form Generation [54.00118604124301]
We propose a new inference method, RefreshKV, that flexibly alternates between full context attention and attention over a subset of input tokens during generation. Applying our method to off-the-shelf LLMs achieves comparable speedup to eviction-based methods while improving performance for various long-form generation tasks.
arXiv Detail & Related papers (2024-11-08T18:57:07Z)
- MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding [12.74265334789358]
We show that speculative decoding can achieve speedup even for a high throughput inference regime for moderate to long sequences. We propose a theoretical model to select the optimal drafting strategy for maximum speedup. For moderate to long sequences, we demonstrate up to 2.51x speedup for Llama3.1-8B when serving batch sizes ranging from 32 to 256.
arXiv Detail & Related papers (2024-08-20T17:57:31Z)
- Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion [55.0194604505437]
Speculative decoding has emerged as a widely adopted method to accelerate large language model inference. This paper proposes an adaptation of speculative decoding which uses discrete diffusion models to generate draft sequences.
arXiv Detail & Related papers (2024-08-10T21:24:25Z)
- Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
arXiv Detail & Related papers (2024-06-24T15:55:59Z)
- Speculative Streaming: Fast LLM Inference without Auxiliary Models [21.454206732725563]
Speculative Streaming is a single-model speculative decoding method.
It fuses drafting into the target model by changing the fine-tuning objective from next token prediction to future n-gram prediction.
It speeds up decoding by 1.8 - 3.1X in a diverse set of tasks.
arXiv Detail & Related papers (2024-02-16T23:36:43Z)
- Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation [80.2267931231335]
We propose Speculative Decoding (SpecDec) to study exploiting the idea of speculative execution to accelerate autoregressive (AR) decoding.
SpecDec has two innovations: Spec-Drafter -- an independent model specially optimized for efficient drafting, and Spec-Verification -- a reliable method for verifying the drafted tokens efficiently.
arXiv Detail & Related papers (2022-03-30T17:27:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.