CAS-Spec: Cascade Adaptive Self-Speculative Decoding for On-the-Fly Lossless Inference Acceleration of LLMs
- URL: http://arxiv.org/abs/2510.26843v1
- Date: Thu, 30 Oct 2025 08:51:29 GMT
- Title: CAS-Spec: Cascade Adaptive Self-Speculative Decoding for On-the-Fly Lossless Inference Acceleration of LLMs
- Authors: Zhiyuan Ning, Jiawei Shao, Ruge Xu, Xinfei Guo, Jun Zhang, Chi Zhang, Xuelong Li
- Abstract summary: Speculative decoding offers seamless integration and broad utility when deploying large language models. Cascading a hierarchy of draft models promises further acceleration and flexibility, but the high cost of training multiple models has limited its practical application. We propose a novel Cascade Adaptive Self-Speculative Decoding (CAS-Spec) method which constructs speculative draft models by leveraging dynamically switchable inference acceleration strategies.
- Score: 48.8252978488871
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Speculative decoding has become widely adopted as an effective technique for lossless inference acceleration when deploying large language models (LLMs). While on-the-fly self-speculative methods offer seamless integration and broad utility, they often fall short of the speed gains achieved by methods relying on specialized training. Cascading a hierarchy of draft models promises further acceleration and flexibility, but the high cost of training multiple models has limited its practical application. In this paper, we propose a novel Cascade Adaptive Self-Speculative Decoding (CAS-Spec) method which constructs speculative draft models by leveraging dynamically switchable inference acceleration (DSIA) strategies, including layer sparsity and activation quantization. Furthermore, traditional vertical and horizontal cascade algorithms are inefficient when applied to self-speculative decoding methods. We introduce a Dynamic Tree Cascade (DyTC) algorithm that adaptively routes the multi-level draft models and assigns the draft lengths, based on the heuristics of acceptance rates and latency prediction. Our CAS-Spec method achieves state-of-the-art acceleration compared to existing on-the-fly speculative decoding methods, with an average speedup from $1.1\times$ to $2.3\times$ over autoregressive decoding across various LLMs and datasets. DyTC improves the average speedup by $47$\% and $48$\% over cascade-based baseline and tree-based baseline algorithms, respectively. CAS-Spec can be easily integrated into most existing LLMs and holds promising potential for further acceleration as self-speculative decoding techniques continue to evolve.
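As background for the mechanism described in the abstract, the draft-then-verify loop and the acceptance-rate/latency heuristic used to choose draft lengths can be sketched as follows. This is an illustrative toy implementation, not the authors' code: token ids are plain integers, `cost_ratio` is a hypothetical draft-to-target latency ratio, and the independence assumption behind `expected_tokens` is the standard speculative-decoding heuristic rather than anything specific to CAS-Spec.

```python
def verify_greedy(draft_tokens, target_tokens):
    """Accept the longest prefix of draft tokens that matches the target
    model's own greedy choices; on the first mismatch, emit the target's
    token instead. Output is identical to pure target decoding (lossless).
    """
    out = []
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            out.append(t)          # correction token from the target model
            return out, False      # remaining draft tokens are discarded
        out.append(d)
    return out, True               # every draft token was accepted

def expected_tokens(alpha, k):
    """Expected tokens per verification step if each draft token is
    accepted independently with probability alpha (0 < alpha < 1)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def best_draft_length(alpha, cost_ratio, max_k=16):
    """Choose the draft length k maximizing expected tokens per unit time,
    where one step costs k * cost_ratio + 1 target-forward equivalents."""
    return max(range(1, max_k + 1),
               key=lambda k: expected_tokens(alpha, k) / (k * cost_ratio + 1))

# Toy run: the first two draft tokens match, the third is corrected.
tokens, fully_accepted = verify_greedy([5, 7, 9, 2], [5, 7, 4, 8])
```

A higher acceptance rate or a cheaper draft model pushes `best_draft_length` toward longer drafts, which is the intuition behind DyTC's adaptive assignment of draft lengths from acceptance-rate and latency estimates.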
Related papers
- KnapSpec: Self-Speculative Decoding via Adaptive Layer Selection as a Knapsack Problem [12.668341559890605]
We propose KnapSpec, a training-free framework that reformulates draft model selection as a knapsack problem to maximize tokens-per-time throughput. We provide the first rigorous theoretical analysis establishing cosine similarity between hidden states as a mathematically sound proxy for the token acceptance rate. Our experiments on Qwen3 and Llama3 demonstrate that KnapSpec consistently outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2026-02-23T08:13:03Z) - Adaptive Structured Pruning of Convolutional Neural Networks for Time Series Classification [0.776514389034479]
We propose Dynamic Structured Pruning (DSP), a fully automatic, structured pruning framework for convolution-based TSC models. Our approach achieves an average compression of 58% for LITETime and 75% for InceptionTime architectures while maintaining classification accuracy. Redundancy analyses confirm that DSP produces compact and informative representations, offering a practical path for scalable and efficient deep TSC deployment.
arXiv Detail & Related papers (2026-02-13T09:18:59Z) - Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios [76.85739138203014]
We present SpecFormer, a novel architecture for non-autoregressive draft forecasting that scales speculative decoding to large-batch scenarios. We demonstrate that SpecFormer achieves lower training demands and reduced computational costs.
arXiv Detail & Related papers (2025-11-25T14:20:08Z) - Fast Inference via Hierarchical Speculative Decoding [65.40448210801763]
We introduce Hierarchical Speculative Decoding (HSD), an algorithm that stacks draft models into a hierarchy, where each model proposes tokens, and the next larger model verifies them in a single forward pass. HSD gives up to 1.2x speed-up over the best single-draft baseline.
arXiv Detail & Related papers (2025-10-22T15:56:19Z) - TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs [12.056664630923896]
Speculative decoding substantially improves inference efficiency. It is limited, however, by a fundamental constraint: the draft and target models must share the same vocabulary. We propose the algorithm TokenTiming for universal speculative decoding.
arXiv Detail & Related papers (2025-10-17T11:25:36Z) - FastGRPO: Accelerating Policy Optimization via Concurrency-aware Speculative Decoding and Online Draft Learning [11.68914161151634]
Group relative policy optimization (GRPO) has demonstrated significant potential in improving the reasoning capabilities of large language models. We propose a speculative decoding framework that adjusts the drafting and verification strategy according to real-time concurrency levels. We show that the proposed method achieves end-to-end speedups of 2.35x to 2.72x, significantly surpassing baseline approaches in efficiency.
arXiv Detail & Related papers (2025-09-26T02:48:41Z) - CLaSp: In-Context Layer Skip for Self-Speculative Decoding [20.800300833576035]
We propose CLaSp, an in-context layer-skipping strategy for self-speculative decoding. Unlike prior methods, CLaSp does not require additional drafting modules or extra training. CLaSp achieves a speedup of 1.3x to 1.7x on LLaMA3-series models without altering the original distribution of the generated text.
arXiv Detail & Related papers (2025-05-30T04:15:06Z) - QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache [67.84112700032007]
Large Language Models (LLMs) are increasingly being deployed on edge devices for long-context settings. In these scenarios, the Key-Value (KV) cache is the primary bottleneck in terms of both GPU memory and latency. We propose a novel self-speculative decoding framework, QuantSpec, where the draft model shares the architecture of the target model but employs a hierarchical 4-bit quantized KV cache and 4-bit quantized weights for acceleration.
arXiv Detail & Related papers (2025-02-05T20:43:48Z) - Search for Efficient Large Language Models [52.98684997131108]
Large Language Models (LLMs) have long been central to artificial intelligence research.
Weight pruning, quantization, and distillation have been embraced to compress LLMs, targeting memory reduction and inference acceleration.
Most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures.
arXiv Detail & Related papers (2024-09-25T21:32:12Z) - A-SDM: Accelerating Stable Diffusion through Model Assembly and Feature Inheritance Strategies [51.7643024367548]
Stable Diffusion Model is a prevalent and effective model for text-to-image (T2I) and image-to-image (I2I) generation.
This study focuses on reducing redundant computation in SDM and optimizing the model through both tuning and tuning-free methods.
arXiv Detail & Related papers (2024-05-31T21:47:05Z) - Cascade Speculative Drafting for Even Faster LLM Inference [24.199907533534535]
Speculative decoding improves the efficiency of large language model (LLM) inference. We introduce Cascade Speculative Drafting, a speculative execution algorithm that incorporates two types of cascades. CS Drafting achieves greater speedup compared to the baselines in our experiments, while preserving the same output distribution as the target model.
arXiv Detail & Related papers (2023-12-18T18:59:46Z) - Layer Pruning on Demand with Intermediate CTC [50.509073206630994]
We present a training and pruning method for ASR based on the connectionist temporal classification (CTC).
We show that a Transformer-CTC model can be pruned in various depth on demand, improving real-time factor from 0.005 to 0.002 on GPU.
arXiv Detail & Related papers (2021-06-17T02:40:18Z)
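Several of the entries above, KnapSpec in particular, frame draft construction as a constrained selection over layers. A generic 0/1-knapsack selection can be sketched as follows; this is illustrative only, with made-up integer latency costs and per-layer scores, and it does not reproduce KnapSpec's actual tokens-per-time objective or its cosine-similarity acceptance proxy.

```python
def select_layers(costs, gains, budget):
    """0/1 knapsack over transformer layers: choose a subset whose total
    (integer) latency cost fits in `budget` while maximizing the summed
    usefulness score. Classic DP, O(len(costs) * budget) time."""
    best = [0.0] * (budget + 1)
    for cost, gain in zip(costs, gains):
        for b in range(budget, cost - 1, -1):  # reverse: each layer used once
            best[b] = max(best[b], best[b - cost] + gain)
    return best[budget]

# Three layers with latency costs 2, 3, 4 and scores 3, 4, 5 under budget 5:
# the optimal subset is the first two layers, with total score 7.
```

The reverse inner loop is what enforces the 0/1 constraint: iterating budgets downward ensures each layer contributes at most once to any solution.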
This list is automatically generated from the titles and abstracts of the papers in this site.