Nightjar: Dynamic Adaptive Speculative Decoding for Large Language Models Serving
- URL: http://arxiv.org/abs/2512.22420v1
- Date: Sat, 27 Dec 2025 00:57:55 GMT
- Title: Nightjar: Dynamic Adaptive Speculative Decoding for Large Language Models Serving
- Authors: Rui Li, Zhaoning Zhang, Libo Zhang, Huaimin Wang, Xiang Fu, Zhiquan Lai,
- Abstract summary: Nightjar is a novel learning-based algorithm for adaptive speculative inference. Nightjar achieves up to 14.8% higher throughput and 20.2% lower latency compared to standard speculative decoding.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speculative decoding (SD) accelerates LLM inference by verifying draft tokens in parallel. However, this method presents a critical trade-off: it improves throughput in low-load, memory-bound systems but degrades performance in high-load, compute-bound environments due to verification overhead. Current SD implementations use a fixed speculative length, failing to adapt to dynamic request rates and creating a significant performance bottleneck in real-world serving scenarios. To overcome this, we propose Nightjar, a novel learning-based algorithm for adaptive speculative inference that adjusts to request load by dynamically selecting the optimal speculative length for different batch sizes and even disabling speculative decoding when it provides no benefit. Experiments show that Nightjar achieves up to 14.8% higher throughput and 20.2% lower latency compared to standard speculative decoding, demonstrating robust efficiency for real-time serving.
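The abstract describes the mechanism only at a high level: a learned policy that picks a speculative length per batch size and can disable speculation outright. Nightjar's actual algorithm is not reproduced here; the following is a minimal sketch of one plausible reading, an epsilon-greedy controller over batch-size buckets, in which every name (`SpecLengthController`, `choose`, `update`) is hypothetical rather than the paper's API.

```python
import random
from collections import defaultdict

class SpecLengthController:
    """Illustrative epsilon-greedy controller (not Nightjar's real API):
    per batch-size bucket, learn which speculative length k yields the
    best measured goodput, with k = 0 meaning 'speculation disabled'."""

    def __init__(self, candidate_lengths=(0, 2, 4, 6, 8),
                 epsilon=0.1, alpha=0.2):
        self.candidates = candidate_lengths
        self.epsilon = epsilon               # exploration rate
        self.alpha = alpha                   # EMA weight for new measurements
        self.estimates = defaultdict(float)  # goodput estimate per (bucket, k)

    def _bucket(self, batch_size):
        # coarse power-of-two buckets so nearby loads share statistics
        return min(batch_size.bit_length(), 8)

    def choose(self, batch_size):
        bucket = self._bucket(batch_size)
        if random.random() < self.epsilon:
            return random.choice(self.candidates)              # explore
        return max(self.candidates,
                   key=lambda k: self.estimates[(bucket, k)])  # exploit

    def update(self, batch_size, k, tokens_accepted, elapsed_s):
        # goodput = accepted tokens per second for the step just run
        bucket = self._bucket(batch_size)
        goodput = tokens_accepted / max(elapsed_s, 1e-9)
        old = self.estimates[(bucket, k)]
        self.estimates[(bucket, k)] = (1 - self.alpha) * old + self.alpha * goodput
```

Under this reading, a serving loop would call `choose(len(batch))` before each decoding step, run plain autoregressive decoding when it returns 0, and feed the measured acceptance count and wall-clock time back through `update` so the estimates track the current request load.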
Related papers
- AdaSD: Adaptive Speculative Decoding for Efficient Language Model Inference
We propose Adaptive Speculative Decoding (AdaSD) for large language models (LLMs). AdaSD dynamically adjusts generation length and acceptance criteria during inference. Experiments on benchmark datasets demonstrate that AdaSD achieves up to 49% speedup over standard speculative decoding. (The draft-and-verify primitive that this and several later entries build on is sketched in the first code example after this list.)
arXiv Detail & Related papers (2025-12-12T04:56:08Z)
- Training-free Context-adaptive Attention for Efficient Long Context Modeling
Training-free Context-adaptive Attention (TCA-Attention) is a training-free sparse attention mechanism that selectively attends to only the informative tokens for efficient long-context inference. TCA-Attention achieves a 2.8$\times$ speedup and reduces KV cache by 61% at 128K context length while maintaining performance comparable to full attention.
arXiv Detail & Related papers (2025-12-10T01:54:57Z)
- Saber: An Efficient Sampling with Adaptive Acceleration and Backtracking Enhanced Remasking for Diffusion Language Model
Diffusion language models (DLMs) are emerging as a powerful and promising alternative to the dominant autoregressive paradigm. We introduce efficient Sampling with Adaptive acceleration and Backtracking Enhanced Remasking (i.e., Saber) to achieve better inference speed and output quality in code generation.
arXiv Detail & Related papers (2025-10-20T23:38:12Z)
- Free Draft-and-Verification: Toward Lossless Parallel Decoding for Diffusion Large Language Models
We introduce Free Draft-and-Verification (FreeDave), a novel fast decoding algorithm tailored for dLLMs. FreeDave is proven to boost inference throughput by up to $3.78\times$ without performance degradation.
arXiv Detail & Related papers (2025-09-30T21:28:04Z)
- dParallel: Learnable Parallel Decoding for dLLMs
Diffusion large language models (dLLMs) offer parallel token prediction and lower inference latency. Existing open-source models, however, still require a number of decoding steps close to the sequence length to maintain performance. We introduce dParallel, a simple and effective method that unlocks the inherent parallelism of dLLMs for fast sampling.
arXiv Detail & Related papers (2025-09-30T16:32:52Z)
- Wide-In, Narrow-Out: Revokable Decoding for Efficient and Effective DLLMs
Diffusion Large Language Models (DLLMs) have emerged as a compelling alternative to autoregressive models. Existing DLLMs are plagued by a severe quality-speed trade-off, where faster parallel decoding leads to significant performance degradation. We introduce Wide-In, Narrow-Out (WINO), a training-free decoding algorithm that enables revokable decoding in DLLMs.
arXiv Detail & Related papers (2025-07-24T16:51:33Z)
- Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding
Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities. We introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop. We propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality. (A minimal sketch of this confidence-thresholded commit rule appears in the second code example after this list.)
arXiv Detail & Related papers (2025-05-28T17:39:15Z)
- SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences
SpecExtend improves speculative decoding on long sequences without additional training. To improve both draft accuracy and speed on long inputs without retraining, we propose Cross-model Retrieval. SpecExtend accelerates speculative decoding by up to 2.84x on 16K-token long summarization and up to 3.86x on long reasoning.
arXiv Detail & Related papers (2025-05-27T06:30:00Z)
- Constrained Decoding with Speculative Lookaheads
We propose constrained decoding with speculative lookaheads (CDSL). CDSL is motivated by the recently proposed idea of speculative decoding, which uses a much smaller draft LLM for generation and a larger target LLM for verification. We evaluate CDSL in two constrained decoding tasks with three LLM families and achieve 2.2x to 12.15x speedup over CDLH without significant performance reduction.
arXiv Detail & Related papers (2024-12-09T22:29:57Z)
- MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding
We show that speculative decoding can achieve speedup even in a high-throughput inference regime for moderate to long sequences. We propose a theoretical model to select the optimal drafting strategy for maximum speedup. For moderate to long sequences, we demonstrate up to 2.51x speedup for Llama3.1-8B when serving batch sizes ranging from 32 to 256.
arXiv Detail & Related papers (2024-08-20T17:57:31Z)
- Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion
Speculative decoding has emerged as a widely adopted method to accelerate large language model inference. This paper proposes an adaptation of speculative decoding which uses discrete diffusion models to generate draft sequences.
arXiv Detail & Related papers (2024-08-10T21:24:25Z)
- TurboSpec: Closed-loop Speculation Control System for Optimizing LLM Serving Goodput
Large Language Model (LLM) serving systems batch concurrent user requests to achieve efficient serving. We present TurboSpec, a speculation control system that automatically profiles the execution environment. We demonstrate its effectiveness across diverse workloads and hardware configurations.
arXiv Detail & Related papers (2024-06-20T07:43:33Z)
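Several entries above (AdaSD, CDSL, MagicDec, TurboSpec, and Nightjar itself) build on the same draft-and-verify primitive. As a reference point, here is a minimal sketch of one such step in its simplest greedy-matching form; the published methods use rejection sampling and richer acceptance rules, and `draft_next_fn` / `target_logits_fn` are stand-in callables, not any library's API.

```python
import numpy as np

def speculative_step(target_logits_fn, draft_next_fn, prefix, k=4):
    """One draft-and-verify step (toy greedy-matching variant).
    draft_next_fn(seq) -> next token id from the small draft model.
    target_logits_fn(seq) -> (len(seq), vocab) logits from the target,
    so all k verification argmaxes come from a single parallel pass.
    Assumes a nonempty prefix."""
    drafts = []
    seq = list(prefix)
    for _ in range(k):                        # cheap, sequential drafting
        t = draft_next_fn(seq)
        drafts.append(t)
        seq.append(t)
    logits = target_logits_fn(seq)            # one parallel target pass
    accepted = []
    for i, t in enumerate(drafts):
        # target's prediction for the slot that draft token i occupies
        pred = int(np.argmax(logits[len(prefix) + i - 1]))
        if pred != t:
            accepted.append(pred)             # take the target's token...
            return accepted                   # ...and stop at the mismatch
        accepted.append(t)
    # all k drafts accepted: bonus token from the target's last position
    accepted.append(int(np.argmax(logits[-1])))
    return accepted
```

The trade-off Nightjar and TurboSpec manage is visible here: a larger `k` amortizes more target passes when drafts are accepted, but at high batch sizes the wasted verification compute on rejected drafts dominates, which is why adaptively shrinking `k` to 0 can win.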
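The diffusion-LLM entries (Fast-dLLM, dParallel, WINO, Saber) revolve around a different primitive: deciding, at each denoising step, which masked positions are safe to commit in parallel. Below is a minimal sketch of the confidence-thresholded commit rule that Fast-dLLM's summary describes, with an assumed `model(ids)` callable returning per-position probabilities and a made-up `MASK` sentinel.

```python
import numpy as np

MASK = -1  # illustrative sentinel id for a still-masked position

def confidence_parallel_decode(model, ids, threshold=0.9, max_steps=64):
    """Commit, at each step, every masked position whose top token
    probability clears `threshold`; always commit at least the single
    most confident position so decoding is guaranteed to terminate.
    model(ids) is assumed to return probabilities of shape
    (seq_len, vocab_size) -- a stand-in, not a real API."""
    ids = np.array(ids)
    for _ in range(max_steps):
        masked = np.where(ids == MASK)[0]
        if masked.size == 0:
            break
        probs = model(ids)                    # (seq_len, vocab)
        conf = probs[masked].max(axis=1)      # top-1 prob per masked slot
        top = probs[masked].argmax(axis=1)    # candidate token per slot
        ready = conf >= threshold
        if not ready.any():
            ready[conf.argmax()] = True       # forced progress
        ids[masked[ready]] = top[ready]
    return ids
```

Raising `threshold` trades speed for quality: fewer positions commit per step, but the chance of violating inter-token dependencies drops, which is the quality-speed trade-off the WINO and dParallel entries describe.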