Overcoming Joint Intractability with Lossless Hierarchical Speculative Decoding
- URL: http://arxiv.org/abs/2601.05724v1
- Date: Fri, 09 Jan 2026 11:10:29 GMT
- Title: Overcoming Joint Intractability with Lossless Hierarchical Speculative Decoding
- Authors: Yuxuan Zhou, Fei Huang, Heng Li, Fengyi Wu, Tianyu Wang, Jianwei Zhang, Junyang Lin, Zhi-Qi Cheng
- Abstract summary: We propose a provably lossless verification method that significantly boosts the expected number of accepted tokens. We show that HSD yields consistent improvements in acceptance rates across diverse model families and benchmarks.
- Score: 58.92526489742584
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Verification is a key bottleneck in improving inference speed while maintaining distribution fidelity in Speculative Decoding. Recent work has shown that sequence-level verification leads to a higher number of accepted tokens compared to token-wise verification. However, existing solutions often rely on surrogate approximations or are constrained by partial information, struggling with joint intractability. In this work, we propose Hierarchical Speculative Decoding (HSD), a provably lossless verification method that significantly boosts the expected number of accepted tokens and overcomes joint intractability by balancing excess and deficient probability mass across accessible branches. Our extensive large-scale experiments demonstrate that HSD yields consistent improvements in acceptance rates across diverse model families and benchmarks. Moreover, its strong explainability and generality make it readily integrable into a wide range of speculative decoding frameworks. Notably, integrating HSD into EAGLE-3 yields over a 12% performance gain, establishing state-of-the-art decoding efficiency without compromising distribution fidelity. Code is available at https://github.com/ZhouYuxuanYX/Hierarchical-Speculative-Decoding.
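The abstract does not spell out HSD's hierarchical verification itself, but it builds on the standard lossless token-wise verification rule common to speculative decoding: accept a drafted token with probability min(1, p(x)/q(x)) under the target distribution p and draft distribution q, and on rejection resample from the normalized residual max(p − q, 0). As a point of reference for what "lossless" means here, the following is a minimal sketch of that baseline rule; the toy vocabulary, distributions, and function names are illustrative and not taken from the paper.

```python
import random

def verify_token(token, p_target, q_draft, vocab):
    """Baseline token-wise lossless verification: accept a drafted token
    with probability min(1, p/q); on rejection, resample from the
    normalized residual distribution max(p - q, 0)."""
    p, q = p_target[token], q_draft[token]
    if random.random() < min(1.0, p / q):
        return token  # accepted draft token
    # Rejected: sample from the residual max(p - q, 0), renormalized.
    residual = {x: max(p_target[x] - q_draft[x], 0.0) for x in vocab}
    z = sum(residual.values())
    r = random.random() * z
    for x in vocab:
        r -= residual[x]
        if r <= 0:
            return x
    return vocab[-1]  # numerical-edge fallback

# Toy 3-token vocabulary with mismatched draft (q) and target (p).
vocab = ["a", "b", "c"]
p_target = {"a": 0.6, "b": 0.3, "c": 0.1}
q_draft  = {"a": 0.3, "b": 0.5, "c": 0.2}

random.seed(0)
counts = {x: 0 for x in vocab}
n = 100_000
for _ in range(n):
    drafted = random.choices(vocab, weights=[q_draft[x] for x in vocab])[0]
    counts[verify_token(drafted, p_target, q_draft, vocab)] += 1

# Empirical output frequencies converge to p_target, which is what
# "lossless" (distribution fidelity) means for a verification rule.
print({x: round(counts[x] / n, 2) for x in vocab})
```

The accept/resample rule guarantees the emitted token is an exact sample from p regardless of how poor the draft q is; HSD's contribution, per the abstract, is extending this guarantee to sequence-level verification while redistributing excess and deficient probability mass across accessible branches.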
Related papers
- CVeDRL: An Efficient Code Verifier via Difficulty-aware Reinforcement Learning [57.24524263804788]
Code verifiers play a critical role in post-verification for LLM-based code generation. Existing supervised fine-tuning methods suffer from data scarcity, high failure rates, and poor inference efficiency. We show that naive RL with only functionality rewards fails to generate effective unit tests for difficult branches and samples.
arXiv Detail & Related papers (2026-01-30T10:33:29Z)
- Multi-Scale Local Speculative Decoding for Image Generation [10.239314110594249]
We introduce Multi-Scale Local Speculative Decoding (MuLo-SD). MuLo-SD combines multi-resolution drafting with spatially informed verification to accelerate AR image generation. We demonstrate that MuLo-SD achieves substantial speedups of up to $\mathbf{1.7\times}$.
arXiv Detail & Related papers (2026-01-08T17:39:35Z) - Accelerate Speculative Decoding with Sparse Computation in Verification [49.74839681322316]
Speculative decoding accelerates autoregressive language model inference by verifying multiple draft tokens in parallel.<n>Existing sparsification methods are designed primarily for standard token-by-token autoregressive decoding.<n>We propose a sparse verification framework that jointly sparsifies attention, FFN, and MoE components during the verification stage to reduce the dominant computation cost.
arXiv Detail & Related papers (2025-12-26T07:53:41Z) - Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning [53.45095336430027]
We develop a unified framework that combines implicit retrieval and structured collaboration.<n>On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy.<n>Results on SuperGPQA and TRQA confirm robustness across domains.
arXiv Detail & Related papers (2025-09-25T14:05:55Z) - SIM-CoT: Supervised Implicit Chain-of-Thought [108.30049193668083]
Implicit Chain-of-Thought (CoT) methods offer a token-efficient alternative to explicit CoT reasoning in Large Language Models.<n>We identify a core latent instability issue when scaling the computational budget of implicit CoT.<n>We propose SIM-CoT, a plug-and-play training module that introduces step-level supervision to stabilize and enrich the latent reasoning space.
arXiv Detail & Related papers (2025-09-24T17:01:32Z) - Uncertainty-Masked Bernoulli Diffusion for Camouflaged Object Detection Refinement [24.522233459116354]
Camouflaged Object Detection (COD) presents inherent challenges due to subtle visual differences between targets and their backgrounds.<n>We propose the Uncertainty-Masked Bernoulli Diffusion (UMBD) model, the first generative refinement framework specifically designed for COD.<n>UMBD introduces an uncertainty-guided masking mechanism that selectively applies Bernoulli diffusion to residual regions with poor segmentation quality.
arXiv Detail & Related papers (2025-06-12T14:02:18Z) - Towards Better Code Generation: Adaptive Decoding with Uncertainty Guidance [42.737012213197865]
AdaDec is an adaptive decoding framework that employs a lookahead-based, uncertainty-aware pause-and-rerank mechanism.<n>AdaDec achieves up to 20.9% absolute gains in Pass@1 accuracy compared with greedy decoding.<n>By applying reranking only when necessary, AdaDec reduces computational overhead and latency, enhancing efficiency alongside reliability.
arXiv Detail & Related papers (2025-06-10T16:49:46Z)
- Query Encoder Distillation via Embedding Alignment is a Strong Baseline Method to Boost Dense Retriever Online Efficiency [4.254906060165999]
We show that even a 2-layer, BERT-based query encoder can still retain 92.5% of the full DE performance on the BEIR benchmark.
We hope that our findings will encourage the community to re-evaluate the trade-offs between method complexity and performance improvements.
arXiv Detail & Related papers (2023-06-05T06:53:55Z)
- Pairwise Supervised Hashing with Bernoulli Variational Auto-Encoder and Self-Control Gradient Estimator [62.26981903551382]
Variational auto-encoders (VAEs) with binary latent variables provide state-of-the-art performance in terms of precision for document retrieval.
We propose a pairwise loss function with discrete latent VAE to reward within-class similarity and between-class dissimilarity for supervised hashing.
This new semantic hashing framework achieves superior performance compared to the state of the art.
arXiv Detail & Related papers (2020-05-21T06:11:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.