Overcoming Joint Intractability with Lossless Hierarchical Speculative Decoding
- URL: http://arxiv.org/abs/2601.05724v1
- Date: Fri, 09 Jan 2026 11:10:29 GMT
- Title: Overcoming Joint Intractability with Lossless Hierarchical Speculative Decoding
- Authors: Yuxuan Zhou, Fei Huang, Heng Li, Fengyi Wu, Tianyu Wang, Jianwei Zhang, Junyang Lin, Zhi-Qi Cheng
- Abstract summary: We propose a provably lossless verification method that significantly boosts the expected number of accepted tokens. We show that HSD yields consistent improvements in acceptance rates across diverse model families and benchmarks.
- Score: 58.92526489742584
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Verification is a key bottleneck in improving inference speed while maintaining distribution fidelity in Speculative Decoding. Recent work has shown that sequence-level verification leads to a higher number of accepted tokens compared to token-wise verification. However, existing solutions often rely on surrogate approximations or are constrained by partial information, struggling with joint intractability. In this work, we propose Hierarchical Speculative Decoding (HSD), a provably lossless verification method that significantly boosts the expected number of accepted tokens and overcomes joint intractability by balancing excess and deficient probability mass across accessible branches. Our extensive large-scale experiments demonstrate that HSD yields consistent improvements in acceptance rates across diverse model families and benchmarks. Moreover, its strong explainability and generality make it readily integrable into a wide range of speculative decoding frameworks. Notably, integrating HSD into EAGLE-3 yields over a 12% performance gain, establishing state-of-the-art decoding efficiency without compromising distribution fidelity. Code is available at https://github.com/ZhouYuxuanYX/Hierarchical-Speculative-Decoding.
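The abstract does not spell out HSD's hierarchical verification itself, but it builds on the standard lossless token-wise verification rule common to speculative decoding: accept a drafted token with probability min(1, p(x)/q(x)) under the target distribution p and draft distribution q, and on rejection resample from the normalized residual max(p − q, 0). As a point of reference for what "lossless" means here, the following is a minimal sketch of that baseline rule; the toy vocabulary, distributions, and function names are illustrative and not taken from the paper.

```python
import random

def verify_token(token, p_target, q_draft, vocab):
    """Baseline token-wise lossless verification: accept a drafted token
    with probability min(1, p/q); on rejection, resample from the
    normalized residual distribution max(p - q, 0)."""
    p, q = p_target[token], q_draft[token]
    if random.random() < min(1.0, p / q):
        return token  # accepted draft token
    # Rejected: sample from the residual max(p - q, 0), renormalized.
    residual = {x: max(p_target[x] - q_draft[x], 0.0) for x in vocab}
    z = sum(residual.values())
    r = random.random() * z
    for x in vocab:
        r -= residual[x]
        if r <= 0:
            return x
    return vocab[-1]  # numerical-edge fallback

# Toy 3-token vocabulary with mismatched draft (q) and target (p).
vocab = ["a", "b", "c"]
p_target = {"a": 0.6, "b": 0.3, "c": 0.1}
q_draft  = {"a": 0.3, "b": 0.5, "c": 0.2}

random.seed(0)
counts = {x: 0 for x in vocab}
n = 100_000
for _ in range(n):
    drafted = random.choices(vocab, weights=[q_draft[x] for x in vocab])[0]
    counts[verify_token(drafted, p_target, q_draft, vocab)] += 1

# Empirical output frequencies converge to p_target, which is what
# "lossless" (distribution fidelity) means for a verification rule.
print({x: round(counts[x] / n, 2) for x in vocab})
```

The accept/resample rule guarantees the emitted token is an exact sample from p regardless of how poor the draft q is; HSD's contribution, per the abstract, is extending this guarantee to sequence-level verification while redistributing excess and deficient probability mass across accessible branches.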
Related papers
- CVeDRL: An Efficient Code Verifier via Difficulty-aware Reinforcement Learning [57.24524263804788]
Code verifiers play a critical role in post-verification for LLM-based code generation. Existing supervised fine-tuning methods suffer from data scarcity, high failure rates, and poor inference efficiency. We show that naive RL with only functionality rewards fails to generate effective unit tests for difficult branches and samples.
arXiv Detail & Related papers (2026-01-30T10:33:29Z)
- Multi-Scale Local Speculative Decoding for Image Generation [10.239314110594249]
We introduce Multi-Scale Local Speculative Decoding (MuLo-SD). MuLo-SD combines multi-resolution drafting with spatially informed verification to accelerate AR image generation. We demonstrate that MuLo-SD achieves substantial speedups of up to $\mathbf{1.7\times}$.
arXiv Detail & Related papers (2026-01-08T17:39:35Z) - Accelerate Speculative Decoding with Sparse Computation in Verification [49.74839681322316]
Speculative decoding accelerates autoregressive language model inference by verifying multiple draft tokens in parallel.<n>Existing sparsification methods are designed primarily for standard token-by-token autoregressive decoding.<n>We propose a sparse verification framework that jointly sparsifies attention, FFN, and MoE components during the verification stage to reduce the dominant computation cost.
arXiv Detail & Related papers (2025-12-26T07:53:41Z) - Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning [53.45095336430027]
We develop a unified framework that combines implicit retrieval and structured collaboration.<n>On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy.<n>Results on SuperGPQA and TRQA confirm robustness across domains.
arXiv Detail & Related papers (2025-09-25T14:05:55Z) - SIM-CoT: Supervised Implicit Chain-of-Thought [108.30049193668083]
Implicit Chain-of-Thought (CoT) methods offer a token-efficient alternative to explicit CoT reasoning in Large Language Models.<n>We identify a core latent instability issue when scaling the computational budget of implicit CoT.<n>We propose SIM-CoT, a plug-and-play training module that introduces step-level supervision to stabilize and enrich the latent reasoning space.
arXiv Detail & Related papers (2025-09-24T17:01:32Z) - Uncertainty-Masked Bernoulli Diffusion for Camouflaged Object Detection Refinement [24.522233459116354]
Camouflaged Object Detection (COD) presents inherent challenges due to subtle visual differences between targets and their backgrounds.<n>We propose the Uncertainty-Masked Bernoulli Diffusion (UMBD) model, the first generative refinement framework specifically designed for COD.<n>UMBD introduces an uncertainty-guided masking mechanism that selectively applies Bernoulli diffusion to residual regions with poor segmentation quality.
arXiv Detail & Related papers (2025-06-12T14:02:18Z) - Towards Better Code Generation: Adaptive Decoding with Uncertainty Guidance [42.737012213197865]
AdaDec is an adaptive decoding framework that employs a lookahead-based, uncertainty-aware pause-and-rerank mechanism.<n>AdaDec achieves up to 20.9% absolute gains in Pass@1 accuracy compared with greedy decoding.<n>By applying reranking only when necessary, AdaDec reduces computational overhead and latency, enhancing efficiency alongside reliability.
arXiv Detail & Related papers (2025-06-10T16:49:46Z)
- Query Encoder Distillation via Embedding Alignment is a Strong Baseline Method to Boost Dense Retriever Online Efficiency [4.254906060165999]
We show that even a 2-layer, BERT-based query encoder can still retain 92.5% of the full DE performance on the BEIR benchmark.
We hope that our findings will encourage the community to re-evaluate the trade-offs between method complexity and performance improvements.
arXiv Detail & Related papers (2023-06-05T06:53:55Z)
- Pairwise Supervised Hashing with Bernoulli Variational Auto-Encoder and Self-Control Gradient Estimator [62.26981903551382]
Variational auto-encoders (VAEs) with binary latent variables provide state-of-the-art performance in terms of precision for document retrieval.
We propose a pairwise loss function with discrete latent VAE to reward within-class similarity and between-class dissimilarity for supervised hashing.
This new semantic hashing framework achieves superior performance compared to the state of the art.
arXiv Detail & Related papers (2020-05-21T06:11:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.