PRISM: Parametrically Refactoring Inference for Speculative Sampling Draft Models
- URL: http://arxiv.org/abs/2602.01762v1
- Date: Mon, 02 Feb 2026 07:46:03 GMT
- Title: PRISM: Parametrically Refactoring Inference for Speculative Sampling Draft Models
- Authors: Xuliang Wang, Yuetao Chen, Maochan Zhen, Fang Liu, Xinzhou Zheng, Xingwu Liu, Hong Xu, Ming Li,
- Abstract summary: Large Language Models (LLMs), constrained by their auto-regressive nature, suffer from slow decoding.<n>We propose PRISM, which disaggregates the computation of each predictive step across different parameter sets.<n>We show that PRISM boosts the decoding throughput of an already highly optimized inference engine by more than 2.6x.
- Score: 7.294980307522086
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs), constrained by their auto-regressive nature, suffer from slow decoding. Speculative decoding methods have emerged as a promising solution to accelerate LLM decoding, attracting attention from both systems and AI research communities. Recently, the pursuit of better draft quality has driven a trend toward parametrically larger draft models, which inevitably introduces substantial computational overhead. While existing work attempts to balance the trade-off between prediction accuracy and compute latency, we address this fundamental dilemma through architectural innovation. We propose PRISM, which disaggregates the computation of each predictive step across different parameter sets, refactoring the computational pathways of draft models to successfully decouple model capacity from inference cost. Through extensive experiments, we demonstrate that PRISM outperforms all existing draft architectures, achieving exceptional acceptance lengths while maintaining minimal draft latency for superior end-to-end speedup. We also re-examine scaling laws with PRISM, revealing that PRISM scales more effectively with expanding data volumes than other draft architectures. Through rigorous and fair comparison, we show that PRISM boosts the decoding throughput of an already highly optimized inference engine by more than 2.6x.
Related papers
- Towards Robust Scaling Laws for Optimizers [89.21160945066737]
Empirical scaling laws are widely used to predict loss as model size and training data grow.<n>We show that Chinchilla-style scaling laws emerge naturally as a result of loss decomposition into irreducible, approximation, and optimization errors.
arXiv Detail & Related papers (2026-02-07T21:40:33Z) - Scaling LLM Speculative Decoding: Non-Autoregressive Forecasting in Large-Batch Scenarios [76.85739138203014]
We present SpecFormer, a novel architecture that accelerates unidirectional and attention mechanisms.<n>We demonstrate that SpecFormer achieves lower training demands and reduced computational costs.
arXiv Detail & Related papers (2025-11-25T14:20:08Z) - Nemotron-Flash: Towards Latency-Optimal Hybrid Small Language Models [97.55009021098554]
This work aims to identify the key determinants of SLMs' real-device latency and offer generalizable principles and methodologies for SLM design and training.<n>We introduce a new family of hybrid SLMs, called Nemotron-Flash, which significantly advances the accuracy-efficiency frontier of state-of-the-art SLMs.
arXiv Detail & Related papers (2025-11-24T08:46:36Z) - Fast Inference via Hierarchical Speculative Decoding [65.40448210801763]
We introduce Hierarchical Speculative Decoding (HSD), an algorithm that stacks draft models into a hierarchy, where each model proposes tokens, and the next larger model verifies them in a single forward pass.<n>HSD gives up to 1.2x speed-up over the best single-draft baseline.
arXiv Detail & Related papers (2025-10-22T15:56:19Z) - READER: Retrieval-Assisted Drafter for Efficient LLM Inference [0.0386965802948046]
Autoregressive Language Models instantiate a factorized likelihood over token sequences, yet their strictly sequential decoding process imposes an intrinsic lower bound on latency inference.<n>This bottleneck has emerged as a central obstacle to the scalable deployment of large-scale generative models.<n>We present READER, a speculative decoding framework that bypasses the training of the auxiliary draft model.
arXiv Detail & Related papers (2025-08-12T16:47:48Z) - Hierarchical Verification of Speculative Beams for Accelerating LLM Inference [0.0]
Hierarchical Verification Tree (HVT) is a novel framework that restructures speculative beam decoding by prioritizing high-likelihood drafts.<n>HVT consistently outperforms existing speculative decoding schemes, achieving substantial reductions in inference time and energy consumption.<n>Findings highlight the potential of hierarchical verification strategies as a new direction for accelerating large language model inference.
arXiv Detail & Related papers (2025-07-30T02:58:03Z) - LARES: Latent Reasoning for Sequential Recommendation [96.26996622771593]
We present LARES, a novel and scalable LAtent REasoning framework for Sequential recommendation.<n>Our proposed approach employs a recurrent architecture that allows flexible expansion of reasoning depth without increasing parameter complexity.<n>Our framework exhibits seamless compatibility with existing advanced models, further improving their recommendation performance.
arXiv Detail & Related papers (2025-05-22T16:22:54Z) - Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute scaling framework that leverages increased inference-time instead of larger models.<n>Our framework incorporates two complementary strategies: internal TTC and external TTC.<n>We demonstrate our textbf32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z) - The Efficiency vs. Accuracy Trade-off: Optimizing RAG-Enhanced LLM Recommender Systems Using Multi-Head Early Exit [46.37267466656765]
This paper presents an optimization framework that combines Retrieval-Augmented Generation (RAG) with an innovative multi-head early exit architecture.<n>Our experiments demonstrate how this architecture effectively decreases time without sacrificing the accuracy needed for reliable recommendation delivery.
arXiv Detail & Related papers (2025-01-04T03:26:46Z) - Closer Look at Efficient Inference Methods: A Survey of Speculative Decoding [1.3479499607624648]
Speculative decoding addresses bottleneck by introducing a two-stage framework: drafting and verification.<n>A smaller, efficient model generates a preliminary draft, which is then refined by a larger, more sophisticated model.<n>This paper provides a comprehensive survey of speculative decoding methods, categorizing them into draft-centric and model-centric approaches.
arXiv Detail & Related papers (2024-11-20T09:46:30Z) - Graph-Structured Speculative Decoding [52.94367724136063]
Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models.
We introduce an innovative approach utilizing a directed acyclic graph (DAG) to manage the drafted hypotheses.
We observe a remarkable speedup of 1.73$times$ to 1.96$times$, significantly surpassing standard speculative decoding.
arXiv Detail & Related papers (2024-07-23T06:21:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.