TriSpec: Ternary Speculative Decoding via Lightweight Proxy Verification
- URL: http://arxiv.org/abs/2601.23180v1
- Date: Fri, 30 Jan 2026 17:04:18 GMT
- Title: TriSpec: Ternary Speculative Decoding via Lightweight Proxy Verification
- Authors: Haoyun Jiang, Junqi He, Feng Hong, Xinlong Yang, Jianwei Zhang, Zheng Li, Zhengyang Zhuge, Zhiyong Chen, Bo Han, Junyang Lin, Jiangchao Yao
- Abstract summary: Speculative decoding offers significant speed-ups through its lightweight drafting and parallel verification mechanism. We propose TriSpec, a novel ternary SD framework that introduces a lightweight proxy to significantly reduce computational cost. Experiments on the Qwen3 and DeepSeek-R1-Distill-Qwen/LLaMA families show that TriSpec achieves up to 35% speedup over standard SD.
- Score: 63.65902785448346
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inference efficiency in Large Language Models (LLMs) is fundamentally limited by their serial, autoregressive generation, especially as reasoning becomes a key capability and response sequences grow longer. Speculative decoding (SD) offers a powerful solution, providing significant speed-ups through its lightweight drafting and parallel verification mechanism. While existing work has nearly saturated improvements in draft effectiveness and efficiency, this paper advances SD from a new yet critical perspective: the verification cost. We propose TriSpec, a novel ternary SD framework that, at its core, introduces a lightweight proxy to significantly reduce computational cost by approving easily verifiable draft sequences and engaging the full target model only when encountering uncertain tokens. TriSpec can be integrated with state-of-the-art SD methods like EAGLE-3 to further reduce verification costs, achieving greater acceleration. Extensive experiments on the Qwen3 and DeepSeek-R1-Distill-Qwen/LLaMA families show that TriSpec achieves up to 35% speedup over standard SD, with up to 50% fewer target model invocations while maintaining comparable accuracy.
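The ternary verification loop described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the confidence threshold `tau`, the `proxy_probs` interface, and the `target_fn` callback are all assumed names introduced here for clarity.

```python
def ternary_verify(draft_tokens, proxy_probs, target_fn, tau=0.9):
    """Sketch of ternary speculative verification: a lightweight proxy
    approves high-confidence draft tokens outright, and the expensive
    target model is invoked only for tokens the proxy is unsure about.

    draft_tokens : list[int]   -- tokens proposed by the draft model
    proxy_probs  : list[float] -- proxy confidence for each draft token
    target_fn    : callable    -- full target-model check (expensive)
    tau          : float       -- proxy-confidence threshold (assumed knob)
    """
    accepted = []
    target_calls = 0
    for tok, conf in zip(draft_tokens, proxy_probs):
        if conf >= tau:
            # Proxy is confident: approve without touching the target model.
            accepted.append(tok)
        else:
            # Uncertain token: escalate to the full target model.
            target_calls += 1
            if target_fn(tok):
                accepted.append(tok)
            else:
                # A rejection ends this speculative run, as in standard SD.
                break
    return accepted, target_calls
```

Under this scheme, the count of `target_calls` (rather than one target pass per draft window) is what drives the reported reduction in target-model invocations.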
Related papers
- Quasar: Quantized Self-Speculative Acceleration for Rapid Inference via Memory-Efficient Verification [11.585310190276923]
Quasar (Quantized Self-speculative Acceleration for Rapid Inference) is a training-free framework designed to overcome this "memory wall".
arXiv Detail & Related papers (2026-03-02T03:02:25Z) - Overcoming Joint Intractability with Lossless Hierarchical Speculative Decoding [58.92526489742584]
We propose a provably lossless verification method that significantly boosts the expected number of accepted tokens. We show that HSD yields consistent improvements in acceptance rates across diverse model families and benchmarks.
arXiv Detail & Related papers (2026-01-09T11:10:29Z) - Double: Breaking the Acceleration Limit via Double Retrieval Speculative Parallelism [19.7914286780195]
We introduce Double (Double Retrieval Speculative Parallelism), which enables the draft model to execute iterative retrieval speculations to break the theoretical speedup limits. Experiments demonstrate state-of-the-art speedups of 5.3× on LLaMA3.3-70B and 2.8× on Qwen3-32B.
arXiv Detail & Related papers (2026-01-09T04:35:21Z) - VVS: Accelerating Speculative Decoding for Visual Autoregressive Generation via Partial Verification Skipping [52.58270801983525]
Speculative decoding (SD) has been proven effective for accelerating visual AR models. We propose VVS, a novel framework to accelerate visual AR generation via partial verification skipping.
arXiv Detail & Related papers (2025-11-17T16:50:58Z) - When, What, and How: Rethinking Retrieval-Enhanced Speculative Decoding [29.402164743559]
ReSpec is a novel framework that transforms drafter switching into adaptive decision-making. Experiments on Spec-Bench demonstrate that ReSpec achieves state-of-the-art acceleration of over 33% and 25%, respectively.
arXiv Detail & Related papers (2025-11-03T06:57:16Z) - Speculative Verification: Exploiting Information Gain to Refine Speculative Decoding [8.36763119650407]
Speculative Verification dynamically predicts speculation accuracy and adapts the verification length to maximize throughput. It improves SD performance by up to 2×, with an average speedup of 1.4× in large-batch settings.
arXiv Detail & Related papers (2025-09-29T06:25:54Z) - Consultant Decoding: Yet Another Synergistic Mechanism [49.996656694586164]
Consultant Decoding (CD) verifies candidate drafts using token-level likelihoods computed solely by the large language model. CD achieves up to a 2.5-fold increase in inference speed compared to the target model, while maintaining comparable generation quality.
arXiv Detail & Related papers (2025-06-03T03:13:27Z) - Efficient Inference for Large Language Model-based Generative Recommendation [78.38878421030522]
Large Language Model (LLM)-based generative recommendation has achieved notable success, yet its practical deployment is costly. Applying Speculative Decoding (SD) to generative recommendation presents unique challenges due to the requirement of generating top-K items. We propose an alignment framework named AtSpeed, which presents the AtSpeed-S optimization objective for top-K alignment under strict top-K verification.
arXiv Detail & Related papers (2024-10-07T16:23:36Z) - Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting [68.90949377014742]
Speculative RAG is a framework that leverages a larger generalist LM to efficiently verify multiple RAG drafts produced in parallel by a smaller, distilled specialist LM. Our method accelerates RAG by delegating drafting to the smaller specialist LM, with the larger generalist LM performing a single verification pass over the drafts. It notably enhances accuracy by up to 12.97% while reducing latency by 50.83% compared to conventional RAG systems on PubHealth.
arXiv Detail & Related papers (2024-07-11T06:50:19Z) - Lite-FPN for Keypoint-based Monocular 3D Object Detection [18.03406686769539]
Keypoint-based monocular 3D object detection has made tremendous progress and achieved a great speed-accuracy trade-off.
We propose a sort of lightweight feature pyramid network called Lite-FPN to achieve multi-scale feature fusion.
Our proposed method achieves significantly higher accuracy and frame rate at the same time.
arXiv Detail & Related papers (2021-05-01T14:44:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.