TriSpec: Ternary Speculative Decoding via Lightweight Proxy Verification
- URL: http://arxiv.org/abs/2601.23180v1
- Date: Fri, 30 Jan 2026 17:04:18 GMT
- Title: TriSpec: Ternary Speculative Decoding via Lightweight Proxy Verification
- Authors: Haoyun Jiang, Junqi He, Feng Hong, Xinlong Yang, Jianwei Zhang, Zheng Li, Zhengyang Zhuge, Zhiyong Chen, Bo Han, Junyang Lin, Jiangchao Yao
- Abstract summary: Speculative decoding offers significant speed-ups through its lightweight drafting and parallel verification mechanism. We propose TriSpec, a novel ternary SD framework that introduces a lightweight proxy to significantly reduce computational cost. Experiments on the Qwen3 and DeepSeek-R1-Distill-Qwen/LLaMA families show that TriSpec achieves up to 35% speedup over standard SD.
- Score: 63.65902785448346
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inference efficiency in Large Language Models (LLMs) is fundamentally limited by their serial, autoregressive generation, especially as reasoning becomes a key capability and response sequences grow longer. Speculative decoding (SD) offers a powerful solution, providing significant speed-ups through its lightweight drafting and parallel verification mechanism. While existing work has nearly saturated improvements in draft effectiveness and efficiency, this paper advances SD from a new yet critical perspective: the verification cost. We propose TriSpec, a novel ternary SD framework that, at its core, introduces a lightweight proxy to significantly reduce computational cost by approving easily verifiable draft sequences and engaging the full target model only when encountering uncertain tokens. TriSpec can be integrated with state-of-the-art SD methods like EAGLE-3 to further reduce verification costs, achieving greater acceleration. Extensive experiments on the Qwen3 and DeepSeek-R1-Distill-Qwen/LLaMA families show that TriSpec achieves up to 35% speedup over standard SD, with up to 50% fewer target model invocations while maintaining comparable accuracy.
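The ternary verification loop described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the confidence threshold `tau`, the `proxy_probs` interface, and the `target_fn` callback are all assumed names introduced here for clarity.

```python
def ternary_verify(draft_tokens, proxy_probs, target_fn, tau=0.9):
    """Sketch of ternary speculative verification: a lightweight proxy
    approves high-confidence draft tokens outright, and the expensive
    target model is invoked only for tokens the proxy is unsure about.

    draft_tokens : list[int]   -- tokens proposed by the draft model
    proxy_probs  : list[float] -- proxy confidence for each draft token
    target_fn    : callable    -- full target-model check (expensive)
    tau          : float       -- proxy-confidence threshold (assumed knob)
    """
    accepted = []
    target_calls = 0
    for tok, conf in zip(draft_tokens, proxy_probs):
        if conf >= tau:
            # Proxy is confident: approve without touching the target model.
            accepted.append(tok)
        else:
            # Uncertain token: escalate to the full target model.
            target_calls += 1
            if target_fn(tok):
                accepted.append(tok)
            else:
                # A rejection ends this speculative run, as in standard SD.
                break
    return accepted, target_calls
```

Under this scheme, the count of `target_calls` (rather than one target pass per draft window) is what drives the reported reduction in target-model invocations.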
Related papers
- Quasar: Quantized Self-Speculative Acceleration for Rapid Inference via Memory-Efficient Verification [11.585310190276923]
Quasar (Quantized Self-speculative Acceleration for Rapid Inference) is a training-free framework designed to overcome this "memory wall".
arXiv Detail & Related papers (2026-03-02T03:02:25Z) - Overcoming Joint Intractability with Lossless Hierarchical Speculative Decoding [58.92526489742584]
We propose a provably lossless verification method that significantly boosts the expected number of accepted tokens. We show that HSD yields consistent improvements in acceptance rates across diverse model families and benchmarks.
arXiv Detail & Related papers (2026-01-09T11:10:29Z) - Double: Breaking the Acceleration Limit via Double Retrieval Speculative Parallelism [19.7914286780195]
We introduce Double (Double Retrieval Speculative Parallelism), which enables the draft model to execute iterative retrieval speculations to break the theoretical speedup limits. Experiments demonstrate state-of-the-art speedups of 5.3× on LLaMA3.3-70B and 2.8× on Qwen3-32B.
arXiv Detail & Related papers (2026-01-09T04:35:21Z) - VVS: Accelerating Speculative Decoding for Visual Autoregressive Generation via Partial Verification Skipping [52.58270801983525]
Speculative decoding (SD) has been proven effective for accelerating visual AR models. We propose VVS, a novel framework to accelerate visual AR generation via partial verification skipping.
arXiv Detail & Related papers (2025-11-17T16:50:58Z) - When, What, and How: Rethinking Retrieval-Enhanced Speculative Decoding [29.402164743559]
ReSpec is a novel framework that transforms drafter switching into adaptive decision-making. Experiments on Spec-Bench demonstrate that ReSpec achieves state-of-the-art acceleration of over 33% and 25%, respectively.
arXiv Detail & Related papers (2025-11-03T06:57:16Z) - Speculative Verification: Exploiting Information Gain to Refine Speculative Decoding [8.36763119650407]
Speculative Verification dynamically predicts speculation accuracy and adapts the verification length to maximize throughput. It improves SD performance by up to 2×, with an average speedup of 1.4× in large-batch settings.
arXiv Detail & Related papers (2025-09-29T06:25:54Z) - Consultant Decoding: Yet Another Synergistic Mechanism [49.996656694586164]
Consultant Decoding (CD) verifies candidate drafts using token-level likelihoods computed solely by the large language model. CD achieves up to a 2.5-fold increase in inference speed compared to the target model, while maintaining comparable generation quality.
arXiv Detail & Related papers (2025-06-03T03:13:27Z) - Efficient Inference for Large Language Model-based Generative Recommendation [78.38878421030522]
Large Language Model (LLM)-based generative recommendation has achieved notable success, yet its practical deployment is costly. Applying Speculative Decoding (SD) to generative recommendation presents unique challenges due to the requirement of generating top-K items. We propose an alignment framework named AtSpeed, which presents the AtSpeed-S optimization objective for top-K alignment under strict top-K verification.
arXiv Detail & Related papers (2024-10-07T16:23:36Z) - Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting [68.90949377014742]
Speculative RAG is a framework that leverages a larger generalist LM to efficiently verify multiple RAG drafts produced in parallel by a smaller, distilled specialist LM. Our method accelerates RAG by delegating drafting to the smaller specialist LM, with the larger generalist LM performing a single verification pass over the drafts. It notably enhances accuracy by up to 12.97% while reducing latency by 50.83% compared to conventional RAG systems on PubHealth.
arXiv Detail & Related papers (2024-07-11T06:50:19Z) - Lite-FPN for Keypoint-based Monocular 3D Object Detection [18.03406686769539]
Keypoint-based monocular 3D object detection has made tremendous progress and achieved a great speed-accuracy trade-off.
We propose a sort of lightweight feature pyramid network called Lite-FPN to achieve multi-scale feature fusion.
Our proposed method achieves significantly higher accuracy and frame rate at the same time.
arXiv Detail & Related papers (2021-05-01T14:44:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.