S$^4$C: Speculative Sampling with Syntactic and Semantic Coherence for Efficient Inference of Large Language Models
- URL: http://arxiv.org/abs/2506.14158v1
- Date: Tue, 17 Jun 2025 03:38:19 GMT
- Title: S$^4$C: Speculative Sampling with Syntactic and Semantic Coherence for Efficient Inference of Large Language Models
- Authors: Tao He, Guang Huang, Yu Yang, Tianshi Xu, Sicheng Zhao, Guiguang Ding, Pengyang Wang, Feng Tian
- Abstract summary: Large language models (LLMs) exhibit remarkable reasoning capabilities across diverse downstream tasks. However, their autoregressive nature leads to substantial inference latency, posing challenges for real-time applications. We propose a Speculative Sampling with Syntactic and Semantic Coherence framework, which extends speculative sampling by leveraging multi-head drafting.
- Score: 38.784951111677856
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) exhibit remarkable reasoning capabilities across diverse downstream tasks. However, their autoregressive nature leads to substantial inference latency, posing challenges for real-time applications. Speculative sampling mitigates this issue by introducing a drafting phase followed by a parallel validation phase, enabling faster token generation and verification. Existing approaches, however, overlook the inherent coherence in text generation, limiting their efficiency. To address this gap, we propose a Speculative Sampling with Syntactic and Semantic Coherence (S$^4$C) framework, which extends speculative sampling by leveraging multi-head drafting for rapid token generation and a continuous verification tree for efficient candidate validation and feature reuse. Experimental results demonstrate that S$^4$C surpasses baseline methods across mainstream tasks, offering enhanced efficiency, parallelism, and the ability to generate more valid tokens with fewer computational resources. On Spec-bench benchmarks, S$^4$C achieves an acceleration ratio of 2.26x-2.60x, outperforming state-of-the-art methods.
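For context on the draft-then-verify paradigm the abstract describes, below is a minimal sketch of a generic speculative-sampling step: a cheap draft model proposes a few tokens, the target model scores them, and a rejection-sampling correction keeps the output faithful to the target distribution. The toy `draft_model`/`target_model` functions and all names here are illustrative assumptions, not S$^4$C's multi-head drafter or its continuous verification tree.

```python
# Minimal sketch of a generic speculative-sampling step (draft, then verify).
# The toy "models" below are hypothetical stand-ins over a tiny vocabulary;
# they are NOT the S^4C method, only the baseline loop it extends.
import numpy as np

VOCAB = 8  # toy vocabulary size


def draft_model(prefix):
    """Cheap proposal distribution over the next token (toy stand-in)."""
    rng = np.random.default_rng(len(prefix))  # deterministic toy logits
    logits = rng.normal(size=VOCAB)
    return np.exp(logits) / np.exp(logits).sum()


def target_model(prefix):
    """Expensive target distribution over the next token (toy stand-in)."""
    rng = np.random.default_rng(len(prefix) + 1234)
    logits = rng.normal(size=VOCAB)
    return np.exp(logits) / np.exp(logits).sum()


def speculative_step(prefix, k=4, rng=np.random.default_rng(0)):
    """One draft-then-verify step; returns the tokens appended to `prefix`."""
    # 1) Drafting phase: propose k tokens autoregressively with the cheap model.
    drafted, q_probs = [], []
    ctx = list(prefix)
    for _ in range(k):
        q = draft_model(ctx)
        tok = rng.choice(VOCAB, p=q)
        drafted.append(tok)
        q_probs.append(q)
        ctx.append(tok)

    # 2) Verification phase: score every drafted position with the target model
    #    (in a real system this is a single parallel forward pass).
    p_probs = [target_model(list(prefix) + drafted[:i]) for i in range(k + 1)]

    # 3) Accept/reject each draft so outputs match the target distribution.
    accepted = []
    for i, tok in enumerate(drafted):
        p, q = p_probs[i][tok], q_probs[i][tok]
        if rng.random() < min(1.0, p / q):
            accepted.append(tok)
        else:
            # On rejection, resample from the normalized residual and stop.
            residual = np.maximum(p_probs[i] - q_probs[i], 0.0)
            residual /= residual.sum()
            accepted.append(rng.choice(VOCAB, p=residual))
            return accepted
    # All drafts accepted: take one bonus token from the target distribution.
    accepted.append(rng.choice(VOCAB, p=p_probs[k]))
    return accepted


print(speculative_step([1, 2, 3]))
```

Per the abstract, S$^4$C replaces the sequential single-token drafter with multi-head drafting and organizes candidates in a continuous verification tree for parallel validation and feature reuse, while the verification step still keeps generation consistent with the target model.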
Related papers
- Sample Complexity and Representation Ability of Test-time Scaling Paradigms [91.34339030453425]
Test-time scaling paradigms have advanced the capabilities of large language models (LLMs) on complex tasks. We study the sample efficiency of various test-time strategies, such as self-consistency, best-of-$n$, and self-correction. A single Transformer architecture can provably solve multiple tasks without prior knowledge of the specific task associated with a user query.
arXiv Detail & Related papers (2025-06-05T17:48:19Z) - Accelerated Test-Time Scaling with Model-Free Speculative Sampling [58.69141724095398]
We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach. We show that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding. As a model-free approach, STAND can be applied to any existing language model without additional training.
arXiv Detail & Related papers (2025-06-05T07:31:18Z) - Fractured Chain-of-Thought Reasoning [61.647243580650446]
We introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling. We show that Fractured Sampling consistently achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling gains in Pass@k versus token budget.
arXiv Detail & Related papers (2025-05-19T11:30:41Z) - Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
Evaluating a constraint on every token can be prohibitively expensive. LCD can distort the global distribution over strings, sampling tokens based only on local information. We show that our approach is superior to state-of-the-art baselines.
arXiv Detail & Related papers (2025-04-07T18:30:18Z) - Collaborative Speculative Inference for Efficient LLM Inference Serving [22.3288901160418]
CoSine is a novel speculative inference system that decouples sequential speculative decoding from parallel verification. With equivalent resource costs, CoSine achieves up to a 23.2% decrease in latency and a 32.5% increase in throughput compared to baseline methods.
arXiv Detail & Related papers (2025-03-13T13:03:38Z) - COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks.
We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome these challenges.
Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z) - Expediting and Elevating Large Language Model Reasoning via Hidden Chain-of-Thought Decoding [14.175444025026508]
Large language models (LLMs) have demonstrated remarkable capabilities in tasks requiring chain-of-thought (CoT) prompting.
However, generating the full CoT process results in significantly longer output sequences, leading to increased computational costs and latency during inference.
We propose a novel approach to compress the CoT process through semantic alignment, enabling more efficient decoding while preserving the benefits of CoT reasoning.
arXiv Detail & Related papers (2024-09-13T06:29:20Z) - Path-Consistency: Prefix Enhancement for Efficient Inference in LLM [3.309813585671485]
Path-consistency mitigates both the errors and redundancies from random or less useful sampling in self-consistency. Path-consistency achieves significant acceleration in inference latency, ranging from 7.8% to 40.5%.
arXiv Detail & Related papers (2024-08-25T01:45:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.