ConfSpec: Efficient Step-Level Speculative Reasoning via Confidence-Gated Verification
- URL: http://arxiv.org/abs/2602.18447v1
- Date: Wed, 28 Jan 2026 05:58:05 GMT
- Title: ConfSpec: Efficient Step-Level Speculative Reasoning via Confidence-Gated Verification
- Authors: Siran Liu, Cyril Y. He
- Abstract summary: Step-level speculative reasoning aims to mitigate the high inference latency of Chain-of-Thought reasoning, yet existing approaches face a long-standing trade-off among accuracy, inference speed, and resource efficiency. We propose ConfSpec, a confidence-gated cascaded verification framework that resolves this trade-off.
- Score: 0.2578242050187029
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Chain-of-Thought reasoning significantly improves the performance of large language models on complex tasks, but incurs high inference latency due to long generation traces. Step-level speculative reasoning aims to mitigate this cost, yet existing approaches face a long-standing trade-off among accuracy, inference speed, and resource efficiency. We propose ConfSpec, a confidence-gated cascaded verification framework that resolves this trade-off. Our key insight is an asymmetry between generation and verification: while generating a correct reasoning step requires substantial model capacity, step-level verification is a constrained discriminative task for which small draft models are well-calibrated within their competence range, enabling high-confidence draft decisions to be accepted directly while selectively escalating uncertain cases to the large target model. Evaluation across diverse workloads shows that ConfSpec achieves up to 2.24$\times$ end-to-end speedups while matching target-model accuracy. Our method requires no external judge models and is orthogonal to token-level speculative decoding, enabling further multiplicative acceleration.
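The gating idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Verifier` interface, the function names, the `0.9` threshold, and the toy verifiers are all assumptions made for illustration, and the abstract does not say how confidence is computed or what happens after a step is rejected.

```python
from typing import Callable, Tuple

# Hypothetical interface: a verifier maps (context, step) to
# (accept, confidence), with confidence in [0, 1].
Verifier = Callable[[str, str], Tuple[bool, float]]


def gated_verify(context: str, step: str, draft_verify: Verifier,
                 target_verify: Verifier,
                 threshold: float = 0.9) -> Tuple[bool, str]:
    """Return (accept, who_decided) for one drafted reasoning step."""
    accept, confidence = draft_verify(context, step)
    if confidence >= threshold:
        # High-confidence draft verdict is used directly.
        return accept, "draft"
    # Uncertain case: escalate to the large target model.
    accept, _ = target_verify(context, step)
    return accept, "target"


# Toy stand-ins for real models (assumptions for the demo only).
def toy_draft_verify(context: str, step: str) -> Tuple[bool, float]:
    # Pretend the draft model is confident only on short steps.
    return ("error" not in step, 0.95 if len(step) < 20 else 0.5)


def toy_target_verify(context: str, step: str) -> Tuple[bool, float]:
    return ("error" not in step, 0.99)


if __name__ == "__main__":
    print(gated_verify("2+2=?", "2+2=4",
                       toy_draft_verify, toy_target_verify))
    print(gated_verify("2+2=?", "a long step with an error inside",
                       toy_draft_verify, toy_target_verify))
```

In this sketch the target model is called only when the draft verifier's confidence falls below the threshold, so raising the threshold trades speed for more target-model checks.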
Related papers
- Thinking by Subtraction: Confidence-Driven Contrastive Decoding for LLM Reasoning [58.331709210563616]
Thinking by Subtraction is a confidence-driven contrastive decoding approach. A small subset of low-confidence tokens disproportionately contributes to reasoning errors and unnecessary output expansion. Our method, Confidence-Driven Contrastive Decoding, detects low-confidence tokens during decoding and intervenes at these positions.
arXiv Detail & Related papers (2026-02-20T14:13:22Z)
- MARS: Unleashing the Power of Speculative Decoding via Margin-Aware Verification [7.935725883885573]
Speculative Decoding (SD) accelerates autoregressive large language model (LLM) inference by decoupling generation and verification. We propose Margin-Aware Speculative Verification, a training-free and domain-agnostic verification strategy that adapts to the target model's local decisiveness. Our method conditions verification on decision stability measured directly from the target logits and relaxes rejection only when strict verification provides minimal benefit.
arXiv Detail & Related papers (2026-01-21T22:03:06Z)
- Rubric-Conditioned LLM Grading: Alignment, Uncertainty, and Robustness [4.129847064263056]
We systematically evaluate the performance of Large Language Models for rubric-based short-answer grading. We find that alignment is strong for binary tasks but degrades with increased rubric granularity. Experiments reveal that while the model is resilient to prompt injection, it is sensitive to synonym substitutions.
arXiv Detail & Related papers (2025-12-21T05:22:04Z)
- Efficient Adaptive Rejection Sampling for Accelerating Speculative Decoding in Large Language Models [2.4065240342323384]
This paper introduces Efficient Adaptive Rejection Sampling (EARS). EARS dynamically adjusts the acceptance threshold by incorporating the target model's own predictive uncertainty, measured as 1 - max(P_target). It significantly enhances the efficiency of speculative decoding, achieving up to an 18.12% increase in throughput with a negligible 0.84% accuracy drop on the GSM8K benchmark.
arXiv Detail & Related papers (2025-12-15T11:08:56Z)
- Arbitrage: Efficient Reasoning via Advantage-Aware Speculation [71.45710345765528]
Speculative Decoding accelerates inference by employing a fast but inaccurate draft model to autoregressively propose tokens. But due to unnecessary rejections caused by token mismatches in semantically equivalent steps, traditional token-level Speculative Decoding struggles in reasoning tasks. We propose Arbitrage, a novel step-level speculative generation framework that routes generation dynamically based on the relative advantage between draft and target models.
arXiv Detail & Related papers (2025-12-04T17:50:53Z)
- Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads [104.9566359759396]
We propose a lightweight alternative for step-level reasoning verification based on data-driven uncertainty scores. Our findings suggest that the internal states of LLMs encode their uncertainty and can serve as reliable signals for reasoning verification.
arXiv Detail & Related papers (2025-11-09T03:38:29Z)
- Sample Smart, Not Hard: Correctness-First Decoding for Better Reasoning in LLMs [72.82403830490084]
We argue that the decoding rule should be calibrated by correctness, not confidence alone. We propose simple strategies that achieve this goal: Greedy-Threshold makes sampling greedy at very low confidence steps. Together, our findings challenge prevailings about decoding under uncertainty and show gains across math and general reasoning benchmarks.
arXiv Detail & Related papers (2025-10-07T14:46:12Z)
- TrustLoRA: Low-Rank Adaptation for Failure Detection under Out-of-distribution Data [62.22804234013273]
We propose a simple failure detection framework to unify and facilitate classification with rejection under both covariate and semantic shifts. Our key insight is that by separating and consolidating failure-specific reliability knowledge with low-rank adapters, we can enhance the failure detection ability effectively and flexibly.
arXiv Detail & Related papers (2025-04-20T09:20:55Z)
- Enhancing LLM Reliability via Explicit Knowledge Boundary Modeling [41.19330514054401]
Large language models (LLMs) are prone to hallucination stemming from misaligned self-awareness. We propose the Explicit Knowledge Boundary Modeling framework to integrate fast and slow reasoning systems to harmonize reliability and usability.
arXiv Detail & Related papers (2025-03-04T03:16:02Z)
- LoGU: Long-form Generation with Uncertainty Expressions [49.76417603761989]
We introduce the task of Long-form Generation with Uncertainty (LoGU). We identify two key challenges: Uncertainty Suppression and Uncertainty Misalignment. Our framework adopts a divide-and-conquer strategy, refining uncertainty based on atomic claims. Experiments on three long-form instruction following datasets show that our method significantly improves accuracy, reduces hallucinations, and maintains the comprehensiveness of responses.
arXiv Detail & Related papers (2024-10-18T09:15:35Z)