Related papers: One-Token Verification for Reasoning Correctness Estimation

One-Token Verification for Reasoning Correctness Estimation

URL: http://arxiv.org/abs/2603.01025v1
Date: Sun, 01 Mar 2026 10:09:58 GMT
Title: One-Token Verification for Reasoning Correctness Estimation
Authors: Zhan Zhuang, Xiequn Wang, Zebin Chen, Feiyang Ye, Ying Wei, Kede Ma, Yu Zhang,
Abstract summary: One-Token Verification (OTV) is a computational method that estimates reasoning correctness in a single forward pass during generation.<n>OTV consistently surpasses existing verifiers and reduces token usage by up to $90%$ through correctness-guided early termination.
Score: 31.590898058475464
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent breakthroughs in large language models (LLMs) have led to notable successes in complex reasoning tasks, such as mathematical problem solving. A common strategy for improving performance is parallel thinking, in which multiple reasoning traces are generated and the final prediction is made using aggregation schemes like majority voting or best-of-$N$ decoding. However, two key challenges persist. First, multi-sample decoding incurs substantial inference latency, especially for long-form outputs. Second, effective mechanisms for reliably assessing the correctness of individual reasoning traces are still limited. To address these challenges, we introduce One-Token Verification (OTV), a computational method that estimates reasoning correctness in a single forward pass during generation. OTV is activated by a learnable token and integrated into the LLM via low-rank adaptation to probe internal reasoning signals through the key-value cache, supporting token-level correctness estimation at any stage of generation without disrupting primary reasoning. Experiments on mathematical reasoning benchmarks demonstrate that OTV consistently surpasses existing verifiers. Additionally, OTV reduces token usage by up to $90\%$ through correctness-guided early termination, prioritizing shorter, more reliable solutions.

Related papers

Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens [12.788799173865]
We quantify inference-time effort by identifying deep-thinking tokens.<n>Think@n is a test-time scaling strategy that prioritizes samples with high deep-thinking ratios.
arXiv Detail & Related papers (2026-02-13T23:07:37Z)
ENTRA: Entropy-Based Redundancy Avoidance in Large Language Model Reasoning [30.786062954495403]
Large Reasoning Models (LRMs) often suffer from overthinking, generating unnecessarily long reasoning chains even for simple tasks.<n>We propose ENTRA, an entropy-based training framework that suppresses redundant reasoning while preserving performance.
arXiv Detail & Related papers (2026-01-12T01:26:30Z)
Arbitrage: Efficient Reasoning via Advantage-Aware Speculation [71.45710345765528]
Speculative Decoding accelerates inference by employing a fast but inaccurate draft model to autoregressively propose tokens.<n>But due to unnecessary rejections caused by token mismatches in semantically equivalent steps, traditional token-level Speculative Decoding struggles in reasoning tasks.<n>We propose Arbitrage, a novel step-level speculative generation framework that routes generation dynamically based on the relative advantage between draft and target models.
arXiv Detail & Related papers (2025-12-04T17:50:53Z)
TokenSqueeze: Performance-Preserving Compression for Reasoning LLMs [57.217593337454026]
TokenSqueeze is a novel Long2Short method that condenses reasoning paths while preserving performance and relying exclusively on self-generated data.<n>We show that TokenSqueeze reduces token usage while maintaining accuracy on the MATH500 benchmark.
arXiv Detail & Related papers (2025-11-17T10:38:56Z)
In-Token Rationality Optimization: Towards Accurate and Concise LLM Reasoning via Self-Feedback [38.915062716409686]
InTRO is a new framework that enables both token-level exploration and self-feedback for accurate and concise reasoning.<n>InTRO consistently outperforms other baselines, raising solution accuracy by up to 20% relative to the base model.<n>Its chains of thought are notably more concise, exhibiting reduced verbosity.
arXiv Detail & Related papers (2025-11-13T01:47:06Z)
Pruning the Unsurprising: Efficient Code Reasoning via First-Token Surprisal [13.035073453917088]
Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in code reasoning by scaling up the length of Chain-of-Thought (CoT)<n>We propose ASAP (Anchor-guided, Surprisal-based Pruning), a novel coarse-to-fine framework for CoT compression.<n> ASAP achieves state-of-the-art accuracy across multiple code generation benchmarks while substantially reducing training and inference costs.
arXiv Detail & Related papers (2025-08-08T03:46:21Z)
Fractured Chain-of-Thought Reasoning [61.647243580650446]
We introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling.<n>We show that Fractured Sampling consistently achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling gains in Pass@k versus token budget.
arXiv Detail & Related papers (2025-05-19T11:30:41Z)
O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning [98.3430004984531]
We propose Length-Harmonizing Fine-Tuning (O1-Pruner) to minimize reasoning overhead while maintaining accuracy.<n>Our code is coming soon at https://github.com/StarDewXXX/O1-Pruner.
arXiv Detail & Related papers (2025-01-22T01:35:11Z)
Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning. LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors. We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z)
Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep. We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $times 3$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.