Sample Smart, Not Hard: Correctness-First Decoding for Better Reasoning in LLMs
- URL: http://arxiv.org/abs/2510.05987v1
- Date: Tue, 07 Oct 2025 14:46:12 GMT
- Title: Sample Smart, Not Hard: Correctness-First Decoding for Better Reasoning in LLMs
- Authors: Xueyan Li, Guinan Su, Mrinmaya Sachan, Jonas Geiping
- Abstract summary: We argue that the decoding rule should be calibrated by correctness, not confidence alone. We propose simple strategies that achieve this goal: Greedy-Threshold makes sampling greedy at very low confidence steps, while Calibrated-TopK and Calibrated-epsilon set truncation thresholds based on estimated rank-wise correctness. Together, our findings challenge prevailing heuristics about decoding under uncertainty and show gains across math and general reasoning benchmarks.
- Score: 72.82403830490084
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are increasingly applied to complex tasks that require extended reasoning. In such settings, models often benefit from diverse chains-of-thought to arrive at multiple candidate solutions. This requires two competing objectives: to inject enough stochasticity to explore multiple reasoning chains, and to ensure sufficient accuracy and quality in each path. Existing works pursue the first objective by increasing exploration at highly uncertain steps with higher temperature or larger candidate token sets, while others improve reliability by rejecting samples with low confidence post-generation, implying that low confidence correlates with low answer quality. These two lines of thought are in conflict, as they conflate different sources of uncertainty. To resolve this, we argue that the decoding rule should be calibrated by correctness, not confidence alone. We should sample from tokens with higher estimated correctness, and reduce sampling where expected correctness is low. We propose simple strategies that achieve this goal: Greedy-Threshold makes sampling greedy at very low confidence steps. Calibrated-TopK and Calibrated-epsilon set truncation thresholds based on estimated rank-wise correctness. Together, our findings challenge prevailing heuristics about decoding under uncertainty and show gains across math and general reasoning benchmarks.
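The abstract describes the strategies only at a high level. As one illustration, the Greedy-Threshold rule could be sketched as follows; the function name and the threshold value `tau` are assumptions for this sketch, not details from the paper:

```python
import random

def greedy_threshold_sample(probs, tau=0.1, rng=None):
    """Sketch of a Greedy-Threshold decoding step.

    probs: next-token probability distribution (sums to 1).
    When the top probability -- the model's step confidence -- falls
    below tau, decode greedily rather than sampling, on the premise
    that sampling at very low-confidence steps most often hurts
    correctness.
    """
    rng = rng or random.Random(0)
    top = max(probs)
    if top < tau:  # very low confidence: fall back to greedy
        return probs.index(top)
    # otherwise, sample from the (untruncated) distribution
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1
```

Calibrated-TopK and Calibrated-epsilon would instead adjust the truncation set per step from estimated rank-wise correctness; the paper itself should be consulted for those estimators.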
Related papers
- On Calibration of Large Language Models: From Response To Capability [66.59139960234326]
Large language models (LLMs) are widely deployed as general-purpose problem solvers. We introduce capability calibration, which targets the model's expected accuracy on a query. Our results demonstrate that capability-calibrated confidence improves pass@$k$ prediction and inference budget allocation.
arXiv Detail & Related papers (2026-02-14T01:07:45Z) - A Single Revision Step Improves Token-Efficient LLM Reasoning [3.344806691289323]
We introduce Packet-Conditioned Revision (PACER), a training-free, inference-only framework for large language models. PACER enables reasoning traces to revise their conclusions through a structured coordination step. On challenging competitive math benchmarks, PACER matches or exceeds the accuracy of 256-sample majority voting.
arXiv Detail & Related papers (2026-02-02T21:28:42Z) - ConfSpec: Efficient Step-Level Speculative Reasoning via Confidence-Gated Verification [0.2578242050187029]
Step-level speculative reasoning aims to mitigate this cost, yet existing approaches face a long-standing trade-off. We propose ConfSpec, a confidence-gated cascaded verification framework that resolves this trade-off.
arXiv Detail & Related papers (2026-01-28T05:58:05Z) - Judging with Confidence: Calibrating Autoraters to Preference Distributions [56.17041629492863]
We argue that a reliable autorater must learn to model the full distribution of preferences defined by a target population. We present two learning methods tailored to different data conditions. Our results show that finetuning autoraters with a distribution-matching objective leads to verbalized probability predictions that are better aligned with the target preference distribution.
arXiv Detail & Related papers (2025-09-30T20:36:41Z) - Mind the Gap: A Closer Look at Tokenization for Multiple-Choice Question Answering with LLMs [16.357595595062946]
There is no consensus on how to tokenize the space following the colon, often overlooked as a trivial choice. Surprisingly, we are able to recommend one specific strategy -- tokenizing the space together with the answer letter. Our findings underscore the importance of careful evaluation design and highlight the need for standardized, transparent evaluation protocols.
arXiv Detail & Related papers (2025-09-18T14:47:58Z) - Cautious Next Token Prediction [62.74127603725369]
We propose a new training-free decoding strategy, dubbed Cautious Next Token Prediction (CNTP). In the decoding process, if the model has comparatively high prediction entropy at a certain step, we sample multiple trials starting from that step independently and stop when encountering any punctuation. We show that our proposed CNTP approach outperforms existing standard decoding strategies consistently by a clear margin.
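The CNTP procedure can be illustrated with a minimal sketch. The entropy gate is stated in the abstract; the selection rule shown here (keep the trial with the highest mean log-prob, i.e. the lowest perplexity) and all names and threshold values are assumptions for this sketch, not verified details of the paper:

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_branch(probs, threshold=1.0):
    """Gate: branch into multiple trial continuations only when the
    step's predictive entropy is comparatively high."""
    return entropy(probs) > threshold

def pick_trial(trial_logprobs):
    """Among candidate continuations (each a list of per-token
    log-probs), return the index of the one with the highest mean
    log-prob, i.e. the lowest perplexity."""
    return max(range(len(trial_logprobs)),
               key=lambda i: sum(trial_logprobs[i]) / len(trial_logprobs[i]))
```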
arXiv Detail & Related papers (2025-07-03T05:49:18Z) - Robust Conformal Prediction with a Single Binary Certificate [58.450154976190795]
Conformal prediction (CP) converts any model's output to prediction sets with a guarantee to cover the true label with (adjustable) high probability. We propose a robust conformal prediction that produces smaller sets even with significantly fewer MC samples.
arXiv Detail & Related papers (2025-03-07T08:41:53Z) - Inference Scaling fLaws: The Limits of LLM Resampling with Imperfect Verifiers [13.823743787003787]
Recent research has generated hope that inference scaling could allow weaker language models to match or exceed the accuracy of stronger models. We show that no amount of inference scaling of weaker models can enable them to match the single-sample accuracy of a sufficiently strong model. We also show that beyond accuracy, false positives may have other undesirable qualities, such as poor adherence to coding style conventions.
arXiv Detail & Related papers (2024-11-26T15:13:06Z) - Optimal Cross-Validation for Sparse Linear Regression [5.156484100374059]
We use k-fold cross-validation to select the sparsity and robustness of linear regressors. Cross-validation substantially increases the computational cost of sparse regression. We improve upon this state of affairs by solving 50-80% fewer mixed-integer optimization problems.
arXiv Detail & Related papers (2023-06-26T17:02:45Z) - GRACE: Discriminator-Guided Chain-of-Thought Reasoning [75.35436025709049]
We propose Guiding chain-of-thought ReAsoning with a CorrectnEss Discriminator (GRACE) to steer the decoding process towards producing correct reasoning steps.
GRACE employs a discriminator trained with a contrastive loss over correct and incorrect steps, which is used during decoding to score next-step candidates.
arXiv Detail & Related papers (2023-05-24T09:16:51Z) - Self-Evaluation Guided Beam Search for Reasoning [61.523627290397556]
We introduce a stepwise self-evaluation mechanism to guide and calibrate the reasoning process of Large Language Models (LLMs).
We propose a decoding algorithm integrating the self-evaluation guidance via beam search.
Our approach surpasses the corresponding Codex-backboned baselines in few-shot accuracy by $6.34\%$, $9.56\%$, and $5.46\%$ on the GSM8K, AQuA, and StrategyQA benchmarks, respectively.
arXiv Detail & Related papers (2023-05-01T02:37:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.