A Single Revision Step Improves Token-Efficient LLM Reasoning
- URL: http://arxiv.org/abs/2602.02828v1
- Date: Mon, 02 Feb 2026 21:28:42 GMT
- Title: A Single Revision Step Improves Token-Efficient LLM Reasoning
- Authors: Yingchuan Zhang, Terry Ma, Wenxuan Zhong, Ping Ma
- Abstract summary: We introduce Packet-Conditioned Revision (PACER), a training-free, inference-only framework for large language models. PACER enables reasoning traces to revise their conclusions through a structured coordination step. On challenging competitive math benchmarks, PACER matches or exceeds the accuracy of 256-sample majority voting.
- Score: 3.344806691289323
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) achieve higher accuracy on challenging reasoning tasks by scaling test-time compute through multiple trajectory sampling. However, standard aggregation methods like majority voting or individual confidence-based filtering face a fundamental "blind spot": they evaluate each trace in isolation. As problems scale in difficulty, models often generate hallucinated paths that exhibit misleadingly high confidence, causing the true solution to be suppressed by a narrow margin in traditional voting. We ask: can we enable traces to "peer-review" each other to resolve these near-miss errors? We introduce Packet-Conditioned Revision (PACER), a training-free, inference-only framework that enables reasoning traces to revise their conclusions through a structured coordination step. After a preliminary screening of generated traces, PACER constructs a compact consensus packet containing (i) unique candidate answers, (ii) their aggregated confidence scores, and (iii) representative reasoning summaries for each candidate answer. Individual traces then perform a targeted self-review conditioned on this packet, allowing them to identify specific logical junctions where they diverged from the broader consensus and pivot if their original reasoning is found to be flawed. Final predictions are obtained via confidence-weighted voting over these revised trajectories. On challenging competitive math benchmarks such as AIME and BRUMO, PACER matches or exceeds the accuracy of 256-sample majority voting, significantly outperforming raw ensemble baselines by transforming simple consensus into a collaborative logical refinement process.
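The aggregation steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Trace` fields, the function names, the choice of summing confidences as the aggregate, and the choice of the highest-confidence trace as the representative summary are all assumptions made for illustration. The revision step itself, which requires re-prompting the model with the packet, is omitted.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class Trace:
    answer: str        # final answer extracted from the trace
    confidence: float  # hypothetical per-trace confidence score
    summary: str       # short summary of the trace's reasoning


def majority_vote(traces):
    """Baseline: return the most frequent answer, ignoring confidence."""
    return Counter(t.answer for t in traces).most_common(1)[0][0]


def build_packet(traces):
    """Group traces by answer and keep, per unique answer, the summed
    confidence and one representative summary (an assumed design)."""
    groups = defaultdict(list)
    for t in traces:
        groups[t.answer].append(t)
    packet = []
    for answer, members in groups.items():
        representative = max(members, key=lambda t: t.confidence)
        packet.append({
            "answer": answer,
            "total_confidence": sum(t.confidence for t in members),
            "summary": representative.summary,
        })
    # Strongest candidate first.
    return sorted(packet, key=lambda c: c["total_confidence"], reverse=True)


def confidence_weighted_vote(traces):
    """Return the answer with the largest summed confidence."""
    scores = defaultdict(float)
    for t in traces:
        scores[t.answer] += t.confidence
    return max(scores, key=scores.get)


# Toy data (invented): two low-confidence traces agree, one confident trace dissents.
traces = [
    Trace("42", 0.9, "solves the recurrence directly"),
    Trace("17", 0.4, "drops a case in the casework"),
    Trace("17", 0.3, "same casework, same omission"),
]
print(majority_vote(traces))             # 17
print(confidence_weighted_vote(traces))  # 42
print([c["answer"] for c in build_packet(traces)])
```

In this toy case the two rules disagree, which is the kind of narrow-margin situation the abstract describes. In PACER, according to the abstract, the final confidence-weighted vote is taken over traces after they have been revised against the packet, not over the raw traces as shown here.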
Related papers
- Beyond Raw Detection Scores: Markov-Informed Calibration for Boosting Machine-Generated Text Detection [105.14032334647932]
Machine-generated texts (MGTs) pose risks such as disinformation and phishing, highlighting the need for reliable detection. Metric-based methods, which extract statistically distinguishable features of MGTs, are often more practical than complex model-based methods that are prone to overfitting. We propose a Markov-informed score calibration strategy that models two relationships of context detection scores that may aid calibration.
arXiv Detail & Related papers (2026-02-08T16:06:12Z)
- VERGE: Formal Refinement and Guidance Engine for Verifiable LLM Reasoning [4.3414302048068745]
We present a neurosymbolic framework that combines Large Language Models with SMT solvers to produce verification-guided answers. We introduce three key innovations: (1) multi-model consensus via formal semantic equivalence checking, (2) semantic routing that directs different claim types to appropriate verification strategies, and (3) precise logical error localization via Minimal Correction Subsets. With the GPT-OSS-120B model, VERGE demonstrates an average performance uplift of 18.7% at convergence across a set of reasoning benchmarks compared to single-pass approaches.
arXiv Detail & Related papers (2026-01-27T20:59:11Z)
- CoT-Seg: Rethinking Segmentation with Chain-of-Thought Reasoning and Self-Correction [50.67483317563736]
This paper aims to explore a system that can think step-by-step, look up information if needed, generate results, self-evaluate its own results, and refine the results. We introduce CoT-Seg, a training-free framework that rethinks reasoning segmentation by combining chain-of-thought reasoning with self-correction.
arXiv Detail & Related papers (2026-01-24T11:41:54Z)
- Beyond Majority Voting: Towards Fine-grained and More Reliable Reward Signal for Test-Time Reinforcement Learning [12.354777054071379]
Test-time reinforcement learning mitigates reliance on annotated data by using majority voting results as pseudo-labels. This voting strategy often induces confirmation bias and suffers from sparse rewards, limiting the overall performance. We propose subgroup-specific step-wise confidence-weighted pseudo-label estimation (SCOPE) to address these issues.
arXiv Detail & Related papers (2025-12-17T07:21:54Z)
- TRACE: A Framework for Analyzing and Enhancing Stepwise Reasoning in Vision-Language Models [9.607579442309639]
We introduce TRACE, a framework for Transparent Reasoning And Consistency Evaluation. At its core, TRACE leverages Auxiliary Reasoning Sets (ARS) to decompose complex problems. Our experiments show that consistency across ARS correlates with final-answer correctness. TRACE defines confidence regions that distinguish reliable from unreliable reasoning paths.
arXiv Detail & Related papers (2025-12-05T18:40:18Z)
- In-Token Rationality Optimization: Towards Accurate and Concise LLM Reasoning via Self-Feedback [38.915062716409686]
InTRO is a new framework that enables both token-level exploration and self-feedback for accurate and concise reasoning. InTRO consistently outperforms other baselines, raising solution accuracy by up to 20% relative to the base model. Its chains of thought are notably more concise, exhibiting reduced verbosity.
arXiv Detail & Related papers (2025-11-13T01:47:06Z)
- Sample Smart, Not Hard: Correctness-First Decoding for Better Reasoning in LLMs [72.82403830490084]
We argue that the decoding rule should be calibrated by correctness, not confidence alone. We propose simple strategies that achieve this goal: Greedy-Threshold makes sampling greedy at very low confidence steps. Together, our findings challenge prevailing heuristics about decoding under uncertainty and show gains across math and general reasoning benchmarks.
arXiv Detail & Related papers (2025-10-07T14:46:12Z)
- Critique to Verify: Accurate and Honest Test-Time Scaling with RL-Trained Verifiers [63.99316853136304]
Mirror-Critique is a framework that trains a verifier with informative critiques. We deploy a small instruction-tuned model to synthesize high-quality critique data. The resulting Mirror-Verifier is deployed to evaluate candidate solutions by generating multiple critiques per solution.
arXiv Detail & Related papers (2025-09-27T06:50:24Z)
- CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta-error patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
- Project-Probe-Aggregate: Efficient Fine-Tuning for Group Robustness [61.45587642780908]
We propose a three-step approach for parameter-efficient fine-tuning of image-text foundation models. Our method improves its two key components: minority sample identification and the robust training algorithm. Our theoretical analysis shows that our PPA enhances minority group identification and is Bayes optimal for minimizing the balanced group error.
arXiv Detail & Related papers (2025-03-12T15:46:12Z)