Latent Veracity Inference for Identifying Errors in Stepwise Reasoning
- URL: http://arxiv.org/abs/2505.11824v2
- Date: Fri, 26 Sep 2025 03:18:43 GMT
- Title: Latent Veracity Inference for Identifying Errors in Stepwise Reasoning
- Authors: Minsu Kim, Jean-Pierre Falet, Oliver E. Richardson, Xiaoyin Chen, Moksh Jain, Sungjin Ahn, Sungsoo Ahn, Yoshua Bengio
- Abstract summary: We introduce Veracity Search (VS), a discrete search algorithm over veracity assignments. VS performs otherwise intractable inference in the posterior distribution over latent veracity values. An Amortized Veracity Inference (AVI) model trained on VS's pseudo-labels generalizes VS, enabling accurate zero-shot veracity inference in novel contexts.
- Score: 78.29317733206643
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Chain-of-Thought (CoT) reasoning has advanced the capabilities and transparency of language models (LMs); however, reasoning chains can contain inaccurate statements that reduce performance and trustworthiness. To address this, we propose to augment each reasoning step in a CoT with a latent veracity (or correctness) variable. To efficiently explore this expanded space, we introduce Veracity Search (VS), a discrete search algorithm over veracity assignments. It performs otherwise intractable inference in the posterior distribution over latent veracity values by leveraging the LM's joint likelihood over veracity and the final answer as a proxy reward. This efficient inference-time verification method facilitates supervised fine-tuning of an Amortized Veracity Inference (AVI) machine by providing pseudo-labels for veracity. AVI generalizes VS, enabling accurate zero-shot veracity inference in novel contexts. Empirical results demonstrate that VS reliably identifies errors in logical (ProntoQA), mathematical (GSM8K), and commonsense (CommonsenseQA) reasoning benchmarks, with AVI achieving comparable zero-shot accuracy. Finally, we demonstrate the utility of latent veracity inference for providing feedback during self-correction and self-improvement.
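To make the search concrete, here is a minimal sketch of a beam-style discrete search over per-step veracity assignments, ranked by a proxy reward in the spirit of VS. The scorer `lm_log_joint` (standing in for the LM's joint log-likelihood of a veracity labeling together with the final answer) is a hypothetical callable; the paper's actual search and scoring details may differ.

```python
# Minimal sketch, not the paper's implementation: beam search over binary
# per-step veracity labels, ranked by a proxy reward. `lm_log_joint` is a
# hypothetical callable standing in for the LM's joint likelihood over a
# veracity labeling and the final answer.
from typing import Callable, List, Tuple

def veracity_search(
    steps: List[str],
    answer: str,
    lm_log_joint: Callable[[List[str], Tuple[bool, ...], str], float],
    beam_size: int = 4,
) -> Tuple[bool, ...]:
    beams: List[Tuple[Tuple[bool, ...], float]] = [((), 0.0)]
    for _ in range(len(steps)):
        candidates = []
        for prefix, _score in beams:
            for label in (True, False):
                assignment = prefix + (label,)
                # Pad the yet-unlabeled steps as "correct" so the scorer
                # always sees a complete labeling of the chain.
                padded = assignment + (True,) * (len(steps) - len(assignment))
                candidates.append((assignment, lm_log_joint(steps, padded, answer)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams[0][0]  # highest-scoring full veracity assignment
```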
Related papers
- Balancing Faithfulness and Performance in Reasoning via Multi-Listener Soft Execution [79.98699884805636]
Reasoning Execution by Multiple Listeners (REMUL) is a multi-party reinforcement learning approach. REMUL builds on the hypothesis that reasoning traces which other parties can follow will be more faithful. Speakers are rewarded for producing reasoning that is clear to listeners.
arXiv Detail & Related papers (2026-02-18T02:55:55Z) - Stochastic CHAOS: Why Deterministic Inference Kills, and Distributional Variability Is the Heartbeat of Artificial Cognition [14.945980804235885]
We argue that, for LLMs, deterministic inference kills. It kills the ability to model uncertainty, suppresses emergent abilities, collapses reasoning into a single brittle path, and weakens safety alignment by hiding tail risks.
arXiv Detail & Related papers (2026-01-12T06:19:09Z) - Illusions of Confidence? Diagnosing LLM Truthfulness via Neighborhood Consistency [78.91846841708586]
We show that even facts answered with perfect self-consistency can rapidly collapse under mild contextual interference. We propose Neighbor-Consistency Belief (NCB), a structural measure of belief that evaluates response coherence across a conceptual neighborhood. We also present Structure-Aware Training (SAT), which optimizes a context-invariant belief structure and reduces long-tail knowledge brittleness by approximately 30%.
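As a rough illustration of the neighborhood-consistency idea (my reading of the summary, with hypothetical helpers `ask` and `same_answer` rather than the paper's interface):

```python
# Rough sketch of a neighborhood-consistency check: ask the same question
# under several related, mildly interfering contexts and measure how often
# the answer agrees with the interference-free answer.
from typing import Callable, List

def neighborhood_consistency(
    question: str,
    neighborhood_contexts: List[str],
    ask: Callable[[str, str], str],           # (question, context) -> answer
    same_answer: Callable[[str, str], bool],  # semantic agreement check
) -> float:
    baseline = ask(question, "")  # answer with no interfering context
    agreements = sum(
        same_answer(baseline, ask(question, ctx)) for ctx in neighborhood_contexts
    )
    return agreements / max(len(neighborhood_contexts), 1)
```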
arXiv Detail & Related papers (2026-01-09T16:23:21Z) - Uncertainty as Feature Gaps: Epistemic Uncertainty Quantification of LLMs in Contextual Question-Answering [29.4458902836278]
We introduce a task-agnostic, token-level uncertainty measure defined as the cross-entropy between the predictive distribution of the given model and the unknown true distribution. We derive an upper bound for uncertainty and show that it can be interpreted as semantic feature gaps in the given model's hidden representations. We apply this generic framework to the contextual QA task and hypothesize that three features approximate this gap: context-reliance, context comprehension, and honesty.
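Read literally, that measure is a cross-entropy against the unknown true distribution, which splits into an irreducible entropy term plus a model-gap term; the display below is an illustrative rendering of that reading (notation mine, not the paper's exact definition):

```latex
% Illustrative rendering, notation mine: token-level uncertainty as a
% cross-entropy against the unknown true distribution p^*, decomposed into
% an irreducible part and a model-gap (epistemic) part.
U(x) \,=\, H\big(p^{*}(\cdot \mid x),\, p_{\theta}(\cdot \mid x)\big)
      \,=\, \underbrace{H\big(p^{*}(\cdot \mid x)\big)}_{\text{irreducible}}
      \,+\, \underbrace{\mathrm{KL}\big(p^{*}(\cdot \mid x)\,\big\|\, p_{\theta}(\cdot \mid x)\big)}_{\text{model gap}}
```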
arXiv Detail & Related papers (2025-10-03T02:09:25Z) - VeriLLM: A Lightweight Framework for Publicly Verifiable Decentralized Inference [3.8760740008451156]
We introduce VeriLLM, a publicly verifiable protocol for decentralized large language model (LLM) inference. VeriLLM combines lightweight empirical rerunning with cryptographic commitments, allowing verifiers to validate results at approximately 1% of the underlying inference cost. We show that VeriLLM achieves reliable public verifiability with minimal overhead.
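A minimal sketch of the commit-then-spot-check pattern this summary describes, assuming deterministic reruns; the hashing scheme, sampling rate, and helper names are illustrative, not the protocol's specification:

```python
# Minimal sketch, assuming deterministic inference: the prover commits to
# per-request output hashes, and a verifier reruns a small random sample
# and checks the reruns against those commitments.
import hashlib
import random
from typing import Callable, Dict

def commit(outputs: Dict[str, str]) -> Dict[str, str]:
    return {rid: hashlib.sha256(text.encode()).hexdigest() for rid, text in outputs.items()}

def spot_check(
    commitments: Dict[str, str],
    rerun: Callable[[str], str],   # request id -> recomputed output
    sample_rate: float = 0.01,     # roughly the ~1% figure quoted above
) -> bool:
    sampled = [rid for rid in commitments if random.random() < sample_rate]
    return all(
        hashlib.sha256(rerun(rid).encode()).hexdigest() == commitments[rid]
        for rid in sampled
    )
```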
arXiv Detail & Related papers (2025-09-29T04:07:32Z) - Semantic Voting: A Self-Evaluation-Free Approach for Efficient LLM Self-Improvement on Unverifiable Open-ended Tasks [38.058215007885096]
Self-evaluation for large language models (LLMs) incurs high computational overhead and introduces overconfidence issues due to intrinsic biases. We propose a novel self-evaluation-free approach for unverifiable tasks, designed for lightweight yet effective self-improvement.
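One simple way to realize self-evaluation-free selection is to cluster sampled responses by semantic equivalence and keep the largest cluster; the sketch below illustrates that voting idea with a hypothetical `semantically_equal` predicate, not the paper's actual procedure:

```python
# Sketch of semantic voting: group sampled responses into semantic
# clusters and return a representative of the largest one, with no
# self-evaluation step.
from typing import Callable, List

def semantic_vote(
    responses: List[str],
    semantically_equal: Callable[[str, str], bool],  # e.g. NLI or embedding match
) -> str:
    clusters: List[List[str]] = []
    for response in responses:
        for cluster in clusters:
            if semantically_equal(cluster[0], response):
                cluster.append(response)
                break
        else:
            clusters.append([response])
    return max(clusters, key=len)[0]  # representative of the biggest cluster
```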
arXiv Detail & Related papers (2025-09-27T02:44:05Z) - Probabilistic Soundness Guarantees in LLM Reasoning Chains [39.228405100824695]
Autoregressive Reasoning Entailment Stability (ARES) is a novel probabilistic framework that prevents error propagation by judging each claim based only on previously-assessed sound premises. ARES achieves state-of-the-art performance across four benchmarks and demonstrates superior robustness on very long synthetic reasoning chains.
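The key mechanism, judging each claim only against the premises already accepted as sound, can be sketched as follows; `entailment_prob` is a hypothetical scorer (an LLM judge or NLI model), not the paper's interface:

```python
# Sketch of stepwise soundness checking: a claim is scored only against
# previously accepted premises, so rejected (unsound) steps never feed
# into later judgments.
from typing import Callable, List

def assess_chain(
    claims: List[str],
    entailment_prob: Callable[[List[str], str], float],
    threshold: float = 0.5,
) -> List[bool]:
    sound_premises: List[str] = []
    verdicts: List[bool] = []
    for claim in claims:
        verdict = entailment_prob(sound_premises, claim) >= threshold
        verdicts.append(verdict)
        if verdict:
            sound_premises.append(claim)  # only sound claims become premises
    return verdicts
```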
arXiv Detail & Related papers (2025-07-17T09:40:56Z) - Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens [14.78605805191225]
We investigate how the semantics of intermediate tokens, often anthropomorphized as "thoughts" or reasoning traces, actually influence model performance. We show that, despite significant improvements over the solution-only baseline, models trained on entirely correct traces still produce invalid reasoning traces when arriving at correct solutions.
arXiv Detail & Related papers (2025-05-19T23:29:23Z) - Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
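A minimal sketch of cross-checking a natural-language chain of thought against an executable program of thought, with hypothetical generator callables; the paper's collaborative verification is richer than this simple agreement test:

```python
# Sketch of CoT/PoT cross-checking: produce both a natural-language answer
# and a small program, execute the program, and use agreement between the
# two answers as a verification signal. Illustrative only; executing
# model-written code should be sandboxed in practice.
from typing import Callable

def collaborative_verify(
    question: str,
    gen_cot_answer: Callable[[str], str],
    gen_pot_program: Callable[[str], str],
) -> bool:
    cot_answer = gen_cot_answer(question)
    program = gen_pot_program(question)  # expected to assign a variable `answer`
    scope: dict = {}
    exec(program, scope)
    pot_answer = str(scope.get("answer", ""))
    return cot_answer.strip() == pot_answer.strip()
```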
arXiv Detail & Related papers (2024-10-05T05:21:48Z) - Learning to Check: Unleashing Potentials for Self-Correction in Large Language Models [5.463333911506443]
We aim to enhance the self-checking capabilities of large language models (LLMs) by constructing training data for checking tasks.
We propose a specialized checking format called "Step CoT Check".
Experiments demonstrate that fine-tuning with the "Step CoT Check" format significantly improves the self-checking and self-correction abilities of LLMs.
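To illustrate what constructing checking-task training data can look like, here is a hedged sketch; the field names and prompt wording are made up for illustration and are not the paper's actual "Step CoT Check" format:

```python
# Illustrative sketch of building a checking example from a solution whose
# first erroneous step (if any) is known; not the paper's actual format.
from typing import Dict, List, Optional

def make_check_example(
    question: str, steps: List[str], first_error: Optional[int]
) -> Dict[str, str]:
    numbered = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(steps))
    prompt = f"{question}\n{numbered}\nCheck each step and report the first incorrect one."
    target = (
        "All steps are correct."
        if first_error is None
        else f"Step {first_error + 1} is incorrect."
    )
    return {"input": prompt, "output": target}
```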
arXiv Detail & Related papers (2024-02-20T14:23:23Z) - A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains [33.46649770312231]
Prompting language models to provide step-by-step answers is a prominent approach for complex reasoning tasks.
No fine-grained step-level datasets are available to enable thorough evaluation of such verification methods.
We introduce REVEAL: Reasoning Verification Evaluation, a dataset to benchmark automatic verifiers of complex Chain-of-Thought reasoning.
arXiv Detail & Related papers (2024-02-01T12:46:45Z) - Understanding and Mitigating Classification Errors Through Interpretable Token Patterns [58.91023283103762]
Characterizing errors in easily interpretable terms gives insight into whether a classifier is prone to making systematic errors.
We propose to discover those patterns of tokens that distinguish correct and erroneous predictions.
We show that our method, Premise, performs well in practice.
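As a toy illustration of the goal of finding tokens that separate correct from erroneous predictions (Premise itself uses a more principled pattern-mining search over token combinations), a smoothed document-frequency contrast might look like this:

```python
# Toy sketch: rank tokens by how much more often they appear in documents
# the classifier got wrong than in documents it got right. Illustrative
# only; the actual method mines richer token patterns, not single tokens.
from collections import Counter
from typing import List, Tuple

def discriminative_tokens(
    correct: List[List[str]], erroneous: List[List[str]], top_k: int = 10
) -> List[Tuple[str, float]]:
    correct_df = Counter(tok for doc in correct for tok in set(doc))
    error_df = Counter(tok for doc in erroneous for tok in set(doc))
    scores = {
        tok: (error_df[tok] + 1) / (len(erroneous) + 2)
        - (correct_df[tok] + 1) / (len(correct) + 2)
        for tok in set(correct_df) | set(error_df)
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```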
arXiv Detail & Related papers (2023-11-18T00:24:26Z) - Goodhart's Law Applies to NLP's Explanation Benchmarks [57.26445915212884]
We critically examine two sets of metrics: the ERASER metrics (comprehensiveness and sufficiency) and the EVAL-X metrics.
We show that we can inflate a model's comprehensiveness and sufficiency scores dramatically without altering its predictions or explanations on in-distribution test inputs.
Our results raise doubts about the ability of current metrics to guide explainability research, underscoring the need for a broader reassessment of what precisely these metrics are intended to capture.
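For context, the two ERASER metrics at issue are standardly defined as probability drops when an extracted rationale r is erased from, or kept as the only part of, the input x for the predicted label:

```latex
% Standard erasure-based definitions of the two rationale metrics, for a
% model prediction \hat{y} on input x with extracted rationale r.
\mathrm{comprehensiveness}(x, r) \,=\, p_{\theta}(\hat{y} \mid x) - p_{\theta}(\hat{y} \mid x \setminus r)
\qquad
\mathrm{sufficiency}(x, r) \,=\, p_{\theta}(\hat{y} \mid x) - p_{\theta}(\hat{y} \mid r)
```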
arXiv Detail & Related papers (2023-08-28T03:03:03Z) - RCOT: Detecting and Rectifying Factual Inconsistency in Reasoning by Reversing Chain-of-Thought [56.558892336235914]
Reversing Chain-of-Thought (RCoT) is a novel method to improve large language models' reasoning abilities.
RCoT automatically detects and rectifies factual inconsistencies in generated solutions.
We show that manually written fine-grained feedback can dramatically improve LLMs' reasoning abilities.
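The reverse-then-compare loop can be sketched as follows; both helpers are hypothetical LM-backed calls rather than the paper's interface:

```python
# Sketch of the reconstruct-and-compare idea: rebuild the problem from the
# model's solution, then diff the reconstruction against the original
# problem to surface factual inconsistencies (an empty list means none found).
from typing import Callable, List

def rcot_check(
    problem: str,
    solution: str,
    reconstruct_problem: Callable[[str], str],
    list_inconsistencies: Callable[[str, str], List[str]],
) -> List[str]:
    reconstructed = reconstruct_problem(solution)
    return list_inconsistencies(problem, reconstructed)
```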
arXiv Detail & Related papers (2023-05-19T08:02:52Z) - Zero-shot Faithful Factual Error Correction [53.121642212060536]
Faithfully correcting factual errors is critical for maintaining the integrity of textual knowledge bases and preventing hallucinations in sequence-to-sequence models.
We present a zero-shot framework that formulates questions about input claims, looks for correct answers in the given evidence, and assesses the faithfulness of each correction based on its consistency with the evidence.
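A hedged sketch of that question-answer-revise loop, with every helper a hypothetical stand-in for a component of the framework:

```python
# Sketch: generate questions about the claim, answer them from the
# evidence, revise the claim with those answers, and only keep the
# revision if it remains faithful to the evidence (otherwise abstain).
from typing import Callable, List, Optional

def correct_claim(
    claim: str,
    evidence: str,
    gen_questions: Callable[[str], List[str]],
    answer_from_evidence: Callable[[str, str], str],
    revise_claim: Callable[[str, List[str]], str],
    is_faithful: Callable[[str, str], bool],
) -> Optional[str]:
    questions = gen_questions(claim)
    answers = [answer_from_evidence(q, evidence) for q in questions]
    revised = revise_claim(claim, answers)
    return revised if is_faithful(revised, evidence) else None
```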
arXiv Detail & Related papers (2023-05-13T18:55:20Z) - Converge to the Truth: Factual Error Correction via Iterative Constrained Editing [30.740281040892086]
We propose VENCE, a novel method for factual error correction (FEC) with minimal edits.
VENCE formulates the FEC problem as iterative sampling editing actions with respect to a target density function.
Experiments on a public dataset show that VENCE improves the well-adopted SARI metric by 5.3 (or a relative improvement of 11.8%) over the previous best distantly-supervised methods.
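The "iterative sampling of editing actions with respect to a target density" can be pictured as a Metropolis-style loop; the sketch below is generic MCMC over edits with hypothetical helpers, not VENCE's actual proposal or density:

```python
# Generic sketch of density-guided iterative editing: propose a local edit
# and accept it with a Metropolis-style rule under a target (log-)density
# over candidate corrections. Illustrative only.
import math
import random
from typing import Callable

def iterative_edit(
    claim: str,
    propose_edit: Callable[[str], str],
    log_target_density: Callable[[str], float],
    num_steps: int = 50,
) -> str:
    current, current_lp = claim, log_target_density(claim)
    for _ in range(num_steps):
        candidate = propose_edit(current)
        candidate_lp = log_target_density(candidate)
        # Accept uphill moves always, downhill moves with the usual probability.
        if math.log(random.random() + 1e-12) < candidate_lp - current_lp:
            current, current_lp = candidate, candidate_lp
    return current
```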
arXiv Detail & Related papers (2022-11-22T10:03:13Z) - Robustness and Accuracy Could Be Reconcilable by (Proper) Definition [109.62614226793833]
The trade-off between robustness and accuracy has been widely studied in the adversarial literature.
We find that it may stem from an improperly defined robust error, which imposes an inductive bias of local invariance.
The proposed self-consistent robust error (SCORE) facilitates, by definition, the reconciliation between robustness and accuracy, while still handling the worst-case uncertainty.
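One way to picture the contrast drawn here (notation mine, and only an approximation of the paper's formulation): a local-invariance surrogate penalizes divergence from the model's own clean prediction, whereas a self-consistent robust error references the ground-truth label under the worst-case perturbation:

```latex
% Illustrative contrast, notation mine: D is a divergence, f_theta the
% model, y the true label, and delta an epsilon-bounded perturbation.
\underbrace{\max_{\|\delta\|\le\epsilon} D\big(f_{\theta}(x)\,\big\|\, f_{\theta}(x+\delta)\big)}_{\text{local invariance}}
\qquad \text{vs.} \qquad
\underbrace{\max_{\|\delta\|\le\epsilon} D\big(y \,\big\|\, f_{\theta}(x+\delta)\big)}_{\text{self-consistent robust error}}
```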
arXiv Detail & Related papers (2022-02-21T10:36:09Z)