Related papers: Audit-of-Understanding: Posterior-Constrained Inference for Mathematical Reasoning in Language Models

URL: http://arxiv.org/abs/2510.10252v2
Date: Sat, 18 Oct 2025 10:20:04 GMT
Title: Audit-of-Understanding: Posterior-Constrained Inference for Mathematical Reasoning in Language Models
Authors: Samir Abdaljalil, Erchin Serpedin, Khalid Qaraqe, Hasan Kurban
Abstract summary: We propose Audit-of-Understanding (AoU), a framework that constrains inference to validated premises through three phases. AoU is posterior-constrained inference, connecting to selective prediction and rejection learning. Our contributions are threefold: (i) theoretical guarantees under perfect validation, (ii) excess-risk bounds under imperfect audits, and (iii) tractability analysis.
Score: 2.453830698820308
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) often generate reasoning traces that appear coherent but rest on unsupported assumptions, leading to hallucinated conclusions. Prior work mainly addresses factual hallucinations or relies on post-hoc verification, leaving reasoning-induced hallucinations largely unaddressed. We propose Audit-of-Understanding (AoU), a framework that constrains inference to validated premises through three phases: (1) decomposing a query into candidate assumptions, (2) auditing their support, and (3) conditioning inference only on the validated subset. Formally, AoU is posterior-constrained inference, connecting to selective prediction and rejection learning. Our contributions are threefold: (i) theoretical guarantees under perfect validation, (ii) excess-risk bounds under imperfect audits, and (iii) tractability analysis. Empirically, AoU improves both accuracy and faithfulness on GSM8K, MultiArith, and SVAMP, achieving up to +30% gains on GSM8K, +45% on MultiArith, and consistent +20-28% improvements on SVAMP over Chain-of-Thought, Self-Consistency, and CoT-Decoding. Code is available at https://anonymous.4open.science/r/audit-of-understanding-E28B.
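
The three phases named in the abstract (decompose, audit, condition on the validated subset) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the toy query, and the number-matching audit rule are all hypothetical stand-ins for what would be LLM calls in the actual method.

```python
import re
from typing import Callable, List, Tuple


def numbers_in(text: str) -> List[int]:
    """Extract all integers that appear in a piece of text."""
    return [int(tok) for tok in re.findall(r"\d+", text)]


def toy_decompose(query: str) -> List[str]:
    # Hypothetical stand-in for phase 1 (an LLM proposing candidate assumptions).
    return [
        "Tom starts with 3 apples",
        "Tom buys 4 apples",
        "Tom gives away 2 apples",  # not stated anywhere in the query
    ]


def toy_audit(query: str, assumption: str) -> bool:
    # Hypothetical stand-in for phase 2: an assumption counts as supported
    # only if every number it mentions also occurs in the query.
    allowed = set(numbers_in(query))
    return all(n in allowed for n in numbers_in(assumption))


def toy_infer(query: str, premises: List[str]) -> int:
    # Hypothetical stand-in for phase 3: answer using validated premises only.
    # For this toy addition problem, the answer is the sum of their numbers.
    return sum(n for p in premises for n in numbers_in(p))


def audit_of_understanding(
    query: str,
    decompose: Callable[[str], List[str]],
    audit: Callable[[str, str], bool],
    infer: Callable[[str, List[str]], int],
) -> Tuple[int, List[str]]:
    candidates = decompose(query)                           # phase 1
    validated = [a for a in candidates if audit(query, a)]  # phase 2
    return infer(query, validated), validated               # phase 3


QUERY = "Tom has 3 apples and buys 4 more. How many apples does he have?"
answer, validated = audit_of_understanding(
    QUERY, toy_decompose, toy_audit, toy_infer
)
print(answer, validated)
```

In this toy run the unsupported "gives away 2 apples" premise is filtered out by the audit, so it cannot influence the final answer. That filtering is the behavior the abstract describes as conditioning inference only on validated premises.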

Related papers

Preventing the Collapse of Peer Review Requires Verification-First AI [49.995126139461085]
We propose truth-coupling, i.e. how tightly venue scores track latent scientific truth. We formalize two forces that drive a phase transition toward proxy-sovereign evaluation.
arXiv Detail & Related papers (2026-01-23T17:17:32Z)
Critical or Compliant? The Double-Edged Sword of Reasoning in Chain-of-Thought Explanations [60.27156500679296]
We study the role of Chain-of-Thought (CoT) explanations in moral scenarios by systematically perturbing reasoning chains and manipulating delivery tones. Our findings reveal two key effects: (1) users often trust with outcome agreement, sustaining reliance even when reasoning is flawed. These results highlight how CoT explanations can simultaneously clarify and mislead, underscoring the need for NLP systems to provide explanations that encourage scrutiny and critical thinking rather than blind trust.
arXiv Detail & Related papers (2025-11-15T02:38:49Z)
AI Annotation Orchestration: Evaluating LLM verifiers to Improve the Quality of LLM Annotations in Learning Analytics [0.17240671897505613]
Large Language Models (LLMs) are increasingly used to annotate learning interactions, yet concerns about reliability limit their utility. We test whether verification-oriented orchestration, i.e. prompting models to check their own labels (self-verification) or audit one another (cross-verification), improves qualitative coding of tutoring discourse.
arXiv Detail & Related papers (2025-11-12T22:35:36Z)
Hallucination Detection via Internal States and Structured Reasoning Consistency in Large Language Models [7.18947815679122]
Internal State Probing and Chain-of-Thought Verification are used to detect hallucinations in large language models. We develop a unified framework that bridges the gap between the two methods. Our framework consistently and significantly outperforms strong baselines.
arXiv Detail & Related papers (2025-10-13T15:31:21Z)
VeriLLM: A Lightweight Framework for Publicly Verifiable Decentralized Inference [4.158412539499328]
We present a publicly verifiable protocol for decentralized inference for large language models (LLMs). We introduce an isomorphic inference-verification network that multiplexes both roles on the same set of GPU workers. We provide a formal game-theoretic analysis and prove that, under our incentives, honest inference and verification constitute a Nash equilibrium.
arXiv Detail & Related papers (2025-09-29T04:07:32Z)
Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning [53.45095336430027]
We develop a unified framework that combines implicit retrieval and structured collaboration. On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy. Results on SuperGPQA and TRQA confirm robustness across domains.
arXiv Detail & Related papers (2025-09-25T14:05:55Z)
Causal Prompting for Implicit Sentiment Analysis with Large Language Models [21.39152516811571]
Implicit Sentiment Analysis (ISA) aims to infer sentiment that is implied rather than explicitly stated. Recent prompting-based methods using Large Language Models (LLMs) have shown promise in ISA. We propose CAPITAL, a causal prompting framework that incorporates front-door adjustment into CoT reasoning.
arXiv Detail & Related papers (2025-07-01T03:01:09Z)
CLATTER: Comprehensive Entailment Reasoning for Hallucination Detection [60.98964268961243]
We propose that guiding models to perform a systematic and comprehensive reasoning process allows models to execute much finer-grained and accurate entailment decisions. We define a 3-step reasoning process, consisting of (i) claim decomposition, (ii) sub-claim attribution and entailment classification, and (iii) aggregated classification, showing that such guided reasoning indeed yields improved hallucination detection.
arXiv Detail & Related papers (2025-06-05T17:02:52Z)
Latent Veracity Inference for Identifying Errors in Stepwise Reasoning [78.29317733206643]
We introduce Veracity Search (VS), a discrete search algorithm over veracity assignments. It performs otherwise intractable inference in the posterior distribution over latent veracity values. It generalizes VS, enabling accurate zero-shot veracity inference in novel contexts.
arXiv Detail & Related papers (2025-05-17T04:16:36Z)
Combining Fine-Tuning and LLM-based Agents for Intuitive Smart Contract Auditing with Justifications [18.138452572457552]
iAudit is a framework for intuitive smart contract auditing with justifications. On a dataset of 263 real smart contract vulnerabilities, iAudit achieves an F1 score of 91.21% and an accuracy of 91.11%.
arXiv Detail & Related papers (2024-03-24T09:26:53Z)
From Chaos to Clarity: Claim Normalization to Empower Fact-Checking [57.024192702939736]
Claim Normalization (aka ClaimNorm) aims to decompose complex and noisy social media posts into more straightforward and understandable forms. We propose CACN, a pioneering approach that leverages chain-of-thought and claim check-worthiness estimation. Our experiments demonstrate that CACN outperforms several baselines across various evaluation measures.
arXiv Detail & Related papers (2023-10-22T16:07:06Z)
Verify-and-Edit: A Knowledge-Enhanced Chain-of-Thought Framework [26.7264686036634]
Large language models (LLMs) have become the norm in NLP, demonstrating good performance in generation and reasoning tasks. One of its most fatal disadvantages is the lack of factual correctness. Generating unfactual texts not only leads to lower performances but also degrades the trust and validity of their applications.
arXiv Detail & Related papers (2023-05-05T03:49:14Z)

This list is automatically generated from the titles and abstracts of the papers on this site.