Related papers: Confidence over Time: Confidence Calibration with Temporal Logic for Large Language Model Reasoning

Confidence over Time: Confidence Calibration with Temporal Logic for Large Language Model Reasoning

URL: http://arxiv.org/abs/2601.13387v1
Date: Mon, 19 Jan 2026 20:48:06 GMT
Title: Confidence over Time: Confidence Calibration with Temporal Logic for Large Language Model Reasoning
Authors: Zhenjiang Mao, Anirudhh Venkat, Artem Bisliouk, Akshat Kothiyal, Sindhura Kumbakonam Subramanian, Saithej Singhu, Ivan Ruchkin,
Abstract summary: We propose to characterize the stepwise confidence signal using Signal Temporal Logic (STL)<n>Using a discriminative STL mining procedure, we discover temporal formulas that distinguish confidence signals of correct and incorrect responses.<n>We develop a confidence estimation approach that informs STL blocks with parameter hypernetworks.
Score: 0.058633603884542605
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) increasingly rely on long-form, multi-step reasoning to solve complex tasks such as mathematical problem solving and scientific question answering. Despite strong performance, existing confidence estimation methods typically reduce an entire reasoning process to a single scalar score, ignoring how confidence evolves throughout the generation. As a result, these methods are often sensitive to superficial factors such as response length or verbosity, and struggle to distinguish correct reasoning from confidently stated errors. We propose to characterize the stepwise confidence signal using Signal Temporal Logic (STL). Using a discriminative STL mining procedure, we discover temporal formulas that distinguish confidence signals of correct and incorrect responses. Our analysis found that the STL patterns generalize across tasks, and numeric parameters exhibit sensitivity to individual questions. Based on these insights, we develop a confidence estimation approach that informs STL blocks with parameter hypernetworks. Experiments on multiple reasoning tasks show our confidence scores are more calibrated than the baselines.

Related papers

Thinking by Subtraction: Confidence-Driven Contrastive Decoding for LLM Reasoning [58.331709210563616]
Thinking by Subtraction is a confidence-driven contrastive decoding approach.<n>A small subset of low-confidence tokens disproportionately contributes to reasoning errors and unnecessary output expansion.<n>Our method, Confidence-Driven Contrastive Decoding, detects low-confidence tokens during decoding and intervenes at these positions.
arXiv Detail & Related papers (2026-02-20T14:13:22Z)
Balancing Faithfulness and Performance in Reasoning via Multi-Listener Soft Execution [79.98699884805636]
Reasoning Execution by Multiple Listeners (REMUL) is a multi-party reinforcement learning approach.<n>REMUL builds on the hypothesis that reasoning traces which other parties can follow will be more faithful.<n>Speakers are rewarded for producing reasoning that is clear to listeners.
arXiv Detail & Related papers (2026-02-18T02:55:55Z)
Are Reasoning LLMs Robust to Interventions on Their Chain-of-Thought? [79.86483056611105]
Reasoning LLMs generate step-by-step chains of thought before giving an answer.<n>How robust are these reasoning traces to disruptions that occur within them?<n>We introduce a controlled evaluation framework that perturbs a model's own CoT at fixed timesteps.
arXiv Detail & Related papers (2026-02-07T10:02:58Z)
Recurrent Confidence Chain: Temporal-Aware Uncertainty Quantification in Large Language Models [0.0]
Uncertainty of answers can help prevent misleading or serious hallucinations for users.<n>Current methods analyze long reasoning sequences by filtering unrelated tokens and examining potential connections between nearby tokens or sentences.<n>We propose a novel method that incorporates inter-step attention to analyze semantic correlations across steps.
arXiv Detail & Related papers (2026-01-19T20:04:34Z)
Fact-Checking with Large Language Models via Probabilistic Certainty and Consistency [7.806516365113592]
Large language models (LLMs) are increasingly used in applications requiring factual accuracy.<n>While fact-checking can mitigate these errors, existing methods typically retrieve external evidence indiscriminately.<n>We introduce Probabilistic Certainty and Consistency (PCC), a framework that estimates factual confidence.
arXiv Detail & Related papers (2026-01-05T21:57:41Z)
Beware of Reasoning Overconfidence: Pitfalls in the Reasoning Process for Multi-solution Tasks [54.31998314008198]
Large Language Models (LLMs) excel in reasoning tasks requiring a single correct answer, but they perform poorly in multi-solution tasks.<n>We attribute this limitation to textbfreasoning overconfidence: a tendency to express undue certainty in an incomplete solution set.<n>We propose the textbfcognitive-rigidity hypothesis, which posits that overconfidence arises when the reasoning process prematurely converges on a narrow set of thought paths.
arXiv Detail & Related papers (2025-12-01T14:35:06Z)
Thinking, Faithful and Stable: Mitigating Hallucinations in LLMs [0.4115305983711515]
This project develops a self correcting framework for large language models (LLMs)<n>Rather than relying solely on final answer correctness, our approach leverages fine grained uncertainty signals.<n>We design a composite reward function that penalizes unjustified high confidence and entropy spikes.
arXiv Detail & Related papers (2025-11-19T23:09:26Z)
Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads [104.9566359759396]
We propose a lightweight alternative for step-level reasoning verification based on data-driven uncertainty scores.<n>Our findings suggest that the internal states of LLMs encode their uncertainty and can serve as reliable signals for reasoning verification.
arXiv Detail & Related papers (2025-11-09T03:38:29Z)
BrowseConf: Confidence-Guided Test-Time Scaling for Web Agents [58.05949210993854]
We investigate whether search agents have the ability to communicate their own confidence through verbalized confidence scores after long sequences of actions.<n>We propose Test-Time Scaling (TTS) methods that use confidence scores to determine answer quality, encourage the model to try again until reaching a satisfactory confidence level.
arXiv Detail & Related papers (2025-10-27T15:58:51Z)
Trace Length is a Simple Uncertainty Signal in Reasoning Models [18.432200654999082]
We show that reasoning trace length is a useful confidence estimator in large reasoning models.<n>Our work reveals that reasoning post-training fundamentally alters the relationship between trace length and accuracy.<n>We identify high-entropy or "forking" tokens as playing a key role in the mechanism.
arXiv Detail & Related papers (2025-10-12T02:04:06Z)
ConfTuner: Training Large Language Models to Express Their Confidence Verbally [58.63318088243125]
Large Language Models (LLMs) are increasingly deployed in high-stakes domains such as science, law, and healthcare.<n>LLMs are often observed to generate incorrect answers with high confidence, a phenomenon known as "overconfidence"
arXiv Detail & Related papers (2025-08-26T09:25:32Z)
Temporalizing Confidence: Evaluation of Chain-of-Thought Reasoning with Signal Temporal Logic [0.12499537119440243]
We propose a structured framework that models stepwise confidence as a temporal signal and evaluates it using Signal Temporal Logic (STL)<n>In particular, we define formal STL-based constraints to capture desirable temporal properties and compute scores that serve as structured, interpretable confidence estimates.<n>Our approach consistently improves calibration metrics and provides more reliable uncertainty estimates than conventional confidence aggregation and post-hoc calibration.
arXiv Detail & Related papers (2025-06-09T21:21:12Z)
ConCISE: Confidence-guided Compression in Step-by-step Efficient Reasoning [64.93140713419561]
Large Reasoning Models (LRMs) perform strongly in complex reasoning tasks via Chain-of-Thought (CoT) prompting, but often suffer from verbose outputs.<n>Existing fine-tuning-based compression methods either operate post-hoc pruning, risking disruption to reasoning coherence, or rely on sampling-based selection.<n>We introduce ConCISE, a framework designed to generate concise reasoning chains, integrating Confidence Injection to boost reasoning confidence, and Early Stopping to terminate reasoning when confidence is sufficient.
arXiv Detail & Related papers (2025-05-08T01:40:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.