Temporalizing Confidence: Evaluation of Chain-of-Thought Reasoning with Signal Temporal Logic
- URL: http://arxiv.org/abs/2506.08243v1
- Date: Mon, 09 Jun 2025 21:21:12 GMT
- Title: Temporalizing Confidence: Evaluation of Chain-of-Thought Reasoning with Signal Temporal Logic
- Authors: Zhenjiang Mao, Artem Bisliouk, Rohith Reddy Nama, Ivan Ruchkin
- Abstract summary: We propose a structured framework that models stepwise confidence as a temporal signal and evaluates it using Signal Temporal Logic (STL). In particular, we define formal STL-based constraints to capture desirable temporal properties and compute robustness scores that serve as structured, interpretable confidence estimates. Our approach consistently improves calibration metrics and provides more reliable uncertainty estimates than conventional confidence aggregation and post-hoc calibration.
- Score: 0.12499537119440243
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have shown impressive performance in mathematical reasoning tasks when guided by Chain-of-Thought (CoT) prompting. However, they tend to produce highly confident yet incorrect outputs, which poses significant risks in domains like education, where users may lack the expertise to assess reasoning steps. To address this, we propose a structured framework that models stepwise confidence as a temporal signal and evaluates it using Signal Temporal Logic (STL). In particular, we define formal STL-based constraints to capture desirable temporal properties and compute robustness scores that serve as structured, interpretable confidence estimates. Our approach also introduces a set of uncertainty reshaping strategies to enforce smoothness, monotonicity, and causal consistency across the reasoning trajectory. Experiments show that our approach consistently improves calibration metrics and provides more reliable uncertainty estimates than conventional confidence aggregation and post-hoc calibration.
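To make the core mechanism concrete, the sketch below (a minimal illustration, not the authors' released code) treats the per-step confidences of one CoT solution as a discrete-time signal, scores it against two STL-style properties via their quantitative robustness (the worst-case margin of satisfaction), and applies one simple monotonicity reshaping. The threshold tau, bound delta, and the specific reshaping rule are illustrative assumptions, not the paper's exact specification.

```python
import numpy as np

def rob_always_above(conf, tau=0.7):
    """Quantitative robustness of the STL formula G(conf_t >= tau):
    the worst-case margin by which the trace clears the threshold.
    Positive means every reasoning step stays above tau."""
    return float(np.min(conf - tau))

def rob_smooth(conf, delta=0.2):
    """Robustness of G(|conf_{t+1} - conf_t| <= delta): flags abrupt
    confidence jumps between adjacent reasoning steps."""
    d = np.abs(np.diff(conf))
    return float(np.min(delta - d)) if d.size else float(delta)

def reshape_monotone(conf):
    """One illustrative reshaping: clip each step to the running minimum,
    so no step claims more certainty than the steps it depends on."""
    return np.minimum.accumulate(conf)

# Hypothetical per-step confidences from one CoT solution.
trace = np.array([0.95, 0.90, 0.55, 0.92])
print(rob_always_above(trace))   # -0.15: step 3 dips below tau
print(rob_smooth(trace))         # -0.17: the 0.55 -> 0.92 jump is too sharp
print(reshape_monotone(trace))   # [0.95 0.90 0.55 0.55]
```

A negative robustness value localizes which temporal property the trajectory violates, which is what makes the resulting score interpretable rather than a single opaque aggregate.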
Related papers
- Does More Inference-Time Compute Really Help Robustness? [50.47666612618054]
We show that small-scale, open-source models can benefit from inference-time scaling. We identify an important security risk, intuitively motivated and empirically verified as an inverse scaling law. We urge practitioners to carefully weigh these subtle trade-offs before applying inference-time scaling in security-sensitive, real-world applications.
arXiv Detail & Related papers (2025-07-21T18:08:38Z)
- T-CPDL: A Temporal Causal Probabilistic Description Logic for Developing Logic-RAG Agent [5.439020425819001]
Temporal Causal Probabilistic Description Logic (T-CPDL) is an integrated framework that extends Description Logic with temporal interval operators, explicit causal relationships, and probabilistic annotations. T-CPDL substantially improves inference accuracy, interpretability, and confidence calibration of language model outputs. This work also lays the groundwork for developing advanced Logic-Retrieval-Augmented Generation (Logic-RAG) frameworks.
arXiv Detail & Related papers (2025-06-23T12:11:15Z)
- Verbalized Confidence Triggers Self-Verification: Emergent Behavior Without Explicit Reasoning Supervision [12.287123198288079]
Uncertainty calibration is essential for the safe deployment of large language models (LLMs). We find that supervised fine-tuning with scalar confidence labels alone suffices to elicit self-verification behavior of language models. We propose a simple rethinking method that boosts performance via test-time scaling based on calibrated uncertainty.
arXiv Detail & Related papers (2025-06-04T08:56:24Z)
- Aurora: Are Android Malware Classifiers Reliable under Distribution Shift? [51.12297424766236]
AURORA is a framework to evaluate malware classifiers based on their confidence quality and operational resilience. AURORA is further complemented by a set of metrics designed to go beyond point-in-time performance. The fragility we observe in state-of-the-art frameworks suggests the need for a return to the whiteboard.
arXiv Detail & Related papers (2025-05-28T20:22:43Z)
- ConCISE: Confidence-guided Compression in Step-by-step Efficient Reasoning [75.1101108949743]
Large Reasoning Models (LRMs) perform strongly on complex reasoning tasks via Chain-of-Thought (CoT) prompting. However, LRMs often suffer from verbose outputs caused by redundant content, which increases computational overhead and degrades user experience. We propose ConCISE, a framework that simplifies reasoning chains by reinforcing the model's confidence during inference.
arXiv Detail & Related papers (2025-05-08T01:40:40Z)
- Enhancing LLM Reliability via Explicit Knowledge Boundary Modeling [48.15636223774418]
Large language models (LLMs) frequently hallucinate due to misaligned self-awareness. Existing approaches mitigate hallucinations via uncertainty estimation or query rejection. We propose the Explicit Knowledge Boundary Modeling framework to integrate fast and slow reasoning systems.
arXiv Detail & Related papers (2025-03-04T03:16:02Z)
- FTS: A Framework to Find a Faithful TimeSieve [27.500523513445497]
We propose a novel framework aimed at identifying and rectifying unfaithfulness in TimeSieve. Our framework is designed to enhance the model's stability and resilience, ensuring that its outputs are less susceptible to the aforementioned factors.
arXiv Detail & Related papers (2024-05-30T02:59:49Z)
- Score Matching-based Pseudolikelihood Estimation of Neural Marked Spatio-Temporal Point Process with Uncertainty Quantification [59.81904428056924]
We introduce SMASH: a Score MAtching estimator for learning marked spatio-temporal point processes (STPPs) with uncertainty quantification.
Specifically, our framework adopts a normalization-free objective by estimating the pseudolikelihood of marked STPPs through score matching.
The superior performance of our proposed framework is demonstrated through extensive experiments in both event prediction and uncertainty quantification.
arXiv Detail & Related papers (2023-10-25T02:37:51Z)
- When Does Confidence-Based Cascade Deferral Suffice? [69.28314307469381]
Cascades are a classical strategy to enable inference cost to vary adaptively across samples.
A deferral rule determines whether to invoke the next classifier in the sequence, or to terminate prediction.
Despite being oblivious to the structure of the cascade, confidence-based deferral often works remarkably well in practice (a minimal sketch of such a rule follows this entry).
arXiv Detail & Related papers (2023-07-06T04:13:57Z)
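As a rough illustration of such a rule (our sketch, not the paper's implementation), a two-stage cascade defers whenever the cheaper model's top-class probability falls below a threshold; `predict_proba` here assumes an sklearn-style classifier API, and the models and thresholds are hypothetical inputs.

```python
def cascade_predict(x, models, thresholds):
    """Confidence-based deferral: query models from cheapest to most
    expensive; terminate as soon as one is confident enough, otherwise
    fall through to the final (most expensive) model.

    models: classifiers ordered by increasing cost (sklearn-style API).
    thresholds: one confidence threshold per non-final model.
    """
    for model, tau in zip(models[:-1], thresholds):
        probs = model.predict_proba([x])[0]   # assumed sklearn-style API
        if probs.max() >= tau:                # confident enough: terminate
            return int(probs.argmax())
    return int(models[-1].predict_proba([x])[0].argmax())
```

Note that the deferral rule never inspects the downstream models themselves, which is the sense in which it is oblivious to the cascade's structure.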