Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes
- URL: http://arxiv.org/abs/2507.22940v2
- Date: Sat, 02 Aug 2025 10:16:31 GMT
- Title: Trustworthy Reasoning: Evaluating and Enhancing Factual Accuracy in LLM Intermediate Thought Processes
- Authors: Rui Jiao, Yue Zhang, Jinku Li
- Abstract summary: We present a novel framework addressing a critical vulnerability in Large Language Models (LLMs): the prevalence of factual inaccuracies within intermediate reasoning steps despite correct final answers. This phenomenon poses substantial risks in high-stakes domains including healthcare, legal analysis, and scientific research.
- Score: 16.451488374845407
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present a novel framework addressing a critical vulnerability in Large Language Models (LLMs): the prevalence of factual inaccuracies within intermediate reasoning steps despite correct final answers. This phenomenon poses substantial risks in high-stakes domains including healthcare, legal analysis, and scientific research, where erroneous yet confidently presented reasoning can mislead users into dangerous decisions. Our framework integrates three core components: (1) a specialized fact-checking classifier trained on counterfactually augmented data to detect subtle factual inconsistencies within reasoning chains; (2) an enhanced Group Relative Policy Optimization (GRPO) reinforcement learning approach that balances factuality, coherence, and structural correctness through multi-dimensional rewards; and (3) a mechanistic interpretability method examining how factuality improvements manifest in model activations during reasoning processes. Extensive evaluation across multiple state-of-the-art models reveals concerning patterns: even leading models like Claude-3.7 and GPT-o1 demonstrate reasoning factual accuracy of only 81.93% and 82.57% respectively. Our approach significantly enhances factual robustness (up to 49.90% improvement) while maintaining or improving performance on challenging benchmarks including Math-500, AIME-2024, and GPQA. Furthermore, our neural activation-level analysis provides actionable insights into how factual enhancements reshape reasoning trajectories within model architectures, establishing foundations for future training methodologies that explicitly target factual robustness through activation-guided optimization.
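Component (2) of the abstract describes a reward that balances factuality, coherence, and structural correctness inside a GRPO loop. The Python sketch below illustrates only the general shape of such a multi-dimensional, group-relative reward; the scorer functions, weights, and names are assumptions, not the paper's trained components.

```python
"""Sketch: multi-dimensional reward + group-relative advantages (GRPO-style).

The three scorers are toy stand-ins for the paper's components (a trained
fact-checking classifier, a coherence judge, and a format checker); only the
overall shape of the computation is illustrated.
"""
from statistics import mean, pstdev


def factuality_score(steps, fact_checker):
    """Fraction of reasoning steps the (external) fact checker accepts."""
    verdicts = [fact_checker(s) for s in steps]
    return sum(verdicts) / max(len(verdicts), 1)


def coherence_score(steps):
    """Toy proxy: penalize empty or duplicated steps."""
    nonempty = [s for s in steps if s.strip()]
    return len(set(nonempty)) / max(len(steps), 1)


def structure_score(steps, final_answer):
    """Toy proxy: require at least one step and a non-empty final answer."""
    return 1.0 if steps and final_answer.strip() else 0.0


def combined_reward(steps, final_answer, fact_checker,
                    w_fact=0.5, w_coh=0.3, w_struct=0.2):
    """Weighted sum of the three dimensions; weights here are illustrative."""
    return (w_fact * factuality_score(steps, fact_checker)
            + w_coh * coherence_score(steps)
            + w_struct * structure_score(steps, final_answer))


def group_relative_advantages(rewards):
    """GRPO-style advantage: standardize rewards within a sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma or 1.0) for r in rewards]


if __name__ == "__main__":
    # Example: score a group of sampled reasoning chains for one prompt.
    fact_checker = lambda step: "2 + 2 = 5" not in step   # toy oracle
    group = [
        (["2 + 2 = 4", "so the answer is 4"], "4"),
        (["2 + 2 = 5", "so the answer is 5"], "5"),
    ]
    rewards = [combined_reward(s, a, fact_checker) for s, a in group]
    print(group_relative_advantages(rewards))
```

In an actual GRPO setup, these per-sample advantages would weight the policy-gradient update for each sampled reasoning chain.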
Related papers
- Discovering Hierarchical Latent Capabilities of Language Models via Causal Representation Learning [22.32435186013626]
We propose a causal representation learning framework wherein observed benchmark performance is modeled as a linear transformation of a few latent capability factors. Applying this approach to a comprehensive dataset encompassing over 1500 models evaluated across six benchmarks, we identify a concise three-node linear causal structure that reliably explains the observed performance variations.
arXiv Detail & Related papers (2025-06-12T06:07:42Z)
- Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning [82.43575191712726]
We introduce a fine-grained analytic framework to dissect the impact of reinforcement learning on reasoning. Our framework specifically investigates key elements that have been hypothesized to benefit from RL training.
arXiv Detail & Related papers (2025-06-05T07:53:59Z)
- Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback [59.078756231841574]
Critique-GRPO is an online RL framework that integrates both natural language and numerical feedback for effective policy optimization. We show Critique-GRPO consistently outperforms supervised learning and RL-based fine-tuning methods across eight challenging mathematical, STEM, and general reasoning tasks.
arXiv Detail & Related papers (2025-06-03T17:39:02Z)
- Rationales Are Not Silver Bullets: Measuring the Impact of Rationales on Model Performance and Reliability [70.4107059502882]
Training language models with rationale augmentation has been shown to be beneficial in many existing works. We conduct comprehensive investigations to thoroughly inspect the impact of rationales on model performance.
arXiv Detail & Related papers (2025-05-30T02:39:37Z)
- Structured Thinking Matters: Improving LLMs Generalization in Causal Inference Tasks [0.7988085110283119]
Recent results from the Corr2Cause dataset benchmark reveal that state-of-the-art LLMs only marginally outperform random baselines. We provide the model with the capability to structure its thinking by guiding it to build a structured knowledge graph. Experiments on the test subset of the Corr2Cause benchmark with the Qwen3-32B reasoning model show substantial gains over standard direct prompting methods.
arXiv Detail & Related papers (2025-05-23T15:37:40Z)
- Everything You Wanted to Know About LLM-based Vulnerability Detection But Were Afraid to Ask [30.819697001992154]
Large Language Models are a promising tool for automated vulnerability detection. Despite widespread adoption, a critical question remains: Are LLMs truly effective at detecting real-world vulnerabilities? This paper challenges three widely held community beliefs: that LLMs are (i) unreliable, (ii) insensitive to code patches, and (iii) performance-plateaued across model scales.
arXiv Detail & Related papers (2025-04-18T05:32:47Z)
- Enhancing LLM Reliability via Explicit Knowledge Boundary Modeling [48.15636223774418]
Large language models (LLMs) are prone to hallucination stemming from misaligned self-awareness. We propose the Explicit Knowledge Boundary Modeling framework, which integrates fast and slow reasoning systems to harmonize reliability and usability.
arXiv Detail & Related papers (2025-03-04T03:16:02Z)
- Bridging Internal Probability and Self-Consistency for Effective and Efficient LLM Reasoning [53.25336975467293]
We present the first theoretical error decomposition analysis of methods such as perplexity and self-consistency. Our analysis reveals a fundamental trade-off: perplexity methods suffer from substantial model error due to the absence of a proper consistency function. We propose Reasoning-Pruning Perplexity Consistency (RPC), which integrates perplexity with self-consistency, and Reasoning Pruning, which eliminates low-probability reasoning paths (see the sketch after this list).
arXiv Detail & Related papers (2025-02-01T18:09:49Z)
- A NotSo Simple Way to Beat Simple Bench [0.0]
This paper presents a novel framework for enhancing reasoning capabilities in large language models (LLMs). We propose a multi-step prompting strategy coupled with global consistency checks to improve model accuracy and robustness. Our results reveal model-specific strengths: Claude excels in maintaining logical consistency, while GPT-4o exhibits exploratory creativity but struggles with ambiguous prompts.
arXiv Detail & Related papers (2024-12-12T16:04:31Z)
- Ladder-of-Thought: Using Knowledge as Steps to Elevate Stance Detection [73.31406286956535]
We introduce the Ladder-of-Thought (LoT) for the stance detection task.
LoT directs small LMs to assimilate high-quality external knowledge, refining the intermediate rationales they produce.
Our empirical evaluations underscore LoT's efficacy, marking a 16% improvement over GPT-3.5 and a 10% enhancement compared to GPT-3.5 with CoT on the stance detection task.
arXiv Detail & Related papers (2023-08-31T14:31:48Z)
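The RPC entry above combines model-internal probability with self-consistency voting and prunes low-probability reasoning paths. The Python sketch below illustrates only that general idea; the pruning quantile, weighting scheme, and function names are assumptions, not the paper's actual algorithm.

```python
"""Sketch: probability-weighted self-consistency with pruning of
low-probability reasoning paths (illustrative only; thresholds and
weights are assumptions, not the RPC paper's exact method)."""
import math
from collections import defaultdict


def sequence_probability(token_logprobs):
    """Convert summed token log-probabilities into a path probability."""
    return math.exp(sum(token_logprobs))


def pruned_weighted_vote(paths, prune_quantile=0.25):
    """paths: list of (final_answer, token_logprobs) for sampled chains.

    1. Score each chain by its sequence probability.
    2. Prune the lowest-probability chains (reasoning pruning).
    3. Weight the surviving answers by probability (probability-aware
       self-consistency) and return the highest-weighted answer.
    """
    scored = [(ans, sequence_probability(lp)) for ans, lp in paths]
    cutoff = sorted(p for _, p in scored)[int(len(scored) * prune_quantile)]
    kept = [(ans, p) for ans, p in scored if p >= cutoff]

    votes = defaultdict(float)
    for ans, p in kept:
        votes[ans] += p
    return max(votes, key=votes.get)


if __name__ == "__main__":
    # Toy example: three sampled chains, two agreeing on "4".
    sampled = [
        ("4", [-0.1, -0.2, -0.1]),
        ("4", [-0.3, -0.2, -0.4]),
        ("5", [-2.0, -1.5, -2.5]),
    ]
    print(pruned_weighted_vote(sampled))  # -> "4"
```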