Verifying Large Language Models' Reasoning Paths via Correlation Matrix Rank
- URL: http://arxiv.org/abs/2510.24299v1
- Date: Tue, 28 Oct 2025 11:01:10 GMT
- Title: Verifying Large Language Models' Reasoning Paths via Correlation Matrix Rank
- Authors: Jiayu Liu, Wei Dai, Zhenya Huang, Ning Miao, Enhong Chen
- Abstract summary: Large language models (LLMs) are prone to errors and hallucinations. How to check their outputs effectively and efficiently has become a critical problem in their applications.
- Score: 71.09032766271493
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the strong reasoning ability of large language models (LLMs), they are prone to errors and hallucinations. As a result, how to check their outputs effectively and efficiently has become a critical problem in their applications. Existing checking methods heavily rely on external resources, such as trained verifiers (e.g., process/outcome reward models) or elaborate prompts, which lead to high computational overhead and are only applicable to specific domains. In this paper, we investigate whether the internal behaviors of LLMs already imply the credibility of their reasoning paths. Specifically, we find that the rank of the correlation matrix between the input problem and the output reasoning path is a robust indicator of reasoning correctness. Unlike other correctness indicators for LLMs, the calculation of the correlation matrix relies only on the LLM itself, which avoids the hassle of training a separate model or designing complicated prompts. Based on it, we design a simple, plug-and-play Self-Indicator method to reweight candidate reasoning paths, which achieves significant performance improvements over other voting and verification methods with very little computational overhead. Our experiments across multiple LLMs of varying scales and model families further show the effectiveness of Self-Indicator. It achieves over 75% accuracy in distinguishing correct reasoning paths from incorrect ones and, in turn, improves accuracy on three reasoning benchmarks by more than 8%.
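The abstract describes the indicator only at a high level, so the following is a minimal, illustrative Python sketch of how a correlation-matrix-rank score might be computed from a model's own hidden states and used to reweight candidate reasoning paths. The model name, the hidden-state layer, the singular-value tolerance, the token-alignment shortcut, and the assumption that a higher rank signals a more credible path are all illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch (not the authors' code): score each candidate reasoning path
# by the rank of a cross-correlation matrix between problem-token and reasoning-token
# hidden states, then use the scores as weights in answer voting.
from collections import defaultdict

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # any causal LM that exposes hidden states
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()


@torch.no_grad()
def correlation_rank(problem: str, reasoning_path: str, layer: int = -1, tol: float = 1e-3) -> int:
    """Numerical rank of a correlation matrix between problem tokens and reasoning tokens."""
    prompt_ids = tokenizer(problem, return_tensors="pt").input_ids.to(model.device)
    full_ids = tokenizer(problem + reasoning_path, return_tensors="pt").input_ids.to(model.device)
    # Hidden states of one chosen layer for the full (problem + reasoning) sequence.
    hidden = model(full_ids, output_hidden_states=True).hidden_states[layer][0]  # (seq, dim)

    # Rough split of the sequence into problem tokens and reasoning tokens
    # (assumes the problem tokenization is a prefix of the concatenated tokenization).
    n_prompt = prompt_ids.shape[1]
    h_problem = hidden[:n_prompt].float()
    h_reason = hidden[n_prompt:].float()

    def standardize(x: torch.Tensor) -> torch.Tensor:
        # Center and normalize each token vector so inner products behave like correlations.
        x = x - x.mean(dim=-1, keepdim=True)
        return x / (x.norm(dim=-1, keepdim=True) + 1e-8)

    corr = standardize(h_problem) @ standardize(h_reason).T  # (n_problem, n_reason)
    # Numerical rank: count singular values above a relative tolerance (an assumed criterion).
    s = torch.linalg.svdvals(corr)
    return int((s > tol * s.max()).sum().item())


def self_indicator_vote(problem: str, candidates: list[tuple[str, str]]) -> str:
    """Weighted vote over (reasoning_path, final_answer) pairs, weights from correlation rank."""
    scores = defaultdict(float)
    for path, answer in candidates:
        # Assumed direction: higher rank -> larger weight; flip if the paper's finding is the reverse.
        scores[answer] += correlation_rank(problem, path)
    return max(scores, key=scores.get)
```

A usage example would sample several candidate reasoning paths for the same problem, extract each path's final answer, and call self_indicator_vote(problem, candidates) so that the answer supported by the highest-weighted paths wins, in the spirit of the reweighted voting the abstract describes.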
Related papers
- Demystifying Errors in LLM Reasoning Traces: An Empirical Study of Code Execution Simulation [7.377446354867118]
We conduct the first empirical study on runtime behavior inference with large language models (LLMs). We evaluate four state-of-the-art reasoning LLMs and develop a taxonomy with nine categories of inference errors. Using failures in the Computation category as a case study, our experiments show that this approach corrects 58 percent of such errors.
arXiv Detail & Related papers (2025-11-28T21:29:09Z) - Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads [104.9566359759396]
We propose a lightweight alternative for step-level reasoning verification based on data-driven uncertainty scores. Our findings suggest that the internal states of LLMs encode their uncertainty and can serve as reliable signals for reasoning verification.
arXiv Detail & Related papers (2025-11-09T03:38:29Z) - Can LLMs Correct Themselves? A Benchmark of Self-Correction in LLMs [57.10533368622962]
Self-correction of large language models (LLMs) emerges as a critical component for enhancing their reasoning performance. This study introduces CorrectBench, a benchmark developed to evaluate the effectiveness of self-correction strategies. Our findings reveal that: 1) Self-correction methods can improve accuracy, especially for complex reasoning tasks; 2) Mixing different self-correction strategies yields further improvements, though it reduces efficiency; and 3) Reasoning LLMs (e.g., DeepSeek-R1) have limited optimization under additional self-correction methods and have high time costs.
arXiv Detail & Related papers (2025-10-17T02:40:19Z) - Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers [59.168391398830515]
We evaluate 12 pre-trained LLMs and one specialized fact-verifier, using a collection of examples from 14 fact-checking benchmarks. We highlight the importance of addressing annotation errors and ambiguity in datasets. Frontier LLMs with few-shot in-context examples, often overlooked in previous works, achieve top-tier performance.
arXiv Detail & Related papers (2025-06-16T10:32:10Z) - Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z) - Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z) - Benchmarking Causal Study to Interpret Large Language Models for Source Code [6.301373791541809]
This paper introduces a benchmarking strategy named Galeras comprised of curated testbeds for three SE tasks.
We illustrate the insights of our benchmarking strategy by conducting a case study on the performance of ChatGPT under distinct prompt engineering methods.
arXiv Detail & Related papers (2023-08-23T20:32:12Z) - SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning [55.76083560152823]
SelfCheck is a general-purpose zero-shot verification schema for recognizing errors in step-by-step reasoning.
We test SelfCheck on three datasets (GSM8K, MathQA, and MATH) and find that it successfully recognizes errors and, in turn, increases final answer accuracies.
arXiv Detail & Related papers (2023-08-01T10:31:36Z)