ReCEval: Evaluating Reasoning Chains via Correctness and Informativeness
- URL: http://arxiv.org/abs/2304.10703v2
- Date: Thu, 30 Nov 2023 23:33:07 GMT
- Title: ReCEval: Evaluating Reasoning Chains via Correctness and Informativeness
- Authors: Archiki Prasad, Swarnadeep Saha, Xiang Zhou, Mohit Bansal
- Abstract summary: ReCEval is a framework that evaluates reasoning chains via two key properties: correctness and informativeness.
We show that ReCEval effectively identifies various error types and yields notable improvements compared to prior methods.
- Score: 67.49087159888298
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-step reasoning ability is fundamental to many natural language tasks,
yet it is unclear what constitutes a good reasoning chain and how to evaluate
them. Most existing methods focus solely on whether the reasoning chain leads
to the correct conclusion, but this answer-oriented view may confound reasoning
quality with other spurious shortcuts to predict the answer. To bridge this
gap, we evaluate reasoning chains by viewing them as informal proofs that
derive the final answer. Specifically, we propose ReCEval (Reasoning Chain
Evaluation), a framework that evaluates reasoning chains via two key
properties: (1) correctness, i.e., each step makes a valid inference based on
information contained within the step, preceding steps, and input context, and
(2) informativeness, i.e., each step provides new information that is helpful
towards deriving the generated answer. We evaluate these properties by
developing metrics using natural language inference models and V-Information.
On multiple datasets, we show that ReCEval effectively identifies various error
types and yields notable improvements compared to prior methods. We analyze the
impact of step boundaries and previous steps on evaluating correctness, and
demonstrate that our informativeness metric captures the expected flow of
information in high-quality reasoning chains. Finally, we show that scoring
reasoning chains based on ReCEval improves downstream task performance. Our
code is publicly available at: https://github.com/archiki/ReCEval
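As a rough illustration of the correctness property, a step can be scored by how strongly an off-the-shelf NLI model judges it to be entailed by the input context plus the preceding steps. The sketch below is a simplified, hypothetical version of this idea and not the released ReCEval implementation (which additionally decomposes steps into finer-grained units); the model name and example chain are placeholders.

```python
# Simplified sketch of entailment-based step correctness (not the released
# ReCEval code). Each step is scored by the probability that it is entailed
# by the input context together with all preceding steps.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NLI_MODEL = "roberta-large-mnli"  # placeholder; any NLI model with an entailment class works
tokenizer = AutoTokenizer.from_pretrained(NLI_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(NLI_MODEL).eval()
ENTAILMENT_ID = model.config.label2id.get("ENTAILMENT", 2)

def step_correctness(context: str, steps: list[str]) -> list[float]:
    """Return an entailment probability for every step of a reasoning chain."""
    scores = []
    for i, step in enumerate(steps):
        premise = " ".join([context] + steps[:i])  # input context + preceding steps
        inputs = tokenizer(premise, step, return_tensors="pt", truncation=True)
        with torch.no_grad():
            probs = torch.softmax(model(**inputs).logits, dim=-1)
        scores.append(float(probs[0, ENTAILMENT_ID]))
    return scores  # a chain-level score can, e.g., take the minimum over steps

chain = ["A whale is a mammal.", "Mammals are warm-blooded.", "So a whale is warm-blooded."]
print(step_correctness("Question: Is a whale warm-blooded?", chain))
```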
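The informativeness property builds on V-information. For background, the standard pointwise V-information (PVI) from the V-information literature measures how much an input x eases the prediction of a target y under a model family V; ReCEval's information-gain metric applies this idea per step, asking how much each step raises the predictability of the generated answer beyond the preceding steps (the exact formulation is given in the paper).

```latex
% Pointwise V-information (background definition, not ReCEval's exact metric):
% g' is a model fit without access to x, g is fit with access to x.
\mathrm{PVI}(x \to y) \;=\; -\log_2 g'[\varnothing](y) \;+\; \log_2 g[x](y)
```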
Related papers
- Improve Vision Language Model Chain-of-thought Reasoning [86.83335752119741]
Chain-of-thought (CoT) reasoning in vision language models (VLMs) is crucial for improving interpretability and trustworthiness.
We show that training VLMs on short answers does not generalize well to reasoning tasks that require more detailed responses.
arXiv Detail & Related papers (2024-10-21T17:00:06Z) - Chain-of-Probe: Examining the Necessity and Accuracy of CoT Step-by-Step [81.50681925980135]
We propose a method to probe changes in the mind during the model's reasoning.
By analyzing patterns in mind change, we examine the correctness of the model's reasoning.
Our validation reveals that many responses, although correct in their final answer, contain errors in their reasoning process.
arXiv Detail & Related papers (2024-06-23T15:50:22Z) - LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback [71.95402654982095]
We propose Math-Minos, a natural language feedback-enhanced verifier.
Our experiments reveal that a small set of natural language feedback can significantly boost the performance of the verifier.
arXiv Detail & Related papers (2024-06-20T06:42:27Z) - General Purpose Verification for Chain of Thought Prompting [16.381123651223763]
We explore ways to improve the reasoning capabilities of Large Language Models (LLMs).
We propose three general principles that a model should adhere to while reasoning.
We apply these constraints to the reasoning steps generated by the LLM to improve the accuracy of the final generation.
arXiv Detail & Related papers (2024-04-30T21:15:17Z) - Information Re-Organization Improves Reasoning in Large Language Models [22.2946033364035]
We propose an information re-organization (InfoRE) method to enhance the reasoning ability of large language models (LLMs).
Our method involves extracting logical relationships from the contextual content, such as documents or paragraphs, and subsequently pruning redundant content to minimize noise.
To demonstrate the effectiveness of our approach in improving the reasoning ability, we conduct experiments using Llama2-70B, GPT-3.5, and GPT-4 on various contextually aware multi-hop reasoning tasks.
arXiv Detail & Related papers (2024-04-22T08:47:27Z) - A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains [33.46649770312231]
Prompting language models to provide step-by-step answers is a prominent approach for complex reasoning tasks.
No fine-grained step-level datasets are available to enable thorough evaluation of such verification methods.
We introduce REVEAL: Reasoning Verification Evaluation, a dataset to benchmark automatic verifiers of complex Chain-of-Thought reasoning.
arXiv Detail & Related papers (2024-02-01T12:46:45Z) - QA-NatVer: Question Answering for Natural Logic-based Fact Verification [11.002475880349452]
We propose to use question answering to predict natural logic operators.
In a few-shot setting on FEVER, our approach outperforms the best baseline by $4.3$ accuracy points.
A human evaluation indicates that our approach produces more plausible proofs with fewer erroneous natural logic operators than previous natural logic-based systems.
arXiv Detail & Related papers (2023-10-22T06:27:31Z) - ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning [63.77667876176978]
Large language models show improved downstream task interpretability when prompted to generate step-by-step reasoning to justify their final answers.
These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness is difficult.
We present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics.
arXiv Detail & Related papers (2022-12-15T15:52:39Z) - Search Methods for Sufficient, Socially-Aligned Feature Importance Explanations with In-Distribution Counterfactuals [72.00815192668193]
Feature importance (FI) estimates are a popular form of explanation, and they are commonly created and evaluated by computing the change in model confidence caused by removing certain input features at test time.
We study several under-explored dimensions of FI-based explanations, providing conceptual and empirical improvements for this form of explanation (a simplified sketch of the removal-based estimate follows this entry).
arXiv Detail & Related papers (2021-06-01T20:36:48Z)
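To make the removal-based estimate in the last entry concrete, here is a hypothetical sketch (not code from that paper): each input word is occluded in turn and scored by the resulting drop in the classifier's confidence for its original prediction; the model name is a placeholder.

```python
# Hypothetical occlusion-style feature importance for a text classifier:
# score each word by the drop in confidence when that word is removed.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CLF_MODEL = "distilbert-base-uncased-finetuned-sst-2-english"  # placeholder classifier
tokenizer = AutoTokenizer.from_pretrained(CLF_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(CLF_MODEL).eval()

def occlusion_importance(text: str) -> list[tuple[str, float]]:
    words = text.split()
    with torch.no_grad():
        base = torch.softmax(model(**tokenizer(text, return_tensors="pt")).logits, dim=-1)
        pred = int(base.argmax())            # class predicted on the full input
        base_conf = float(base[0, pred])
        scores = []
        for i, word in enumerate(words):
            ablated = " ".join(words[:i] + words[i + 1:])  # remove one word
            probs = torch.softmax(model(**tokenizer(ablated, return_tensors="pt")).logits, dim=-1)
            scores.append((word, base_conf - float(probs[0, pred])))
    return scores

print(occlusion_importance("The movie was surprisingly good despite the slow start"))
```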