ReCEval: Evaluating Reasoning Chains via Correctness and Informativeness
- URL: http://arxiv.org/abs/2304.10703v2
- Date: Thu, 30 Nov 2023 23:33:07 GMT
- Title: ReCEval: Evaluating Reasoning Chains via Correctness and Informativeness
- Authors: Archiki Prasad, Swarnadeep Saha, Xiang Zhou, Mohit Bansal
- Abstract summary: ReCEval is a framework that evaluates reasoning chains via two key properties: correctness and informativeness.
We show that ReCEval effectively identifies various error types and yields notable improvements compared to prior methods.
- Score: 67.49087159888298
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-step reasoning ability is fundamental to many natural language tasks,
yet it is unclear what constitutes a good reasoning chain and how to evaluate
them. Most existing methods focus solely on whether the reasoning chain leads
to the correct conclusion, but this answer-oriented view may confound reasoning
quality with other spurious shortcuts to predict the answer. To bridge this
gap, we evaluate reasoning chains by viewing them as informal proofs that
derive the final answer. Specifically, we propose ReCEval (Reasoning Chain
Evaluation), a framework that evaluates reasoning chains via two key
properties: (1) correctness, i.e., each step makes a valid inference based on
information contained within the step, preceding steps, and input context, and
(2) informativeness, i.e., each step provides new information that is helpful
towards deriving the generated answer. We evaluate these properties by
developing metrics using natural language inference models and V-Information.
On multiple datasets, we show that ReCEval effectively identifies various error
types and yields notable improvements compared to prior methods. We analyze the
impact of step boundaries and previous steps on evaluating correctness, and
demonstrate that our informativeness metric captures the expected flow of
information in high-quality reasoning chains. Finally, we show that scoring
reasoning chains based on ReCEval improves downstream task performance. Our
code is publicly available at: https://github.com/archiki/ReCEval
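As a rough illustration of the correctness property, a step can be scored by how strongly an off-the-shelf NLI model judges it to be entailed by the input context plus the preceding steps. The sketch below is a simplified, hypothetical version of this idea and not the released ReCEval implementation (which additionally decomposes steps into finer-grained units); the model name and example chain are placeholders.

```python
# Simplified sketch of entailment-based step correctness (not the released
# ReCEval code). Each step is scored by the probability that it is entailed
# by the input context together with all preceding steps.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NLI_MODEL = "roberta-large-mnli"  # placeholder; any NLI model with an entailment class works
tokenizer = AutoTokenizer.from_pretrained(NLI_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(NLI_MODEL).eval()
ENTAILMENT_ID = model.config.label2id.get("ENTAILMENT", 2)

def step_correctness(context: str, steps: list[str]) -> list[float]:
    """Return an entailment probability for every step of a reasoning chain."""
    scores = []
    for i, step in enumerate(steps):
        premise = " ".join([context] + steps[:i])  # input context + preceding steps
        inputs = tokenizer(premise, step, return_tensors="pt", truncation=True)
        with torch.no_grad():
            probs = torch.softmax(model(**inputs).logits, dim=-1)
        scores.append(float(probs[0, ENTAILMENT_ID]))
    return scores  # a chain-level score can, e.g., take the minimum over steps

chain = ["A whale is a mammal.", "Mammals are warm-blooded.", "So a whale is warm-blooded."]
print(step_correctness("Question: Is a whale warm-blooded?", chain))
```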
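The informativeness property builds on V-information. For background, the standard pointwise V-information (PVI) from the V-information literature measures how much an input x eases the prediction of a target y under a model family V; ReCEval's information-gain metric applies this idea per step, asking how much each step raises the predictability of the generated answer beyond the preceding steps (the exact formulation is given in the paper).

```latex
% Pointwise V-information (background definition, not ReCEval's exact metric):
% g' is a model fit without access to x, g is fit with access to x.
\mathrm{PVI}(x \to y) \;=\; -\log_2 g'[\varnothing](y) \;+\; \log_2 g[x](y)
```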
Related papers
- Improve Vision Language Model Chain-of-thought Reasoning [86.83335752119741]
Chain-of-thought (CoT) reasoning in vision language models (VLMs) is crucial for improving interpretability and trustworthiness.
We show that training VLMs on short answers does not generalize well to reasoning tasks that require more detailed responses.
arXiv Detail & Related papers (2024-10-21T17:00:06Z) - Chain-of-Probe: Examining the Necessity and Accuracy of CoT Step-by-Step [81.50681925980135]
We propose a method to probe changes in the mind during the model's reasoning.
By analyzing patterns in mind change, we examine the correctness of the model's reasoning.
Our validation reveals that many responses, although correct in their final answer, contain errors in their reasoning process.
arXiv Detail & Related papers (2024-06-23T15:50:22Z) - LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback [71.95402654982095]
We propose Math-Minos, a natural language feedback-enhanced verifier.
Our experiments reveal that a small set of natural language feedback can significantly boost the performance of the verifier.
arXiv Detail & Related papers (2024-06-20T06:42:27Z) - General Purpose Verification for Chain of Thought Prompting [16.381123651223763]
We explore ways to improve the reasoning capabilities of Large Language Models (LLMs).
We propose three general principles that a model should adhere to while reasoning.
We apply these constraints to the reasoning steps generated by the LLM to improve the accuracy of the final generation.
arXiv Detail & Related papers (2024-04-30T21:15:17Z) - Information Re-Organization Improves Reasoning in Large Language Models [22.2946033364035]
We propose an information re-organization (InfoRE) method to enhance the reasoning ability of large language models (LLMs).
Our method involves extracting logical relationships from the contextual content, such as documents or paragraphs, and subsequently pruning redundant content to minimize noise.
To demonstrate the effectiveness of our approach in improving the reasoning ability, we conduct experiments using Llama2-70B, GPT-3.5, and GPT-4 on various contextually aware multi-hop reasoning tasks.
arXiv Detail & Related papers (2024-04-22T08:47:27Z) - A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains [33.46649770312231]
Prompting language models to provide step-by-step answers is a prominent approach for complex reasoning tasks.
No fine-grained step-level datasets are available to enable thorough evaluation of such verification methods.
We introduce REVEAL: Reasoning Verification Evaluation, a dataset to benchmark automatic verifiers of complex Chain-of-Thought reasoning.
arXiv Detail & Related papers (2024-02-01T12:46:45Z) - QA-NatVer: Question Answering for Natural Logic-based Fact Verification [11.002475880349452]
We propose to use question answering to predict natural logic operators.
In a few-shot setting on FEVER, our approach outperforms the best baseline by $4.3$ accuracy points.
A human evaluation indicates that our approach produces more plausible proofs with fewer erroneous natural logic operators than previous natural logic-based systems.
arXiv Detail & Related papers (2023-10-22T06:27:31Z) - ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning [63.77667876176978]
Large language models show improved downstream task interpretability when prompted to generate step-by-step reasoning to justify their final answers.
These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness is difficult.
We present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics.
arXiv Detail & Related papers (2022-12-15T15:52:39Z) - Search Methods for Sufficient, Socially-Aligned Feature Importance Explanations with In-Distribution Counterfactuals [72.00815192668193]
Feature importance (FI) estimates are a popular form of explanation, and they are commonly created and evaluated by computing the change in model confidence caused by removing certain input features at test time.
We study several under-explored dimensions of FI-based explanations, providing conceptual and empirical improvements for this form of explanation (a simplified sketch of the removal-based estimate follows this entry).
arXiv Detail & Related papers (2021-06-01T20:36:48Z)
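To make the removal-based estimate in the last entry concrete, here is a hypothetical sketch (not code from that paper): each input word is occluded in turn and scored by the resulting drop in the classifier's confidence for its original prediction; the model name is a placeholder.

```python
# Hypothetical occlusion-style feature importance for a text classifier:
# score each word by the drop in confidence when that word is removed.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CLF_MODEL = "distilbert-base-uncased-finetuned-sst-2-english"  # placeholder classifier
tokenizer = AutoTokenizer.from_pretrained(CLF_MODEL)
model = AutoModelForSequenceClassification.from_pretrained(CLF_MODEL).eval()

def occlusion_importance(text: str) -> list[tuple[str, float]]:
    words = text.split()
    with torch.no_grad():
        base = torch.softmax(model(**tokenizer(text, return_tensors="pt")).logits, dim=-1)
        pred = int(base.argmax())            # class predicted on the full input
        base_conf = float(base[0, pred])
        scores = []
        for i, word in enumerate(words):
            ablated = " ".join(words[:i] + words[i + 1:])  # remove one word
            probs = torch.softmax(model(**tokenizer(ablated, return_tensors="pt")).logits, dim=-1)
            scores.append((word, base_conf - float(probs[0, pred])))
    return scores

print(occlusion_importance("The movie was surprisingly good despite the slow start"))
```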