A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains
- URL: http://arxiv.org/abs/2402.00559v4
- Date: Tue, 21 May 2024 09:55:07 GMT
- Title: A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains
- Authors: Alon Jacovi, Yonatan Bitton, Bernd Bohnet, Jonathan Herzig, Or Honovich, Michael Tseng, Michael Collins, Roee Aharoni, Mor Geva
- Abstract summary: Prompting language models to provide step-by-step answers is a prominent approach for complex reasoning tasks.
However, no fine-grained step-level datasets are available to enable thorough evaluation of methods that verify such reasoning.
We introduce REVEAL: Reasoning Verification Evaluation, a dataset to benchmark automatic verifiers of complex Chain-of-Thought reasoning.
- Score: 33.46649770312231
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Prompting language models to provide step-by-step answers (e.g., "Chain-of-Thought") is the prominent approach for complex reasoning tasks, where more accurate reasoning chains typically improve downstream task performance. Recent literature discusses automatic methods to verify reasoning to evaluate and improve their correctness. However, no fine-grained step-level datasets are available to enable thorough evaluation of such verification methods, hindering progress in this direction. We introduce REVEAL: Reasoning Verification Evaluation, a dataset to benchmark automatic verifiers of complex Chain-of-Thought reasoning in open-domain question-answering settings. REVEAL includes comprehensive labels for the relevance, attribution to evidence passages, and logical correctness of each reasoning step in a language model's answer, across a variety of datasets and state-of-the-art language models. Evaluation on REVEAL shows that verifiers struggle at verifying reasoning chains - in particular, verifying logical correctness and detecting contradictions. Available at https://reveal-dataset.github.io/ .
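To make the benchmark's step-level structure concrete, here is a minimal sketch of scoring a verifier against REVEAL-style labels. The schema and field names below are illustrative assumptions, not the dataset's actual format.

```python
# Minimal sketch of scoring a step-level verifier against REVEAL-style labels.
# Field names (relevance, attribution, logic) are illustrative assumptions,
# not the dataset's actual schema.
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    text: str
    relevance: bool     # gold: is this step relevant to the question?
    attribution: bool   # gold: is it supported by the evidence passages?
    logic: bool         # gold: does it follow logically from prior steps?

def step_accuracy(steps: List[Step], verifier) -> float:
    """Fraction of steps where the verifier's correctness call matches gold.

    A step is gold-correct only if all three labels hold, mirroring the
    'weakest link' view: one bad step breaks the chain.
    """
    hits = 0
    for i, step in enumerate(steps):
        gold = step.relevance and step.attribution and step.logic
        pred = verifier(steps[:i], step.text)  # verifier sees prior steps + current
        hits += int(pred == gold)
    return hits / len(steps) if steps else 0.0
```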
Related papers
- LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback [71.95402654982095]
We propose Math-Minos, a natural language feedback-enhanced verifier.
Our experiments reveal that even a small amount of natural language feedback can significantly boost the performance of the verifier.
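As a rough illustration of the feedback-then-verify idea (the prompts and the `generate` interface below are assumptions, not the paper's actual setup):

```python
# Sketch of a feedback-enhanced verification step in the spirit of Math-Minos:
# first ask a model for natural-language feedback on a solution, then condition
# the verifier's correct/incorrect call on that feedback. `generate` is an
# assumed text-generation callable, not the paper's actual interface.
def verify_with_feedback(problem, solution, generate):
    feedback = generate(
        f"Problem: {problem}\nSolution: {solution}\n"
        "Point out any calculation or reasoning errors, step by step:"
    )
    verdict = generate(
        f"Problem: {problem}\nSolution: {solution}\nFeedback: {feedback}\n"
        "Given the feedback, is the solution correct? Answer yes or no:"
    )
    return verdict.strip().lower().startswith("yes"), feedback
```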
arXiv Detail & Related papers (2024-06-20T06:42:27Z)
- Bisimulation Learning [55.859538562698496]
We compute finite bisimulations of state transition systems with large, possibly infinite state spaces.
Our technique yields faster verification results than alternative state-of-the-art tools in practice.
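For intuition, a textbook partition-refinement baseline for finite systems is sketched below; the paper's learning-based technique, which targets large or infinite state spaces, works quite differently.

```python
# A textbook partition-refinement sketch for computing bisimulation classes of
# a *finite* labelled transition system. This naive baseline is only for
# intuition; it is not the paper's learning-based method.
def bisimulation_classes(states, transitions):
    """states: iterable of hashable states.
    transitions: dict state -> set of (action, successor) pairs."""
    # Start with one block containing all states; split until stable.
    block_of = {s: 0 for s in states}
    changed = True
    while changed:
        changed = False
        signatures = {}
        for s in states:
            # A state's signature: its block plus the set of
            # (action, successor-block) pairs it can reach.
            # Bisimilar states must share signatures.
            sig = (block_of[s],
                   frozenset((a, block_of[t]) for a, t in transitions.get(s, ())))
            signatures[s] = sig
        # Re-number blocks by signature; identical signatures share a block.
        renumber = {}
        for s in states:
            renumber.setdefault(signatures[s], len(renumber))
        new_blocks = {s: renumber[signatures[s]] for s in states}
        if new_blocks != block_of:
            block_of, changed = new_blocks, True
    return block_of
```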
arXiv Detail & Related papers (2024-05-24T17:11:27Z)
- A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning.
Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
arXiv Detail & Related papers (2023-11-14T07:13:10Z)
- QA-NatVer: Question Answering for Natural Logic-based Fact Verification [11.002475880349452]
We propose to use question answering to predict natural logic operators.
In a few-shot setting on FEVER, our approach outperforms the best baseline by $4.3$ accuracy points.
A human evaluation indicates that our approach produces more plausible proofs with fewer erroneous natural logic operators than previous natural logic-based systems.
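A rough sketch of casting operator prediction as question answering; the question templates and operator names below are illustrative assumptions, not the paper's actual prompts.

```python
# Illustrative sketch of predicting natural-logic operators via question
# answering, in the spirit of QA-NatVer. Templates and operator set are
# assumptions for illustration only.
NATLOG_QUESTIONS = [
    ("Is '{c}' a paraphrase of '{e}'?", "EQUIVALENCE"),
    ("Is '{c}' a more specific version of '{e}'?", "FORWARD_ENTAILMENT"),
    ("Is '{c}' a more general version of '{e}'?", "REVERSE_ENTAILMENT"),
    ("Does '{c}' contradict '{e}'?", "NEGATION"),
]

def predict_operator(claim_span: str, evidence_span: str, answer_yes_no) -> str:
    """answer_yes_no: callable (question: str) -> bool, e.g. a few-shot QA model.
    Returns the first operator whose question gets a 'yes', falling back to
    INDEPENDENCE when no question does."""
    for template, operator in NATLOG_QUESTIONS:
        if answer_yes_no(template.format(c=claim_span, e=evidence_span)):
            return operator
    return "INDEPENDENCE"
```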
arXiv Detail & Related papers (2023-10-22T06:27:31Z)
- ReCEval: Evaluating Reasoning Chains via Correctness and Informativeness [67.49087159888298]
ReCEval is a framework that evaluates reasoning chains via two key properties: correctness and informativeness.
We show that ReCEval effectively identifies various error types and yields notable improvements compared to prior methods.
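A minimal sketch of the two-property idea, under simplifying assumptions: correctness is approximated with an entailment scorer and informativeness with token novelty. The paper's actual measures (e.g., information-gain based) are more involved.

```python
# Sketch of ReCEval-style step scoring. `entails` stands in for any NLI-style
# scorer; token novelty is a crude stand-in for the paper's informativeness
# measure.
def score_chain(question, steps, entails):
    """entails: callable (premise: str, hypothesis: str) -> float in [0, 1]."""
    results = []
    seen_tokens = set(question.lower().split())
    for i, step in enumerate(steps):
        premise = " ".join([question] + steps[:i])
        correctness = entails(premise, step)  # does the step follow from context so far?
        tokens = set(step.lower().split())
        new = tokens - seen_tokens
        informativeness = len(new) / max(len(tokens), 1)  # fraction of new content
        seen_tokens |= tokens
        results.append({"step": step, "correctness": correctness,
                        "informativeness": informativeness})
    return results
```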
arXiv Detail & Related papers (2023-04-21T02:19:06Z)
- LAMBADA: Backward Chaining for Automated Reasoning in Natural Language [11.096348678079574]
A backward-chaining algorithm, LAMBADA, decomposes reasoning into four sub-modules.
We show that LAMBADA achieves sizable accuracy boosts over state-of-the-art forward reasoning methods.
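A generic backward-chaining sketch for intuition: prove a goal by matching a known fact or by selecting a rule whose conclusion matches, then recursively proving its premises. LAMBADA implements each step with LLM modules; plain string matching stands in for them here.

```python
# Generic backward chaining: fact check, rule selection, goal decomposition.
# Exact string matching is a stand-in for LAMBADA's LLM-based sub-modules.
def prove(goal, facts, rules, depth=4):
    """facts: set of statements; rules: list of (premises, conclusion) pairs."""
    if depth == 0:
        return False
    if goal in facts:                      # fact check
        return True
    for premises, conclusion in rules:     # rule selection
        if conclusion == goal:
            # goal decomposition: every premise becomes a sub-goal
            if all(prove(p, facts, rules, depth - 1) for p in premises):
                return True
    return False

# Tiny usage example.
facts = {"socrates is a man"}
rules = [(["socrates is a man"], "socrates is mortal")]
assert prove("socrates is mortal", facts, rules)
```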
arXiv Detail & Related papers (2022-12-20T18:06:03Z)
- WeCheck: Strong Factual Consistency Checker via Weakly Supervised Learning [40.5830891229718]
We propose a weakly supervised framework that aggregates multiple resources to train a precise and efficient factual metric, namely WeCheck.
Comprehensive experiments on a variety of tasks demonstrate the strong performance of WeCheck, which achieves a 3.4% absolute improvement over previous state-of-the-art methods on the TRUE benchmark on average.
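A sketch of the weak-supervision idea: aggregate noisy judgments from several existing factuality metrics into pseudo-labels, then train a single checker on them. Thresholded voting below stands in for the paper's probabilistic label model.

```python
# Weak-supervision sketch: pseudo-label examples by agreement among existing
# factuality metrics; a checker would then be trained on the output. Simple
# unanimity voting is a stand-in for the paper's label model.
def pseudo_labels(samples, metrics, threshold=0.5):
    """samples: list of (source, summary) pairs.
    metrics: list of callables (source, summary) -> float in [0, 1]."""
    labeled = []
    for source, summary in samples:
        votes = [m(source, summary) >= threshold for m in metrics]
        if all(votes):
            labeled.append((source, summary, 1))   # confidently consistent
        elif not any(votes):
            labeled.append((source, summary, 0))   # confidently inconsistent
        # disagreements are dropped rather than guessed
    return labeled
```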
arXiv Detail & Related papers (2022-12-20T08:04:36Z)
- ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning [63.77667876176978]
Large language models show improved downstream task interpretability when prompted to generate step-by-step reasoning to justify their final answers.
These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness is difficult.
We present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics.
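One representative score from this family, sketched under assumptions: a faithfulness-style alignment of each reasoning step to the source context via embedding similarity, with `embed` standing in for any sentence-embedding function. The full suite covers many more scores.

```python
# Sketch of a ROSCOE-style unsupervised alignment score: each reasoning step is
# matched to its best-aligned context sentence by cosine similarity. `embed` is
# an assumed sentence-embedding callable returning a vector of floats.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def faithfulness(steps, context_sentences, embed):
    """Mean over steps of the best alignment to any context sentence."""
    ctx_vecs = [embed(s) for s in context_sentences]
    scores = []
    for step in steps:
        v = embed(step)
        scores.append(max(cosine(v, c) for c in ctx_vecs))
    return sum(scores) / len(scores) if scores else 0.0
```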
arXiv Detail & Related papers (2022-12-15T15:52:39Z)
- Reasoning over Hybrid Chain for Table-and-Text Open Domain QA [69.8436986668218]
We propose a ChAin-centric Reasoning and Pre-training framework (CARP).
CARP utilizes a hybrid chain to model the explicit intermediate reasoning process across table and text for question answering.
We also propose a novel chain-centric pre-training method to enhance the pre-trained model's ability to identify the cross-modality reasoning process.
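An illustrative data structure for such a hybrid chain, an ordered path of evidence nodes drawn from table cells or text passages; field names are assumptions for illustration, not the paper's format.

```python
# Illustrative "hybrid chain": an explicit reasoning path whose nodes come from
# either a table cell or a text passage. Field names are assumed.
from dataclasses import dataclass
from typing import List, Literal

@dataclass
class ChainNode:
    modality: Literal["table", "text"]
    content: str          # cell value or text span
    source: str           # e.g. "table:row 2" or a passage id

@dataclass
class HybridChain:
    question: str
    nodes: List[ChainNode]  # explicit intermediate reasoning path
    answer: str

chain = HybridChain(
    question="Which year did the team with the most wins relocate?",
    nodes=[
        ChainNode("table", "Oilers, 57 wins", "table:row 2"),
        ChainNode("text", "The Oilers relocated in 1997.", "passage 4"),
    ],
    answer="1997",
)
```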
arXiv Detail & Related papers (2022-01-15T16:11:55Z)