LLMs cannot find reasoning errors, but can correct them given the error location
- URL: http://arxiv.org/abs/2311.08516v3
- Date: Tue, 4 Jun 2024 10:25:13 GMT
- Title: LLMs cannot find reasoning errors, but can correct them given the error location
- Authors: Gladys Tyen, Hassan Mansoor, Victor Cărbune, Peter Chen, Tony Mak,
- Abstract summary: Poor self-correction performance stems from LLMs' inability to find logical mistakes, rather than their ability to correct a known mistake.
We benchmark several state-of-the-art LLMs on their mistake-finding ability and demonstrate that they generally struggle with the task.
We show that it is possible to obtain mistake location information without ground truth labels or in-domain training data.
- Score: 0.9017736137562115
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While self-correction has shown promise in improving LLM outputs in terms of style and quality (e.g. Chen et al., 2023b; Madaan et al., 2023), recent attempts to self-correct logical or reasoning errors often cause correct answers to become incorrect, resulting in worse performances overall (Huang et al., 2023). In this paper, we show that poor self-correction performance stems from LLMs' inability to find logical mistakes, rather than their ability to correct a known mistake. Firstly, we benchmark several state-of-the-art LLMs on their mistake-finding ability and demonstrate that they generally struggle with the task, even in highly objective, unambiguous cases. Secondly, we test the correction abilities of LLMs -- separately from mistake finding -- using a backtracking setup that feeds ground truth mistake location information to the model. We show that this boosts downstream task performance across our 5 reasoning tasks, indicating that LLMs' correction abilities are robust. Finally, we show that it is possible to obtain mistake location information without ground truth labels or in-domain training data. We train a small classifier with out-of-domain data, which exhibits stronger mistake-finding performance than prompting a large model. We release our dataset of LLM-generated logical mistakes, BIG-Bench Mistake, to enable further research into locating LLM reasoning mistakes.
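The backtracking setup mentioned in the abstract can be illustrated with a minimal sketch. The abstract only states that the mistake location is fed to the model, so everything below is an assumption made for illustration: the function name `backtrack`, the `generate_step` callable, the 0-based `mistake_index` convention, and the `max_steps` cap are all hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Optional

# Hypothetical interface (an assumption, not the paper's API): given the
# question and the steps kept so far, return the next reasoning step,
# or None once the trace is complete.
StepGenerator = Callable[[str, List[str]], Optional[str]]


def backtrack(
    question: str,
    steps: List[str],
    mistake_index: Optional[int],
    generate_step: StepGenerator,
    max_steps: int = 20,
) -> List[str]:
    """Keep the steps before the mistake, then regenerate from there onward."""
    if mistake_index is None:
        # No mistake was located, so the original trace is left untouched.
        return list(steps)
    if not 0 <= mistake_index < len(steps):
        raise ValueError("mistake_index must point at an existing step")

    # Discard the erroneous step and everything that followed it.
    kept = list(steps[:mistake_index])

    # Regenerate step by step until the generator signals completion
    # or the (assumed) step cap is reached.
    while len(kept) < max_steps:
        next_step = generate_step(question, kept)
        if next_step is None:
            break
        kept.append(next_step)
    return kept
```

In this sketch a trace with no located mistake is returned unchanged, which reflects the concern raised in the abstract that unguided self-correction can turn correct answers into incorrect ones.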
Related papers
- SpecTool: A Benchmark for Characterizing Errors in Tool-Use LLMs [77.79172008184415]
SpecTool is a new benchmark to identify error patterns in LLM output on tool-use tasks.
We show that even the most prominent LLMs exhibit these error patterns in their outputs.
Researchers can use the analysis and insights from SpecTool to guide their error mitigation strategies.
arXiv Detail & Related papers (2024-11-20T18:56:22Z)
- LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations [46.351064535592336]
Large language models (LLMs) often produce errors, including factual inaccuracies, biases, and reasoning failures.
Recent studies have demonstrated that LLMs' internal states encode information regarding the truthfulness of their outputs.
We show that the internal representations of LLMs encode much more information about truthfulness than previously recognized.
arXiv Detail & Related papers (2024-10-03T17:31:31Z)
- Small Language Models Need Strong Verifiers to Self-Correct Reasoning [69.94251699982388]
Self-correction has emerged as a promising solution to boost the reasoning performance of large language models (LLMs).
This work explores whether small (≤ 13B) language models (LMs) have the ability of self-correction on reasoning tasks with minimal inputs from stronger LMs.
arXiv Detail & Related papers (2024-04-26T03:41:28Z)
- Evaluating LLMs at Detecting Errors in LLM Responses [30.645694514606507]
This work introduces ReaLMistake, the first error detection benchmark consisting of objective, realistic, and diverse errors made by LLMs.
We use ReaLMistake to evaluate error detectors based on 12 Large Language Models.
arXiv Detail & Related papers (2024-04-04T17:19:47Z)
- Can LLMs Learn from Previous Mistakes? Investigating LLMs' Errors to Boost for Reasoning [34.34977150518316]
CoTErrorSet, a new benchmark with 609,432 questions, each designed with both correct and error references.
Self-rethinking prompting guides LLMs to rethink whether they have made similar previous mistakes.
Mistake tuning involves finetuning models in both correct and incorrect reasoning domains.
arXiv Detail & Related papers (2024-03-29T08:30:34Z)
- Learning to Check: Unleashing Potentials for Self-Correction in Large Language Models [5.463333911506443]
We aim to enhance the self-checking capabilities of large language models (LLMs) by constructing training data for checking tasks.
We propose a specialized checking format called "Step CoT Check".
Experiments demonstrate that fine-tuning with the "Step CoT Check" format significantly improves the self-checking and self-correction abilities of LLMs.
arXiv Detail & Related papers (2024-02-20T14:23:23Z)
- The Earth is Flat? Unveiling Factual Errors in Large Language Models [89.94270049334479]
Large Language Models (LLMs) like ChatGPT are used in various applications due to their extensive knowledge from pre-training and fine-tuning.
Despite this, they are prone to generating factual and commonsense errors, raising concerns in critical areas like healthcare, journalism, and education.
We introduce a novel, automatic testing framework, FactChecker, aimed at uncovering factual inaccuracies in LLMs.
arXiv Detail & Related papers (2024-01-01T14:02:27Z)
- Learning From Mistakes Makes LLM Better Reasoner [106.48571828587728]
Large language models (LLMs) recently exhibited remarkable reasoning capabilities on solving math problems.
This work explores whether LLMs can LEarn from MistAkes (LEMA), akin to the human learning process.
arXiv Detail & Related papers (2023-10-31T17:52:22Z)
- Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z)
- SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning [55.76083560152823]
SelfCheck is a general-purpose zero-shot verification schema for recognizing errors in step-by-step reasoning.
We test SelfCheck on three datasets (GSM8K, MathQA, and MATH) and find that it successfully recognizes errors and, in turn, increases final answer accuracies.
arXiv Detail & Related papers (2023-08-01T10:31:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.