GPT-4 Doesn't Know It's Wrong: An Analysis of Iterative Prompting for
Reasoning Problems
- URL: http://arxiv.org/abs/2310.12397v1
- Date: Thu, 19 Oct 2023 00:56:37 GMT
- Title: GPT-4 Doesn't Know It's Wrong: An Analysis of Iterative Prompting for
Reasoning Problems
- Authors: Kaya Stechly, Matthew Marquez, Subbarao Kambhampati
- Abstract summary: We present a principled empirical study of the performance of GPT-4 in solving graph coloring instances or verifying the correctness of candidate colorings.
We show that the observed increase in effectiveness is largely due to the correct solution being fortuitously present in the top-k completions of the prompt.
- Score: 16.284360949127723
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There has been considerable divergence of opinion on the reasoning abilities
of Large Language Models (LLMs). While the initial optimism that reasoning
might emerge automatically with scale has been tempered thanks to a slew of
counterexamples, a widespread belief in their iterative self-critique
capabilities persists. In this paper, we set out to systematically investigate
the effectiveness of iterative prompting of LLMs in the context of Graph
Coloring, a canonical NP-complete reasoning problem that is related to
propositional satisfiability as well as practical problems like scheduling and
allocation. We present a principled empirical study of the performance of GPT-4
in solving graph coloring instances or verifying the correctness of candidate
colorings. In iterative modes, we experiment with the model critiquing its own
answers and an external correct reasoner verifying proposed solutions. In both
cases, we analyze whether the content of the criticisms actually affects
bottom-line performance. The study seems to indicate that (i) LLMs are bad at
solving graph coloring instances; (ii) they are no better at verifying a
solution--and thus are not effective in iterative modes with LLMs critiquing
LLM-generated solutions; and (iii) the correctness and content of the
criticisms--whether by LLMs or external solvers--seem largely irrelevant to the
performance of iterative
prompting. We show that the observed increase in effectiveness is largely due
to the correct solution being fortuitously present in the top-k completions of
the prompt (and being recognized as such by an external verifier). Our results
thus call into question claims about the self-critiquing capabilities of
state-of-the-art LLMs.
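As a rough illustration of the setup the abstract describes, the sketch below (not the paper's code; the graph encoding, function names, and toy instance are assumptions made here) shows why an external verifier for graph coloring can be both sound and cheap, and what scanning the top-k sampled completions for a verified answer looks like: checking a candidate coloring is a polynomial-time certificate check, even though finding one is NP-complete.

```python
# Minimal sketch (assumed names and encodings, not the paper's implementation)
# of external, sound verification of candidate graph colorings and of scanning
# top-k candidates (e.g., k sampled LLM completions) for one that verifies.

def is_proper_coloring(edges, coloring, num_colors):
    """Return True iff every assigned color is in the allowed palette and
    no edge joins two vertices of the same color."""
    if any(c < 0 or c >= num_colors for c in coloring.values()):
        return False
    return all(coloring[u] != coloring[v] for u, v in edges)

def first_verified_candidate(edges, candidates, num_colors):
    """Scan candidate colorings in order and return the first one the
    external verifier accepts (index and coloring), or None if none pass.
    This mirrors the observation that gains in iterative modes largely come
    from a correct solution already sitting in the top-k completions and
    being recognized as such by a sound external verifier."""
    for idx, coloring in enumerate(candidates):
        if is_proper_coloring(edges, coloring, num_colors):
            return idx, coloring
    return None

if __name__ == "__main__":
    # Toy instance: a 4-cycle, 2-colorable.
    edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
    candidates = [
        {0: 0, 1: 0, 2: 1, 3: 1},  # improper: edge (0, 1) is monochromatic
        {0: 0, 1: 1, 2: 0, 3: 1},  # proper
    ]
    print(first_verified_candidate(edges, candidates, num_colors=2))
```

In the self-critique mode studied in the paper, the verification step above is replaced by another LLM call, which, per the abstract, removes exactly the soundness the loop relies on.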
Related papers
- Not All LLM Reasoners Are Created Equal [58.236453890457476]
We study the depth of grade-school math problem-solving capabilities of LLMs.
We evaluate their performance on pairs of existing math word problems chained together, so that solving the second problem depends on correctly answering the first.
arXiv Detail & Related papers (2024-10-02T17:01:10Z)
- Deconfounded Causality-aware Parameter-Efficient Fine-Tuning for Problem-Solving Improvement of LLMs [12.48241058167222]
Large Language Models (LLMs) have demonstrated remarkable efficiency in tackling various tasks based on human instructions.
However, studies reveal that they often struggle with tasks requiring reasoning, such as math or physics.
This raises questions about whether LLMs truly comprehend embedded knowledge or merely learn to replicate the token distribution without a true understanding of the content.
We propose Deconfounded Causal Adaptation (DCA), a novel parameter-efficient fine-tuning (PEFT) method to enhance the model's reasoning capabilities.
arXiv Detail & Related papers (2024-09-04T13:17:09Z)
- Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
- When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs [29.295135832861522]
Self-correction is an approach to improving responses from large language models (LLMs) by refining the responses using LLMs during inference.
Prior work has proposed various self-correction frameworks using different sources of feedback, including self-evaluation and external feedback.
We critically survey a broad range of papers and discuss the conditions required for successful self-correction.
arXiv Detail & Related papers (2024-06-03T13:05:46Z)
- On the Self-Verification Limitations of Large Language Models on Reasoning and Planning Tasks [17.329365493094542]
We present a principled empirical study of the performance of GPT-4 in three domains: Game of 24, Graph Coloring, and STRIPS planning.
We observe significant performance collapse with self-critique and significant performance gains with sound external verification.
arXiv Detail & Related papers (2024-02-12T23:11:01Z)
- Self-Contrast: Better Reflection Through Inconsistent Solving Perspectives [45.87069217634753]
Research indicates that, without external feedback, the intrinsic reflection of Large Language Models is unstable.
Our investigation unveils that the key bottleneck is the quality of the self-evaluated feedback.
We advocate Self-Contrast: it adaptively explores diverse solving perspectives tailored to the request, contrasts their differences, and summarizes these discrepancies into a checklist that can be used to re-examine and eliminate them.
arXiv Detail & Related papers (2024-01-04T00:32:33Z)
- The ART of LLM Refinement: Ask, Refine, and Trust [85.75059530612882]
We propose a reasoning with refinement objective called ART: Ask, Refine, and Trust.
It asks necessary questions to decide when an LLM should refine its output.
It achieves a performance gain of +5 points over self-refinement baselines.
arXiv Detail & Related papers (2023-11-14T07:26:32Z)
- A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning.
Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
arXiv Detail & Related papers (2023-11-14T07:13:10Z)
- Large Language Models Cannot Self-Correct Reasoning Yet [78.16697476530994]
Large Language Models (LLMs) have emerged as a groundbreaking technology with their unparalleled text generation capabilities.
Concerns persist regarding the accuracy and appropriateness of their generated content.
A contemporary methodology, self-correction, has been proposed as a remedy to these issues.
arXiv Detail & Related papers (2023-10-03T04:56:12Z)
- GraphReason: Enhancing Reasoning Capabilities of Large Language Models through A Graph-Based Verification Approach [0.0]
Large Language Models (LLMs) have showcased impressive reasoning capabilities.
In this paper, we introduce a novel graph-based method to further augment the reasoning capabilities of LLMs.
arXiv Detail & Related papers (2023-08-18T03:12:59Z)
- Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies [104.32199881187607]
Large language models (LLMs) have demonstrated remarkable performance across a wide array of NLP tasks.
A promising approach to rectify these flaws is self-correction, where the LLM itself is prompted or guided to fix problems in its own output.
This paper presents a comprehensive review of this emerging class of techniques.
arXiv Detail & Related papers (2023-08-06T18:38:52Z)