GPT-4 Doesn't Know It's Wrong: An Analysis of Iterative Prompting for
  Reasoning Problems
- URL: http://arxiv.org/abs/2310.12397v1
- Date: Thu, 19 Oct 2023 00:56:37 GMT
- Title: GPT-4 Doesn't Know It's Wrong: An Analysis of Iterative Prompting for
  Reasoning Problems
- Authors: Kaya Stechly, Matthew Marquez, Subbarao Kambhampati
- Abstract summary: We present a principled empirical study of the performance of GPT4 in solving graph coloring instances or verifying the correctness of candidate colorings.
We show that the observed increase in effectiveness is largely due to the correct solution being fortuitously present in the top-k completions of the prompt.
- Score: 16.284360949127723
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There has been considerable divergence of opinion on the reasoning abilities
of Large Language Models (LLMs). While the initial optimism that reasoning
might emerge automatically with scale has been tempered thanks to a slew of
counterexamples, a widespread belief in their iterative self-critique
capabilities persists. In this paper, we set out to systematically investigate
the effectiveness of iterative prompting of LLMs in the context of Graph
Coloring, a canonical NP-complete reasoning problem that is related to
propositional satisfiability as well as practical problems like scheduling and
allocation. We present a principled empirical study of the performance of GPT4
in solving graph coloring instances or verifying the correctness of candidate
colorings. In iterative modes, we experiment with the model critiquing its own
answers and an external correct reasoner verifying proposed solutions. In both
cases, we analyze whether the content of the criticisms actually affects bottom
line performance. The study seems to indicate that (i) LLMs are bad at solving
graph coloring instances (ii) they are no better at verifying a solution--and
thus are not effective in iterative modes with LLMs critiquing LLM-generated
solutions (iii) the correctness and content of the criticisms--whether by LLMs
or external solvers--seems largely irrelevant to the performance of iterative
prompting. We show that the observed increase in effectiveness is largely due
to the correct solution being fortuitously present in the top-k completions of
the prompt (and being recognized as such by an external verifier). Our results
thus call into question claims about the self-critiquing capabilities of
state-of-the-art LLMs.
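
The abstract contrasts an LLM critiquing its own answers with an external, correct reasoner verifying them, and attributes the gains to a correct coloring happening to appear among the top-k completions. The following is a minimal sketch of that external-verifier step, not the authors' code. The function names (`coloring_violations`, `is_valid_coloring`, `first_verified`), the edge-list graph representation, and the hard-coded candidate list standing in for sampled completions are all assumptions made for illustration.

```python
from typing import Dict, Hashable, List, Optional, Sequence, Tuple

Edge = Tuple[Hashable, Hashable]
Coloring = Dict[Hashable, str]


def coloring_violations(edges: Sequence[Edge], coloring: Coloring) -> List[Edge]:
    """Return every edge whose endpoints are uncolored or share a color."""
    bad = []
    for u, v in edges:
        if u not in coloring or v not in coloring or coloring[u] == coloring[v]:
            bad.append((u, v))
    return bad


def is_valid_coloring(edges: Sequence[Edge], coloring: Coloring, max_colors: int) -> bool:
    """Sound check: at most max_colors distinct colors and no violated edge."""
    if len(set(coloring.values())) > max_colors:
        return False
    return not coloring_violations(edges, coloring)


def first_verified(
    edges: Sequence[Edge], candidates: Sequence[Coloring], max_colors: int
) -> Optional[Coloring]:
    """Return the first candidate the verifier accepts, or None.

    The candidates stand in for k sampled completions (an assumption);
    no critique is fed back, the verifier only filters.
    """
    for candidate in candidates:
        if is_valid_coloring(edges, candidate, max_colors):
            return candidate
    return None


if __name__ == "__main__":
    # Hypothetical instance: a triangle (0, 1, 2) with a tail vertex 3.
    graph = [(0, 1), (1, 2), (0, 2), (2, 3)]
    candidates = [
        {0: "red", 1: "red", 2: "blue", 3: "green"},   # edge (0, 1) clashes
        {0: "red", 1: "green", 2: "blue", 3: "red"},   # valid 3-coloring
    ]
    print(first_verified(graph, candidates, max_colors=3))
```

Verification here is a single pass over the edges, so a sound checker is cheap to write even though finding a coloring is NP-complete. In this sketch the selection never uses the content of any criticism; it only filters candidates.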
 
      
Related papers
- Answer-Centric or Reasoning-Driven? Uncovering the Latent Memory Anchor in LLMs [28.556628696390767]
 Large Language Models (LLMs) demonstrate impressive reasoning capabilities.
Evidence suggests much of their success stems from memorized answer-reasoning patterns rather than genuine inference.
We propose a five-level answer-visibility prompt framework that systematically manipulates answer cues and probes model behavior through indirect, behavioral analysis.
 arXiv  Detail & Related papers  (2025-06-21T08:15:45Z)
- No Need for Explanations: LLMs can implicitly learn from mistakes in-context [14.508050809497847]
 We study why Large Language Models learn from mistakes more effectively without explicit corrective rationales.
We find evidence that, while incorrect answers are more beneficial for LLM learning, explicit corrective rationales over-constrain the model.
 arXiv  Detail & Related papers  (2025-02-12T16:31:21Z)
- Critical-Questions-of-Thought: Steering LLM reasoning with Argumentative Querying [0.3659498819753633]
 State-of-the-art Large Language Models (LLMs) continue to struggle when performing logical and mathematical reasoning.
This paper makes use of the notion of critical questions from the literature on argumentation theory, focusing in particular on Toulmin's model of argumentation.
We show that employing these critical questions can improve the reasoning capabilities of LLMs.
 arXiv  Detail & Related papers  (2024-12-19T18:51:30Z)
- Not All LLM Reasoners Are Created Equal [58.236453890457476]
 We study the depth of grade-school math problem-solving capabilities of LLMs.
We evaluate their performance on pairs of existing math word problems presented together.
 arXiv  Detail & Related papers  (2024-10-02T17:01:10Z)
- Deconfounded Causality-aware Parameter-Efficient Fine-Tuning for Problem-Solving Improvement of LLMs [12.48241058167222]
 Large Language Models (LLMs) have demonstrated remarkable efficiency in tackling various tasks based on human instructions.
But studies reveal that they often struggle with tasks requiring reasoning, such as math or physics problems.
This raises questions about whether LLMs truly comprehend embedded knowledge or merely learn to replicate the token distribution without a true understanding of the content.
We propose Deconfounded Causal Adaptation (DCA), a novel parameter-efficient fine-tuning (PEFT) method to enhance the model's reasoning capabilities.
 arXiv  Detail & Related papers  (2024-09-04T13:17:09Z)
- Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
 We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
 arXiv  Detail & Related papers  (2024-06-28T20:06:30Z)
- When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs [29.295135832861522]
 Self-correction is an approach to improving responses from large language models (LLMs) by refining the responses using LLMs during inference.
Prior work has proposed various self-correction frameworks using different sources of feedback, including self-evaluation and external feedback.
We critically survey a broad range of papers and discuss the conditions required for successful self-correction.
 arXiv  Detail & Related papers  (2024-06-03T13:05:46Z)
- On the Self-Verification Limitations of Large Language Models on Reasoning and Planning Tasks [17.329365493094542]
 We present a principled empirical study of the performance of GPT-4 in three domains: Game of 24, Graph Coloring, and STRIPS planning.
We observe significant performance collapse with self-critique and significant performance gains with sound external verification.
 arXiv  Detail & Related papers  (2024-02-12T23:11:01Z)
- Self-Contrast: Better Reflection Through Inconsistent Solving Perspectives [45.87069217634753]
 Research indicates that, without external feedback, a Large Language Model's intrinsic reflection is unstable.
Our investigation unveils that the key bottleneck is the quality of the self-evaluated feedback.
We advocate Self-Contrast: It adaptively explores diverse solving perspectives tailored to the request, contrasts the differences, and summarizes these discrepancies into a checklist which could be used to re-examine and eliminate discrepancies.
 arXiv  Detail & Related papers  (2024-01-04T00:32:33Z)
- The ART of LLM Refinement: Ask, Refine, and Trust [85.75059530612882]
 We propose a reasoning with refinement objective called ART: Ask, Refine, and Trust.
It asks necessary questions to decide when an LLM should refine its output.
It achieves a performance gain of +5 points over self-refinement baselines.
 arXiv  Detail & Related papers  (2023-11-14T07:26:32Z)
- A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
 We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning.
Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
 arXiv  Detail & Related papers  (2023-11-14T07:13:10Z)
- Large Language Models Cannot Self-Correct Reasoning Yet [78.16697476530994]
 Large Language Models (LLMs) have emerged as a groundbreaking technology with their unparalleled text generation capabilities.
Concerns persist regarding the accuracy and appropriateness of their generated content.
A contemporary methodology, self-correction, has been proposed as a remedy to these issues.
 arXiv  Detail & Related papers  (2023-10-03T04:56:12Z)
- GraphReason: Enhancing Reasoning Capabilities of Large Language Models through A Graph-Based Verification Approach [0.0]
 Large Language Models (LLMs) have showcased impressive reasoning capabilities.
In this paper, we introduce a novel graph-based method to further augment the reasoning capabilities of LLMs.
 arXiv  Detail & Related papers  (2023-08-18T03:12:59Z)
- Automatically Correcting Large Language Models: Surveying the landscape of diverse self-correction strategies [104.32199881187607]
 Large language models (LLMs) have demonstrated remarkable performance across a wide array of NLP tasks.
A promising approach to rectifying their flaws is self-correction, where the LLM itself is prompted or guided to fix problems in its own output.
This paper presents a comprehensive review of this emerging class of techniques.
 arXiv  Detail & Related papers  (2023-08-06T18:38:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     