On the Self-Verification Limitations of Large Language Models on
Reasoning and Planning Tasks
- URL: http://arxiv.org/abs/2402.08115v1
- Date: Mon, 12 Feb 2024 23:11:01 GMT
- Title: On the Self-Verification Limitations of Large Language Models on
Reasoning and Planning Tasks
- Authors: Kaya Stechly, Karthik Valmeekam, Subbarao Kambhampati
- Abstract summary: We present a principled empirical study of the performance of GPT-4 in three domains: Game of 24, Graph Coloring, and STRIPS planning.
We observe significant performance collapse with self-critique, significant performance gains with sound external verification, but that the content of critique doesn't matter to the performance of the system.
- Score: 19.476470154121188
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There has been considerable divergence of opinion on the reasoning abilities
of Large Language Models (LLMs). While the initial optimism that reasoning
might emerge automatically with scale has been tempered thanks to a slew of
counterexamples--ranging from multiplication to simple planning--there persists
a widespread belief that LLMs can self-critique and improve their own
solutions in an iterative fashion. This belief seemingly rests on the
assumption that verification of correctness should be easier than generation--a
rather classical argument from computational complexity--which should be
irrelevant to LLMs to the extent that what they are doing is approximate
retrieval. In this paper, we set out to systematically investigate the
effectiveness of iterative prompting in the context of reasoning and planning.
We present a principled empirical study of the performance of GPT-4 in three
domains: Game of 24, Graph Coloring, and STRIPS planning. We experiment both
with the model critiquing its own answers and with an external correct reasoner
verifying proposed solutions. In each case, we analyze whether the content of
criticisms actually affects bottom line performance, and whether we can ablate
elements of the augmented system without losing performance. We observe
significant performance collapse with self-critique, significant performance
gains with sound external verification, but that the content of critique
doesn't matter to the performance of the system. In fact, merely re-prompting
with a sound verifier maintains most of the benefits of more involved setups.
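As a rough illustration of the setup described in the abstract, the sketch below pairs a sound external verifier for Game of 24 with a bare re-prompting loop that passes no critique back to the generator. It is not the authors' code. The names `verify_24`, `reprompt_until_verified`, and `generate`, along with the `max_attempts` default, are hypothetical, and `generate` stands in for whatever LLM call proposes a candidate expression.

```python
import ast
from collections import Counter
from fractions import Fraction
from typing import Callable, List, Optional, Sequence

# Only the four arithmetic operators of Game of 24 are allowed.
_OPS = {
    ast.Add: lambda a, b: a + b,
    ast.Sub: lambda a, b: a - b,
    ast.Mult: lambda a, b: a * b,
    ast.Div: lambda a, b: a / b,
}


def _evaluate(node: ast.AST, used: List[int]) -> Fraction:
    """Evaluate an expression tree exactly, recording every integer it uses."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")


def verify_24(numbers: Sequence[int], expression: str) -> bool:
    """Sound check: True only if the expression uses each given number
    exactly once and evaluates to exactly 24."""
    try:
        tree = ast.parse(expression, mode="eval")
        used: List[int] = []
        value = _evaluate(tree.body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    return Counter(used) == Counter(numbers) and value == 24


def reprompt_until_verified(
    numbers: Sequence[int],
    generate: Callable[[Sequence[int], int], str],  # hypothetical LLM stand-in
    max_attempts: int = 15,  # arbitrary budget chosen for this sketch
) -> Optional[str]:
    """Ask for a new candidate until the verifier accepts one.
    No critique content is passed back to the generator."""
    for attempt in range(1, max_attempts + 1):
        candidate = generate(numbers, attempt)
        if verify_24(numbers, candidate):
            return candidate
    return None
```

Exact rational arithmetic via `Fraction` keeps the check sound for candidates such as `8/(3-8/3)`, where floating-point rounding could otherwise reject a correct answer. Because the loop discards everything except the verifier's accept or reject decision, it corresponds to the "merely re-prompting" condition mentioned in the abstract rather than to the critique-based setups.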
Related papers
- Reflective Instruction Tuning: Mitigating Hallucinations in Large Vision-Language Models [36.119299938503936]
Large vision-language models (LVLMs) have shown promising performance on a variety of vision-language tasks.
They remain susceptible to hallucinations, generating outputs misaligned with visual content or instructions.
We propose reflective instruction tuning, which integrates rationale learning into visual instruction tuning.
arXiv Detail & Related papers (2024-07-16T06:32:45Z)
- Small Language Models Need Strong Verifiers to Self-Correct Reasoning [69.94251699982388]
Self-correction has emerged as a promising solution to boost the reasoning performance of large language models (LLMs).
This work explores whether small (≤ 13B) language models (LMs) have the ability of self-correction on reasoning tasks with minimal inputs from stronger LMs.
arXiv Detail & Related papers (2024-04-26T03:41:28Z)
- Distilling Reasoning Ability from Large Language Models with Adaptive Thinking [54.047761094420174]
Chain of thought finetuning (cot-finetuning) aims to endow small language models (SLM) with reasoning ability to improve their performance towards specific tasks.
Most existing cot-finetuning methods adopt a pre-thinking mechanism, allowing the SLM to generate a rationale before providing an answer.
This mechanism enables SLM to analyze and think about complex questions, but it also makes answer correctness highly sensitive to minor errors in rationale.
We propose a robust post-thinking mechanism to generate answers before rationale.
arXiv Detail & Related papers (2024-04-14T07:19:27Z)
- Learning From Correctness Without Prompting Makes LLM Efficient Reasoner [30.203952806009717]
Large language models (LLMs) have demonstrated outstanding performance across various tasks, yet they still exhibit limitations such as hallucination, unfaithful reasoning, and toxic content.
We introduce an intrinsic self-correct reasoning framework for LLMs that eliminates the need for human feedback, external tools, and handcraft prompts.
arXiv Detail & Related papers (2024-03-28T02:12:49Z)
- The ART of LLM Refinement: Ask, Refine, and Trust [85.75059530612882]
We propose a reasoning with refinement objective called ART: Ask, Refine, and Trust.
It asks necessary questions to decide when an LLM should refine its output.
It achieves a performance gain of +5 points over self-refinement baselines.
arXiv Detail & Related papers (2023-11-14T07:26:32Z)
- A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning.
Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
arXiv Detail & Related papers (2023-11-14T07:13:10Z)
- Sentiment Analysis through LLM Negotiations [58.67939611291001]
A standard paradigm for sentiment analysis is to rely on a single LLM and make the decision in a single round.
This paper introduces a multi-LLM negotiation framework for sentiment analysis.
arXiv Detail & Related papers (2023-11-03T12:35:29Z)
- GPT-4 Doesn't Know It's Wrong: An Analysis of Iterative Prompting for Reasoning Problems [16.284360949127723]
We present a principled empirical study of the performance of GPT4 in solving graph coloring instances or verifying the correctness of candidate colorings.
We show that the observed increase in effectiveness is largely due to the correct solution being fortuitously present in the top-k completions of the prompt.
arXiv Detail & Related papers (2023-10-19T00:56:37Z)
- Can Large Language Models Really Improve by Self-critiquing Their Own Plans? [19.476470154121188]
We investigate the verification/self-critiquing abilities of large language models in the context of planning.
Using GPT-4, a state-of-the-art LLM, for both generation and verification, our findings reveal that self-critiquing appears to diminish plan generation performance.
arXiv Detail & Related papers (2023-10-12T08:22:37Z)
- Concise and Organized Perception Facilitates Reasoning in Large Language Models [32.71672086718057]
We show that large language models (LLMs) exhibit failure patterns akin to human-like cognitive biases when dealing with disordered and irrelevant content in reasoning tasks.
We propose a novel reasoning approach named Concise and Organized Perception (COP).
COP carefully analyzes the given statements to identify the most pertinent information while eliminating redundancy efficiently.
arXiv Detail & Related papers (2023-10-05T04:47:49Z)
- Large Language Models Cannot Self-Correct Reasoning Yet [78.16697476530994]
Large Language Models (LLMs) have emerged as a groundbreaking technology with their unparalleled text generation capabilities.
Concerns persist regarding the accuracy and appropriateness of their generated content.
A contemporary methodology, self-correction, has been proposed as a remedy to these issues.
arXiv Detail & Related papers (2023-10-03T04:56:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.