Can ChatGPT Defend its Belief in Truth? Evaluating LLM Reasoning via
Debate
- URL: http://arxiv.org/abs/2305.13160v2
- Date: Tue, 10 Oct 2023 17:34:15 GMT
- Title: Can ChatGPT Defend its Belief in Truth? Evaluating LLM Reasoning via
Debate
- Authors: Boshi Wang, Xiang Yue, Huan Sun
- Abstract summary: Large language models (LLMs) have shown impressive performance in complex reasoning tasks.
This work explores testing LLMs' reasoning by engaging with them in a debate-like conversation.
We find that despite their impressive performance, LLMs like ChatGPT cannot maintain their beliefs in truth for a significant portion of examples.
- Score: 19.887103433032774
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) such as ChatGPT and GPT-4 have shown impressive
performance in complex reasoning tasks. However, it is difficult to know
whether the models are reasoning based on deep understandings of truth and
logic, or leveraging their memorized patterns in a relatively superficial way.
In this work, we explore testing LLMs' reasoning by engaging with them in a
debate-like conversation, where given a question, the LLM and the user need to
discuss to make the correct decision starting from opposing arguments. Upon
mitigating the Clever Hans effect, our task requires the LLM to not only
achieve the correct answer on its own, but also be able to hold and defend its
belief instead of blindly believing or getting misled by the user's (invalid)
arguments and critiques, thus testing in greater depth whether the LLM grasps
the essence of the reasoning required to solve the problem. Across a range of
complex reasoning benchmarks spanning math, commonsense, logic and BIG-Bench
tasks, we find that despite their impressive performance as reported in
existing work on generating correct step-by-step solutions in the beginning,
LLMs like ChatGPT cannot maintain their beliefs in truth for a significant
portion of examples when challenged by oftentimes absurdly invalid arguments.
Our work points to danger zones of model alignment, and also suggests more
careful treatments and interpretations of the recent findings that LLMs can
improve their responses based on feedback.
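The debate setup described above amounts to a short evaluation loop: the model first answers on its own, the user then pushes back over several turns with an invalid counter-argument, and the model's final answer is checked against the ground truth. The sketch below is a minimal illustration of that loop, not the authors' released code; the `ask_model` chat callable is a generic assumption, and the substring check stands in for a real answer-extraction step.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def debate_eval(
    ask_model: Callable[[List[Message]], str],  # hypothetical chat backend: messages in, reply out
    question: str,
    correct_answer: str,
    invalid_argument: str,
    rounds: int = 3,
) -> bool:
    """Return True if the model reaches the correct answer and still holds it after debate."""
    history: List[Message] = [
        {"role": "user", "content": f"{question}\nLet's think step by step, then give the final answer."}
    ]
    first = ask_model(history)
    history.append({"role": "assistant", "content": first})

    # The model must first solve the problem on its own.
    if correct_answer not in first:
        return False

    # The user then repeatedly challenges it with an (often absurdly) invalid argument.
    for _ in range(rounds):
        history.append({"role": "user", "content": invalid_argument})
        reply = ask_model(history)
        history.append({"role": "assistant", "content": reply})

    # Finally, ask for a verdict and check whether the belief in the truth survived.
    history.append({"role": "user", "content": "Given our discussion, what is your final answer?"})
    final = ask_model(history)
    return correct_answer in final
```

Run over a benchmark with a real chat backend plugged in as `ask_model`, the fraction of initially correct examples for which this returns True gives a rough analogue of the belief-retention behavior the paper measures.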
Related papers
- LLM The Genius Paradox: A Linguistic and Math Expert's Struggle with Simple Word-based Counting Problems [28.72485319617863]
LLMs struggle with some basic tasks that humans find trivial to handle, e.g., counting the number of character r's in the word strawberry.
We measure transferability of advanced mathematical and coding reasoning capabilities from specialized LLMs to simple counting tasks.
Compared with strategies such as finetuning and in-context learning, we show that engaging reasoning is the most robust and efficient way to help LLMs better perceive tasks.
arXiv Detail & Related papers (2024-10-18T04:17:16Z) - Not All LLM Reasoners Are Created Equal [58.236453890457476]
We study the depth of grade-school math problem-solving capabilities of LLMs.
We evaluate their performance on pairs of existing math word problems chained together, so that the answer to the second problem depends on correctly answering the first.
arXiv Detail & Related papers (2024-10-02T17:01:10Z) - Order Matters in Hallucination: Reasoning Order as Benchmark and Reflexive Prompting for Large-Language-Models [0.0]
Large language models (LLMs) have generated significant attention since their inception, finding applications across various academic and industrial domains.
LLMs often suffer from the "hallucination problem", where outputs, though grammatically and logically coherent, lack factual accuracy or are entirely fabricated.
arXiv Detail & Related papers (2024-08-09T14:34:32Z) - LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models [52.03659714625452]
Recently developed large language models (LLMs) have been shown to perform remarkably well on a wide range of language understanding tasks.
But, can they really "reason" over the natural language?
This question has been receiving significant research attention, and many reasoning skills, such as commonsense, numerical, and qualitative reasoning, have been studied.
arXiv Detail & Related papers (2024-04-23T21:08:49Z) - Reason from Fallacy: Enhancing Large Language Models' Logical Reasoning through Logical Fallacy Understanding [40.2816930342597]
Large Language Models (LLMs) have demonstrated good performance in many reasoning tasks.
But they still struggle with some complicated reasoning tasks including logical reasoning.
We propose five concrete tasks from three cognitive dimensions of WHAT, WHY, and HOW in this paper.
arXiv Detail & Related papers (2024-04-04T08:38:03Z) - Meaningful Learning: Enhancing Abstract Reasoning in Large Language Models via Generic Fact Guidance [38.49506722997423]
Large language models (LLMs) have demonstrated impressive performance and strong explainability across various reasoning scenarios.
However, LLMs often struggle to abstract generic facts and apply them to provide consistent and precise answers.
This has sparked a vigorous debate about whether LLMs are genuinely reasoning or merely memorizing.
arXiv Detail & Related papers (2024-03-14T04:06:13Z) - Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs [52.42505579545893]
Large language models (LLMs) demonstrate strong reasoning abilities when prompted to generate chain-of-thought explanations alongside answers.
We propose a novel discriminative and generative CoT evaluation paradigm to assess LLMs' knowledge of reasoning and the accuracy of the generated CoT.
arXiv Detail & Related papers (2024-02-17T05:22:56Z) - The ART of LLM Refinement: Ask, Refine, and Trust [85.75059530612882]
We propose a reasoning with refinement objective called ART: Ask, Refine, and Trust.
It asks necessary questions to decide when an LLM should refine its output.
It achieves a performance gain of +5 points over self-refinement baselines.
arXiv Detail & Related papers (2023-11-14T07:26:32Z) - Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate [85.3444184685235]
We propose a Multi-Agent Debate (MAD) framework, in which multiple agents argue back and forth in a "tit for tat" manner and a judge manages the debate process to obtain a final solution.
Our framework encourages divergent thinking in LLMs, which is helpful for tasks that require deep levels of contemplation; a minimal sketch of this agents-plus-judge loop appears after this list.
arXiv Detail & Related papers (2023-05-30T15:25:45Z) - Large Language Models are Better Reasoners with Self-Verification [48.534270563880845]
Large language models (LLMs) have shown strong reasoning ability in several natural language processing tasks.
LLMs with chain of thought (CoT) prompting require multi-step prompting and multi-token prediction, which is highly sensitive to individual mistakes.
We propose and prove that LLMs also have similar self-verification abilities.
arXiv Detail & Related papers (2022-12-19T15:51:52Z)
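For contrast with the single-model debate evaluated in the main paper, the Multi-Agent Debate (MAD) entry above describes an agents-plus-judge architecture. The sketch below is one plausible reading of that loop; the role prompts, the CONTINUE convention, and the `ask_model` callable are illustrative assumptions, not the MAD authors' actual prompts or code.

```python
from typing import Callable, List

AskFn = Callable[[str], str]  # hypothetical backend: single prompt in, single completion out


def multi_agent_debate(ask_model: AskFn, question: str, max_rounds: int = 3) -> str:
    """Run an affirmative/negative debate and let a judge decide the final answer."""
    transcript: List[str] = []

    opening = ask_model(f"Question: {question}\nGive your answer with a short justification.")
    transcript.append(f"Affirmative: {opening}")

    for _ in range(max_rounds):
        context = f"Question: {question}\nDebate so far:\n" + "\n".join(transcript)

        # The negative side must disagree with the latest argument ("tit for tat").
        rebuttal = ask_model(context + "\nYou are the negative side: disagree with the last argument and explain why.")
        transcript.append(f"Negative: {rebuttal}")

        context = f"Question: {question}\nDebate so far:\n" + "\n".join(transcript)
        defense = ask_model(context + "\nYou are the affirmative side: respond to the rebuttal.")
        transcript.append(f"Affirmative: {defense}")

        # The judge manages the process and decides whether a solution has emerged.
        context = f"Question: {question}\nDebate so far:\n" + "\n".join(transcript)
        verdict = ask_model(context + "\nYou are the judge. If a correct answer has emerged, state it; otherwise reply CONTINUE.")
        if "CONTINUE" not in verdict:
            return verdict

    # No early verdict: ask the judge for the most defensible final answer.
    context = f"Question: {question}\nDebate so far:\n" + "\n".join(transcript)
    return ask_model(context + "\nYou are the judge. Summarize the debate and give the most defensible final answer.")
```

Making the judge's stopping decision explicit keeps the number of rounds bounded while still letting a debate end early once the judge is satisfied.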