How susceptible are LLMs to Logical Fallacies?
- URL: http://arxiv.org/abs/2308.09853v1
- Date: Fri, 18 Aug 2023 23:07:29 GMT
- Title: How susceptible are LLMs to Logical Fallacies?
- Authors: Amirreza Payandeh, Dan Pluth, Jordan Hosier, Xuesu Xiao, Vijay K. Gurbani
- Abstract summary: We present LOGICOM, a diagnostic benchmark to assess the robustness of Large Language Models against logical fallacies.
We use this benchmark to evaluate the performance of GPT-3.5 and GPT-4 using a dataset containing controversial topics.
Our findings indicate that both GPT-3.5 and GPT-4 can adjust their opinion through reasoning.
- Score: 5.723715910568911
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper investigates the rational thinking capability of Large Language
Models (LLMs) in multi-round argumentative debates by exploring the impact of
fallacious arguments on their logical reasoning performance. More specifically,
we present Logic Competence Measurement Benchmark (LOGICOM), a diagnostic
benchmark to assess the robustness of LLMs against logical fallacies. LOGICOM
involves two agents: a persuader and a debater engaging in a multi-round debate
on a controversial topic, where the persuader tries to convince the debater of
the correctness of its claim. First, LOGICOM assesses the potential of LLMs to
change their opinions through reasoning. Then, it evaluates the debater's
performance in logical reasoning by contrasting the scenario where the
persuader employs logical fallacies against one where logical reasoning is
used. We use this benchmark to evaluate the performance of GPT-3.5 and GPT-4
using a dataset containing controversial topics, claims, and reasons supporting
them. Our findings indicate that both GPT-3.5 and GPT-4 can adjust their
opinion through reasoning. However, when presented with logical fallacies,
GPT-3.5 and GPT-4 are erroneously convinced 41% and 69% more often,
respectively, compared to when logical reasoning is used. Finally, we introduce
a new dataset containing over 5k pairs of logical vs. fallacious arguments. The
source code and dataset of this work are made publicly available.
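To make the protocol described in the abstract concrete, below is a minimal sketch of a persuader-vs-debater loop. The `chat` callable, the prompt wording, the stance parsing, and the round limit are illustrative assumptions standing in for any chat-completion backend; this is not the authors' LOGICOM implementation.

```python
# Minimal sketch of a LOGICOM-style persuader-vs-debater debate (illustrative only).
# `chat(system_prompt, transcript)` is a hypothetical stand-in for any
# chat-completion API call; swap in a real model client before running experiments.

def parse_stance(reply: str) -> str:
    """Crude stance extraction; check DISAGREE first since it contains AGREE."""
    text = reply.upper()
    if "DISAGREE" in text:
        return "disagree"
    if "AGREE" in text:
        return "agree"
    return "unclear"


def run_debate(chat, topic, claim, persuader_mode="logical", max_rounds=5):
    """Run one multi-round debate and report whether the debater changed its mind.

    persuader_mode: "logical" asks for sound arguments; "fallacious" asks for
    persuasive but logically fallacious ones (the contrast LOGICOM measures).
    """
    persuader_sys = (
        f"You must convince your opponent that the claim '{claim}' about "
        f"'{topic}' is correct. "
        + ("Use valid, well-supported reasoning."
           if persuader_mode == "logical"
           else "Use persuasive but logically fallacious arguments.")
    )
    debater_sys = (
        f"You are debating the claim '{claim}' about '{topic}'. "
        "Begin your reply with AGREE or DISAGREE, then justify your position."
    )

    transcript = []
    initial = parse_stance(chat(debater_sys, transcript))  # stance before debate

    final = initial
    for _ in range(max_rounds):
        transcript.append({"role": "persuader", "text": chat(persuader_sys, transcript)})
        reply = chat(debater_sys, transcript)
        transcript.append({"role": "debater", "text": reply})
        final = parse_stance(reply)

    return {"initial": initial, "final": final,
            "changed_opinion": final != initial, "transcript": transcript}
```

Running the same claims twice, once with persuader_mode="logical" and once with "fallacious", and comparing the rate of changed_opinion across the two conditions yields the kind of susceptibility gap the abstract reports (GPT-3.5 and GPT-4 erroneously convinced 41% and 69% more often under fallacies).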
Related papers
- A Logical Fallacy-Informed Framework for Argument Generation [34.35377699079075]
We introduce FIPO, a fallacy-informed framework that steers Large Language Models toward logically sound arguments.
Our results on argumentation datasets show that our method reduces the fallacy errors by up to 17.5%.
Our code is available at lucamouchel.com/lucamouchel/Logical-Fallacies.
arXiv Detail & Related papers (2024-08-07T08:19:44Z)
- Missci: Reconstructing Fallacies in Misrepresented Science [84.32990746227385]
Health-related misinformation on social networks can lead to poor decision-making and real-world dangers.
Missci is a novel argumentation theoretical model for fallacious reasoning.
We present Missci as a dataset to test the critical reasoning abilities of large language models.
arXiv Detail & Related papers (2024-06-05T12:11:10Z)
- LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models [52.03659714625452]
Recently developed large language models (LLMs) have been shown to perform remarkably well on a wide range of language understanding tasks.
But can they really "reason" over natural language?
This question has received significant research attention, and many reasoning skills, such as commonsense, numerical, and qualitative reasoning, have been studied.
arXiv Detail & Related papers (2024-04-23T21:08:49Z)
- Reason from Fallacy: Enhancing Large Language Models' Logical Reasoning through Logical Fallacy Understanding [40.2816930342597]
Large Language Models (LLMs) have demonstrated good performance in many reasoning tasks.
But they still struggle with some complicated reasoning tasks, including logical reasoning.
In this paper, we propose five concrete tasks spanning three cognitive dimensions: WHAT, WHY, and HOW.
arXiv Detail & Related papers (2024-04-04T08:38:03Z)
- Assessing the Reasoning Abilities of ChatGPT in the Context of Claim Verification [19.94897851500131]
We evaluate the reasoning capabilities of GPT-3.5-Turbo and GPT-4.
Our study contributes to the growing body of research suggesting that ChatGPT's reasoning processes are unlikely to mirror human-like reasoning.
arXiv Detail & Related papers (2024-02-16T14:52:05Z)
- LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models [63.14196038655506]
We introduce LogicAsker, a novel approach for evaluating and enhancing the logical reasoning capabilities of large language models (LLMs).
Our methodology reveals significant gaps in LLMs' learning of logical rules, with identified reasoning failures ranging from 29% to 90% across different models.
We leverage these findings to construct targeted demonstration examples and fine-tune data, notably enhancing logical reasoning in models like GPT-4o by up to 5%.
arXiv Detail & Related papers (2024-01-01T13:53:53Z)
- Self-Contradictory Reasoning Evaluation and Detection [31.452161594896978]
We investigate self-contradictory (Self-Contra) reasoning, where the model reasoning does not support its answers.
We find that LLMs often contradict themselves in reasoning tasks involving contextual information understanding or commonsense.
We find that GPT-4 can detect Self-Contra with a 52.2% F1 score, much lower than the 66.7% achieved by humans.
arXiv Detail & Related papers (2023-11-16T06:22:17Z)
- A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning.
Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
arXiv Detail & Related papers (2023-11-14T07:13:10Z)
- Sentiment Analysis through LLM Negotiations [58.67939611291001]
A standard paradigm for sentiment analysis is to rely on a single LLM and make the decision in a single round.
This paper introduces a multi-LLM negotiation framework for sentiment analysis.
arXiv Detail & Related papers (2023-11-03T12:35:29Z)
- LINC: A Neurosymbolic Approach for Logical Reasoning by Combining Language Models with First-Order Logic Provers [60.009969929857704]
Logical reasoning is an important task for artificial intelligence with potential impacts on science, mathematics, and society.
In this work, we reformulate such tasks as modular neurosymbolic programming, which we call LINC.
We observe significant performance gains on FOLIO and a balanced subset of ProofWriter for three different models in nearly all experimental conditions we evaluate.
arXiv Detail & Related papers (2023-10-23T17:58:40Z)
- Can ChatGPT Defend its Belief in Truth? Evaluating LLM Reasoning via Debate [19.887103433032774]
Large language models (LLMs) have shown impressive performance in complex reasoning tasks.
This work explores testing LLMs' reasoning by engaging with them in a debate-like conversation.
We find that despite their impressive performance, LLMs like ChatGPT cannot maintain their beliefs in truth for a significant portion of examples.
arXiv Detail & Related papers (2023-05-22T15:47:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.