Assessing the Reasoning Abilities of ChatGPT in the Context of Claim Verification
- URL: http://arxiv.org/abs/2402.10735v2
- Date: Wed, 20 Mar 2024 19:14:54 GMT
- Title: Assessing the Reasoning Abilities of ChatGPT in the Context of Claim Verification
- Authors: John Dougrez-Lewis, Mahmud Elahi Akhter, Yulan He, Maria Liakata
- Abstract summary: We evaluate the reasoning capabilities of GPT-3.5-Turbo and GPT-4.
Our study contributes to the growing body of research suggesting that ChatGPT's reasoning processes are unlikely to mirror human-like reasoning.
- Score: 19.94897851500131
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The reasoning capabilities of LLMs are currently hotly debated. We examine the issue from the perspective of claim/rumour verification. We propose the first logical reasoning framework designed to break down any claim or rumour paired with evidence into the atomic reasoning steps necessary for verification. Based on our framework, we curate two annotated collections of such claim/evidence pairs: a synthetic dataset from Wikipedia and a real-world set stemming from rumours circulating on Twitter. We use them to evaluate the reasoning capabilities of GPT-3.5-Turbo and GPT-4 (hereinafter referred to as ChatGPT) within the context of our framework, providing a thorough analysis. Our results show that ChatGPT struggles in abductive reasoning, although this can be somewhat mitigated by using manual Chain of Thought (CoT) as opposed to Zero-Shot (ZS) and ZS CoT approaches. Our study contributes to the growing body of research suggesting that ChatGPT's reasoning processes are unlikely to mirror human-like reasoning, and that LLMs need to be more rigorously evaluated to distinguish between hype and actual capabilities, especially in high-stakes real-world tasks such as claim verification.
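The abstract contrasts three prompting regimes: Zero-Shot (ZS), ZS CoT, and manual CoT. As a rough illustration of how they differ, the sketch below builds one prompt per regime for a claim/evidence pair. Everything in it is an assumption made for illustration and is not taken from the paper: the function `build_prompt`, the task wording, the label set, and the worked demonstration are hypothetical. The ZS CoT trigger phrase is the commonly used "Let's think step by step.", which may differ from the authors' wording.

```python
# Illustrative sketch only: prompt wording, labels, and the demonstration
# are hypothetical and are NOT the prompts used in the paper.

TASK = (
    "Decide whether the evidence supports the claim, refutes it, "
    "or is insufficient to decide."
)

# A hand-written worked example, as manual CoT would supply (hypothetical).
MANUAL_COT_DEMO = (
    "Claim: The bridge was closed on Monday.\n"
    "Evidence: Officials said the bridge reopened on Tuesday after a one-day closure.\n"
    "Reasoning: A one-day closure that ended on Tuesday means the bridge "
    "was closed on Monday.\n"
    "Answer: supported\n"
)


def build_prompt(claim: str, evidence: str, style: str) -> str:
    """Return a prompt for one of three styles: 'zs', 'zs_cot', 'manual_cot'."""
    query = f"Claim: {claim}\nEvidence: {evidence}\n"
    if style == "zs":
        # Zero-Shot: ask for the label directly, with no reasoning cue.
        return f"{TASK}\n\n{query}Answer:"
    if style == "zs_cot":
        # Zero-Shot CoT: no demonstration, only a generic reasoning trigger.
        return f"{TASK}\n\n{query}Let's think step by step."
    if style == "manual_cot":
        # Manual CoT: prepend a hand-written reasoning demonstration.
        return f"{TASK}\n\n{MANUAL_COT_DEMO}\n{query}Reasoning:"
    raise ValueError(f"unknown prompt style: {style!r}")


if __name__ == "__main__":
    for style in ("zs", "zs_cot", "manual_cot"):
        print(f"--- {style} ---")
        print(build_prompt("The museum is free on Sundays.",
                           "The museum waives admission every Sunday.",
                           style))
```

The only difference between the regimes in this sketch is what surrounds the query: nothing, a generic reasoning cue, or a hand-written reasoning example.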
Related papers
- Towards Faithful Chain-of-Thought: Large Language Models are Bridging Reasoners [19.40385041079461]
Large language models (LLMs) suffer from serious unfaithful chain-of-thought (CoT) issues.
We first study the CoT faithfulness issue at the granularity of CoT steps and identify two reasoning paradigms.
We then conduct a joint analysis of the causal relevance among the context, CoT, and answer during reasoning.
arXiv Detail & Related papers (2024-05-29T09:17:46Z) - RAGged Edges: The Double-Edged Sword of Retrieval-Augmented Chatbots [6.893551641325889]
ChatGPT's tendency to hallucinate -- generate plausible but false information -- poses a significant challenge.
This paper explores how Retrieval-Augmented Generation can counter hallucinations by integrating external knowledge with prompts.
Our results show that RAG increases accuracy in some cases, but can still be misled when prompts directly contradict the model's pre-trained understanding.
arXiv Detail & Related papers (2024-03-02T12:19:04Z) - LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models [63.14196038655506]
We introduce LogicAsker, a novel approach for evaluating and enhancing the logical reasoning capabilities of large language models (LLMs).
Our methodology reveals significant gaps in LLMs' learning of logical rules, with identified reasoning failures ranging from 29% to 90% across different models.
We leverage these findings to construct targeted demonstration examples and fine-tune data, notably enhancing logical reasoning in models like GPT-4o by up to 5%.
arXiv Detail & Related papers (2024-01-01T13:53:53Z) - Self-Contradictory Reasoning Evaluation and Detection [31.452161594896978]
We investigate self-contradictory (Self-Contra) reasoning, where the model reasoning does not support its answers.
We find that LLMs often contradict themselves in reasoning tasks involving contextual information understanding or commonsense.
We find that GPT-4 can detect Self-Contra with a 52.2% F1 score, much lower compared to 66.7% for humans.
arXiv Detail & Related papers (2023-11-16T06:22:17Z) - Sentiment Analysis through LLM Negotiations [58.67939611291001]
A standard paradigm for sentiment analysis is to rely on a single LLM and make the decision in a single round.
This paper introduces a multi-LLM negotiation framework for sentiment analysis.
arXiv Detail & Related papers (2023-11-03T12:35:29Z) - Reasoning on Graphs: Faithful and Interpretable Large Language Model Reasoning [104.92384929827776]
Large language models (LLMs) have demonstrated impressive reasoning abilities in complex tasks.
They lack up-to-date knowledge and experience hallucinations during reasoning.
Knowledge graphs (KGs) offer a reliable source of knowledge for reasoning.
arXiv Detail & Related papers (2023-10-02T10:14:43Z) - How susceptible are LLMs to Logical Fallacies? [5.723715910568911]
We present LOGICOM, a diagnostic benchmark to assess the robustness of Large Language Models against logical fallacies.
We use this benchmark to evaluate the performance of GPT-3.5 and GPT-4 using a dataset containing controversial topics.
Our findings indicate that both GPT-3.5 and GPT-4 can adjust their opinion through reasoning.
arXiv Detail & Related papers (2023-08-18T23:07:29Z) - Can ChatGPT Defend its Belief in Truth? Evaluating LLM Reasoning via Debate [19.887103433032774]
Large language models (LLMs) have shown impressive performance in complex reasoning tasks.
This work explores testing LLMs' reasoning by engaging with them in a debate-like conversation.
We find that despite their impressive performance, LLMs like ChatGPT cannot maintain their beliefs in truth for a significant portion of examples.
arXiv Detail & Related papers (2023-05-22T15:47:31Z) - Consistency Analysis of ChatGPT [65.268245109828]
This paper investigates the trustworthiness of ChatGPT and GPT-4 regarding logically consistent behaviour.
Our findings suggest that while both models appear to show an enhanced language understanding and reasoning ability, they still frequently fall short of generating logically consistent predictions.
arXiv Detail & Related papers (2023-03-11T01:19:01Z) - Can ChatGPT Understand Too? A Comparative Study on ChatGPT and Fine-tuned BERT [103.57103957631067]
ChatGPT has attracted great attention, as it can generate fluent and high-quality responses to human inquiries.
We assess ChatGPT's understanding ability by evaluating it on the most popular GLUE benchmark and comparing it with 4 representative fine-tuned BERT-style models.
We find that: 1) ChatGPT falls short in handling paraphrase and similarity tasks; 2) ChatGPT outperforms all BERT models on inference tasks by a large margin; 3) ChatGPT achieves performance comparable to BERT on sentiment analysis and question answering tasks.
arXiv Detail & Related papers (2023-02-19T12:29:33Z) - Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters [82.84696222087396]
Chain-of-Thought (CoT) prompting can dramatically improve the multi-step reasoning abilities of large language models (LLMs).
We show that CoT reasoning is possible even with invalid demonstrations.
arXiv Detail & Related papers (2022-12-20T05:20:54Z)