Self-Contradictory Reasoning Evaluation and Detection
- URL: http://arxiv.org/abs/2311.09603v2
- Date: Mon, 19 Feb 2024 18:01:56 GMT
- Title: Self-Contradictory Reasoning Evaluation and Detection
- Authors: Ziyi Liu, Isabelle Lee, Yongkang Du, Soumya Sanyal, Jieyu Zhao
- Abstract summary: We investigate self-contradictory (Self-Contra) reasoning, where the model's reasoning does not support its predictions.
A higher accuracy does not necessarily correspond to a lower Self-Contra rate.
We find that GPT-4 struggles to effectively detect Self-Contra reasoning.
- Score: 23.737562513392255
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In a plethora of recent work, large language models (LLMs) have
demonstrated impressive reasoning ability, but many proposed downstream
reasoning tasks focus only on performance-oriented evaluation. Two fundamental
questions persist: 1) how reliable is the quality of reasoning, and 2) can
models detect unreliable reasoning? In this paper, we investigate
self-contradictory (Self-Contra) reasoning, where the model's reasoning does
not support its predictions. To address 1), we assess the Self-Contra rate
across four datasets and delve into finer-grained categories of Self-Contra
reasoning. We find that LLMs often contradict themselves when performing
reasoning tasks that involve contextual understanding or commonsense.
Importantly, higher accuracy does not necessarily correspond to a lower
Self-Contra rate: a model may generate correct answers while taking shortcuts
in its reasoning or skipping over contextual evidence, thereby displaying
Self-Contra behavior despite appearing accurate. As for 2), we task GPT-4 with
identifying Self-Contra reasoning and finer-grained fallacies. We observe that
GPT-4 struggles to detect Self-Contra reasoning effectively, with performance
significantly below human judgment. Our results indicate that current LLMs
lack the robustness necessary for reliable reasoning, and we emphasize the
urgent need to establish best practices for comprehensive reasoning evaluation
beyond accuracy-based metrics.
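To make the core quantity concrete, the sketch below shows one way a Self-Contra rate could be computed: for each example, compare the answer implied by the model's stated reasoning with the answer it actually predicted, and report the fraction of mismatches alongside plain accuracy. The record fields and the `answer_implied_by_reasoning` heuristic are illustrative assumptions, not the paper's evaluation code; the paper relies on finer-grained categories and human/GPT-4 judgments rather than a string heuristic.

```python
# Minimal, hypothetical sketch of a Self-Contra rate computation.
# Data layout and the reasoning-parsing heuristic are assumptions for illustration.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReasoningRecord:
    question: str
    reasoning: str    # model-generated rationale / chain of thought
    prediction: str   # model's final answer
    gold_answer: str  # reference label from the dataset


def answer_implied_by_reasoning(record: ReasoningRecord) -> Optional[str]:
    """Crude stand-in judge: take the text after the last 'the answer is' marker.
    In practice this judgment would come from human annotators or a judge model."""
    marker = "the answer is"
    lowered = record.reasoning.lower()
    if marker not in lowered:
        return None
    return lowered.rsplit(marker, 1)[-1].strip(" .")


def self_contra_rate(records: list[ReasoningRecord]) -> float:
    """Fraction of examples whose stated reasoning does not support the prediction."""
    contradictions = 0
    for record in records:
        implied = answer_implied_by_reasoning(record)
        if implied is not None and implied != record.prediction.lower().strip(" ."):
            contradictions += 1
    return contradictions / len(records)


def accuracy(records: list[ReasoningRecord]) -> float:
    """Accuracy can stay high even when the Self-Contra rate is also high."""
    return sum(r.prediction == r.gold_answer for r in records) / len(records)


if __name__ == "__main__":
    demo = [
        ReasoningRecord(
            question="Is the sky blue on a clear day?",
            reasoning="Clear daytime skies scatter blue light, so the answer is yes.",
            prediction="yes",
            gold_answer="yes",
        ),
        ReasoningRecord(
            question="Can penguins fly?",
            reasoning="Penguins are flightless birds, so the answer is no.",
            prediction="yes",  # final answer contradicts the stated rationale
            gold_answer="no",
        ),
    ]
    print(f"accuracy: {accuracy(demo):.2f}, self-contra rate: {self_contra_rate(demo):.2f}")
```

Reporting the two numbers side by side is the point of the sketch: an accuracy-only evaluation would hide the second example's contradiction between rationale and answer.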
Related papers
- Assessing the Reasoning Abilities of ChatGPT in the Context of Claim Verification [19.94897851500131]
We evaluate the reasoning capabilities of GPT-3.5-Turbo and GPT-4.
Our study contributes to the growing body of research suggesting that ChatGPT's reasoning processes are unlikely to mirror human-like reasoning.
arXiv Detail & Related papers (2024-02-16T14:52:05Z)
- A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning.
Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
arXiv Detail & Related papers (2023-11-14T07:13:10Z)
- From Heuristic to Analytic: Cognitively Motivated Strategies for Coherent Physical Commonsense Reasoning [66.98861219674039]
Heuristic-Analytic Reasoning (HAR) strategies drastically improve the coherence of rationalizations for model decisions.
Our findings suggest that human-like reasoning strategies can effectively improve the coherence and reliability of PLM reasoning.
arXiv Detail & Related papers (2023-10-24T19:46:04Z)
- Concise and Organized Perception Facilitates Reasoning in Large Language Models [32.71672086718057]
We show that large language models (LLMs) exhibit failure patterns akin to human-like cognitive biases when dealing with disordered and irrelevant content in reasoning tasks.
We propose a novel reasoning approach named Concise and Organized Perception (COP)
COP carefully analyzes the given statements to identify the most pertinent information while eliminating redundancy efficiently.
arXiv Detail & Related papers (2023-10-05T04:47:49Z)
- Towards CausalGPT: A Multi-Agent Approach for Faithful Knowledge Reasoning via Promoting Causal Consistency in LLMs [63.26541167737355]
We present a framework to increase faithfulness and causality for knowledge-based reasoning.
Our framework outperforms all compared state-of-the-art approaches by large margins.
arXiv Detail & Related papers (2023-08-23T04:59:21Z)
- How susceptible are LLMs to Logical Fallacies? [5.723715910568911]
We present LOGICOM, a diagnostic benchmark to assess the robustness of Large Language Models against logical fallacies.
We use this benchmark to evaluate the performance of GPT-3.5 and GPT-4 using a dataset containing controversial topics.
Our findings indicate that both GPT-3.5 and GPT-4 can adjust their opinion through reasoning.
arXiv Detail & Related papers (2023-08-18T23:07:29Z)
- Question Decomposition Improves the Faithfulness of Model-Generated Reasoning [23.34325378824462]
It is difficult to verify the correctness and safety of the behavior of large language models (LLMs).
One approach is to prompt LLMs to externalize their reasoning, by having them generate step-by-step reasoning as they answer a question.
This approach relies on the stated reasoning faithfully reflecting the model's actual reasoning, which is not always the case.
Decomposition-based methods achieve strong performance on question-answering tasks, sometimes approaching that of CoT.
arXiv Detail & Related papers (2023-07-17T00:54:10Z)
- Consistency Analysis of ChatGPT [65.268245109828]
This paper investigates the trustworthiness of ChatGPT and GPT-4 regarding logically consistent behaviour.
Our findings suggest that while both models appear to show an enhanced language understanding and reasoning ability, they still frequently fall short of generating logically consistent predictions.
arXiv Detail & Related papers (2023-03-11T01:19:01Z)
- Evaluate Confidence Instead of Perplexity for Zero-shot Commonsense Reasoning [85.1541170468617]
This paper reconsiders the nature of commonsense reasoning and proposes a novel commonsense reasoning metric, Non-Replacement Confidence (NRC).
Our proposed method boosts zero-shot performance on two commonsense reasoning benchmark datasets and seven further commonsense question-answering datasets.
arXiv Detail & Related papers (2022-08-23T14:42:14Z)
- Logical Satisfiability of Counterfactuals for Faithful Explanations in NLI [60.142926537264714]
We introduce the methodology of Faithfulness-through-Counterfactuals.
It generates a counterfactual hypothesis based on the logical predicates expressed in the explanation.
It then evaluates whether the model's prediction on the counterfactual is consistent with the expressed logic (a minimal illustration of this check appears after this list).
arXiv Detail & Related papers (2022-05-25T03:40:59Z)
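As a companion to the Logical Satisfiability of Counterfactuals entry above, the sketch below illustrates the general shape of a faithfulness-through-counterfactuals check: given a counterfactual hypothesis derived from a predicate in the explanation, the model is queried again and its new prediction is compared against the label the explanation's logic requires. The `nli_model` callable, the label strings, and the toy stub are assumptions made for illustration, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code) of a counterfactual faithfulness check.
from typing import Callable

Label = str  # e.g. "entailment", "contradiction", or "neutral"


def counterfactual_consistency(
    premise: str,
    hypothesis: str,
    counterfactual_hypothesis: str,
    original_label: Label,
    expected_counterfactual_label: Label,
    nli_model: Callable[[str, str], Label],
) -> bool:
    """True if the model's prediction on the counterfactual hypothesis agrees
    with the label the explanation's stated logic says it should receive."""
    assert nli_model(premise, hypothesis) == original_label, "original prediction changed"
    return nli_model(premise, counterfactual_hypothesis) == expected_counterfactual_label


# Toy stub standing in for a real NLI model, purely so the example runs.
def stub_nli(premise: str, hypothesis: str) -> Label:
    return "entailment" if hypothesis in premise else "contradiction"


faithful = counterfactual_consistency(
    premise="the dog is sleeping on the sofa",
    hypothesis="the dog is sleeping",
    counterfactual_hypothesis="the dog is not sleeping",
    original_label="entailment",
    expected_counterfactual_label="contradiction",
    nli_model=stub_nli,
)
print("explanation consistent under this counterfactual:", faithful)
```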
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.