Related papers: Self-Contradictory Reasoning Evaluation and Detection

Self-Contradictory Reasoning Evaluation and Detection

URL: http://arxiv.org/abs/2311.09603v4
Date: Mon, 21 Oct 2024 04:16:09 GMT
Title: Self-Contradictory Reasoning Evaluation and Detection
Authors: Ziyi Liu, Soumya Sanyal, Isabelle Lee, Yongkang Du, Rahul Gupta, Yang Liu, Jieyu Zhao,
Abstract summary: We investigate self-contradictory (Self-Contra) reasoning, where the model reasoning does not support its answers. We find that LLMs often contradict themselves in reasoning tasks involving contextual information understanding or commonsense. We find that GPT-4 can detect Self-Contra with a 52.2% F1 score, much lower compared to 66.7% for humans.
Score: 31.452161594896978
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In a plethora of recent work, large language models (LLMs) demonstrated impressive reasoning ability, but many proposed downstream reasoning tasks only focus on final answers. Two fundamental questions persist: 1) how consistent is the reasoning, and 2) can models detect unreliable reasoning? In this paper, we investigate self-contradictory (Self-Contra) reasoning, where the model reasoning does not support its answers. To answer 1), we define and assess the Self-Contra rate across three datasets and delve into finer-grained categories of Self-Contra reasoning. We find that LLMs often contradict themselves in reasoning tasks involving contextual information understanding or commonsense. The model may generate correct answers by taking shortcuts in reasoning or overlooking contextual evidence, leading to compromised reasoning. For 2), we task the state-of-the-art model GPT-4 with identifying Self-Contra reasoning and finer-grained fallacies. We find that finer-grained categories enhanced detection can improve GPT-4's ability to detect Self-Contra. However, it is only able to detect Self-Contra with a 52.2% F1 score, much lower compared to 66.7% for humans. Our results indicate that current LLMs lack the robustness necessary for reliable reasoning and we emphasize the urgent need for establishing best practices in comprehensive reasoning evaluations beyond pure performance-based metrics.

Related papers

Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think [51.0691253204425]
We analyze intermediate reasoning steps, termed subthoughts, to answer two questions: Does the final answer reliably represent the model's optimal conclusion? Our approach involves segmenting a reasoning trace into sequential subthoughts based on linguistic cues. We find that aggregating these answers by selecting the most frequent one (the mode) often yields significantly higher accuracy compared to relying solely on the answer derived from the original complete trace.
arXiv Detail & Related papers (2025-04-29T12:39:07Z)
Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs [28.565225092457897]
Reinforcement learning can drive self-improvement in language models on verifiable tasks. We find that Qwen-2.5-3B far exceeds Llama-3.2-3B under identical RL training for the game of Countdown. Our study reveals that Qwen naturally exhibits these reasoning behaviors, whereas Llama initially lacks them.
arXiv Detail & Related papers (2025-03-03T08:46:22Z)
Information Re-Organization Improves Reasoning in Large Language Models [22.2946033364035]
We propose an information re-organization (InfoRE) method to enhance the reasoning ability of large language models (LLMs) Our method involves extracting logical relationships from the contextual content, such as documents or paragraphs, and subsequently pruning redundant content to minimize noise. To demonstrate the effectiveness of our approach in improving the reasoning ability, we conduct experiments using Llama2-70B, GPT-3.5, and GPT-4 on various contextually aware multi-hop reasoning tasks.
arXiv Detail & Related papers (2024-04-22T08:47:27Z)
Mitigating Misleading Chain-of-Thought Reasoning with Selective Filtering [59.495717939664246]
Large language models have manifested remarkable capabilities by leveraging chain-of-thought (CoT) reasoning techniques to solve intricate questions. We propose a novel approach called the selective filtering reasoner (SelF-Reasoner) that assesses the entailment relationship between the question and the candidate reasoning chain. SelF-Reasoner improves the fine-tuned T5 baseline consistently over the ScienceQA, ECQA, and LastLetter tasks.
arXiv Detail & Related papers (2024-03-28T06:28:35Z)
LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models [63.14196038655506]
We introduce LogicAsker, a novel approach for evaluating and enhancing the logical reasoning capabilities of large language models (LLMs) Our methodology reveals significant gaps in LLMs' learning of logical rules, with identified reasoning failures ranging from 29% to 90% across different models. We leverage these findings to construct targeted demonstration examples and fine-tune data, notably enhancing logical reasoning in models like GPT-4o by up to 5%.
arXiv Detail & Related papers (2024-01-01T13:53:53Z)
A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning. Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
arXiv Detail & Related papers (2023-11-14T07:13:10Z)
How susceptible are LLMs to Logical Fallacies? [5.723715910568911]
We present LOGICOM, a diagnostic benchmark to assess the robustness of Large Language Models against logical fallacies. We use this benchmark to evaluate the performance of GPT-3.5 and GPT-4 using a dataset containing controversial topics. Our findings indicate that both GPT-3.5 and GPT-4 can adjust their opinion through reasoning.
arXiv Detail & Related papers (2023-08-18T23:07:29Z)
Question Decomposition Improves the Faithfulness of Model-Generated Reasoning [23.34325378824462]
Large language models (LLMs) are difficult to verify the correctness and safety of their behavior. One approach is to prompt LLMs to externalize their reasoning, by having them generate step-by-step reasoning as they answer a question. This approach relies on the stated reasoning faithfully reflecting the model's actual reasoning, which is not always the case. Decomposition-based methods achieve strong performance on question-answering tasks, sometimes approaching that of CoT.
arXiv Detail & Related papers (2023-07-17T00:54:10Z)
Language Models with Rationality [57.37201135072838]
Large language models (LLMs) are proficient at question-answering (QA) It is not always clear how (or even if) an answer follows from their latent "beliefs"
arXiv Detail & Related papers (2023-05-23T17:04:25Z)
Consistency Analysis of ChatGPT [65.268245109828]
This paper investigates the trustworthiness of ChatGPT and GPT-4 regarding logically consistent behaviour. Our findings suggest that while both models appear to show an enhanced language understanding and reasoning ability, they still frequently fall short of generating logically consistent predictions.
arXiv Detail & Related papers (2023-03-11T01:19:01Z)
Faithful Reasoning Using Large Language Models [12.132449274592668]
We show how LMs can be made to perform faithful multi-step reasoning via a process whose causal structure mirrors the underlying logical structure of the problem. Our approach works by chaining together reasoning steps, where each step results from calls to two fine-tuned LMs. We demonstrate the effectiveness of our model on multi-step logical deduction and scientific question-answering, showing that it outperforms baselines on final answer accuracy.
arXiv Detail & Related papers (2022-08-30T13:44:41Z)
Evaluate Confidence Instead of Perplexity for Zero-shot Commonsense Reasoning [85.1541170468617]
This paper reconsiders the nature of commonsense reasoning and proposes a novel commonsense reasoning metric, Non-Replacement Confidence (NRC) Our proposed novel method boosts zero-shot performance on two commonsense reasoning benchmark datasets and further seven commonsense question-answering datasets.
arXiv Detail & Related papers (2022-08-23T14:42:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.