Chain-of-Thought Reasoning In The Wild Is Not Always Faithful
- URL: http://arxiv.org/abs/2503.08679v3
- Date: Wed, 19 Mar 2025 19:20:42 GMT
- Title: Chain-of-Thought Reasoning In The Wild Is Not Always Faithful
- Authors: Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, Arthur Conmy
- Abstract summary: Chain-of-Thought (CoT) reasoning has significantly advanced state-of-the-art AI capabilities. We show that unfaithful CoT can occur on realistic prompts with no artificial bias. Specifically, we find that models rationalize their implicit biases in answers to binary questions.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Chain-of-Thought (CoT) reasoning has significantly advanced state-of-the-art AI capabilities. However, recent studies have shown that CoT reasoning is not always faithful, i.e. CoT reasoning does not always reflect how models arrive at conclusions. So far, most of these studies have focused on unfaithfulness in unnatural contexts where an explicit bias has been introduced. In contrast, we show that unfaithful CoT can occur on realistic prompts with no artificial bias. Our results reveal non-negligible rates of several forms of unfaithful reasoning in frontier models: Sonnet 3.7 (16.3%), DeepSeek R1 (5.3%) and ChatGPT-4o (7.0%) all answer a notable proportion of question pairs unfaithfully. Specifically, we find that models rationalize their implicit biases in answers to binary questions ("implicit post-hoc rationalization"). For example, when separately presented with the questions "Is X bigger than Y?" and "Is Y bigger than X?", models sometimes produce superficially coherent arguments to justify answering Yes to both questions or No to both questions, despite such responses being logically contradictory. We also investigate restoration errors (Dziri et al., 2023), where models make and then silently correct errors in their reasoning, and unfaithful shortcuts, where models use clearly illogical reasoning to simplify solving problems in Putnam questions (a hard benchmark). Our findings raise challenges for AI safety work that relies on monitoring CoT to detect undesired behavior.
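The implicit post-hoc rationalization described above can be detected with a simple paired-question consistency check. The sketch below is illustrative only: `ask_model` is a hypothetical stand-in for querying a real model, and it assumes X and Y denote distinct quantities so that exactly one of the two strict comparisons should hold.

```python
from typing import Callable

def check_pair_consistency(ask_model: Callable[[str], str],
                           x: str, y: str) -> bool:
    """Return True if the model's answers to the two reversed
    comparison questions are logically consistent, i.e. exactly
    one of them is 'Yes' (assuming x and y differ)."""
    a = ask_model(f"Is {x} bigger than {y}?")
    b = ask_model(f"Is {y} bigger than {x}?")
    # Yes/Yes or No/No is contradictory: at most one of the
    # two strict comparisons can be true.
    return (a == "Yes") != (b == "Yes")

# A toy "model" with a Yes-bias exhibits the contradiction:
biased = lambda question: "Yes"
print(check_pair_consistency(biased, "the Nile", "the Amazon"))  # False
```

In the paper's setting, a pair counted as unfaithful additionally requires the CoT to present superficially coherent justifications for both contradictory answers; the check above only flags the logical contradiction itself.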
Related papers
- Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think [51.0691253204425]
We analyze intermediate reasoning steps, termed subthoughts, asking among other questions: does the final answer reliably represent the model's optimal conclusion?
Our approach involves segmenting a reasoning trace into sequential subthoughts based on linguistic cues.
We find that aggregating these answers by selecting the most frequent one (the mode) often yields significantly higher accuracy compared to relying solely on the answer derived from the original complete trace.
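The mode-aggregation idea in this summary can be sketched as follows, assuming the per-subthought answers have already been extracted from the reasoning trace:

```python
from collections import Counter

def aggregate_by_mode(subthought_answers: list[str]) -> str:
    """Pick the most frequent answer across intermediate subthoughts,
    rather than trusting only the final answer of the full trace."""
    counts = Counter(subthought_answers)
    # most_common(1) returns the single highest-count answer
    # (ties broken by first occurrence).
    return counts.most_common(1)[0][0]

# Here the final answer ("12") disagrees with the majority:
print(aggregate_by_mode(["7", "7", "12", "7", "12"]))  # 7
```

The intuition is that a correct intermediate conclusion often recurs across subthoughts even when the model drifts to a wrong final answer.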
arXiv Detail & Related papers (2025-04-29T12:39:07Z) - Are DeepSeek R1 And Other Reasoning Models More Faithful? [2.0429566123690455]
We evaluate three reasoning models based on Qwen-2.5, Gemini-2, and DeepSeek-V3-Base. We test whether models can describe how a cue in their prompt influences their answer to MMLU questions. Reasoning models describe cues that influence them much more reliably than all the non-reasoning models tested.
arXiv Detail & Related papers (2025-01-14T14:31:45Z) - Chain-of-Probe: Examining the Necessity and Accuracy of CoT Step-by-Step [81.50681925980135]
We propose a method to probe changes in the mind during the model's reasoning.
By analyzing patterns in mind change, we examine the correctness of the model's reasoning.
Our validation reveals that many responses, although correct in their final answer, contain errors in their reasoning process.
arXiv Detail & Related papers (2024-06-23T15:50:22Z) - Mitigating Misleading Chain-of-Thought Reasoning with Selective Filtering [59.495717939664246]
Large language models have manifested remarkable capabilities by leveraging chain-of-thought (CoT) reasoning techniques to solve intricate questions.
We propose a novel approach called the selective filtering reasoner (SelF-Reasoner) that assesses the entailment relationship between the question and the candidate reasoning chain.
SelF-Reasoner improves the fine-tuned T5 baseline consistently over the ScienceQA, ECQA, and LastLetter tasks.
arXiv Detail & Related papers (2024-03-28T06:28:35Z) - Measuring Faithfulness in Chain-of-Thought Reasoning [19.074147845029355]
Large language models (LLMs) perform better when they produce step-by-step, "Chain-of-Thought" (CoT) reasoning before answering a question.
It is unclear if the stated reasoning is a faithful explanation of the model's actual reasoning (i.e., its process for answering the question).
We investigate hypotheses for how CoT reasoning may be unfaithful, by examining how the model predictions change when we intervene on the CoT.
arXiv Detail & Related papers (2023-07-17T01:08:39Z) - Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting [43.458726163197824]
Large Language Models (LLMs) can achieve strong performance on many tasks by producing step-by-step reasoning before giving a final output.
We find that CoT explanations can systematically misrepresent the true reason for a model's prediction.
arXiv Detail & Related papers (2023-05-07T22:44:25Z) - Measuring and Narrowing the Compositionality Gap in Language Models [116.5228850227024]
We measure how often models can correctly answer all sub-problems but not generate the overall solution.
We present a new method, self-ask, that further improves on chain of thought.
arXiv Detail & Related papers (2022-10-07T06:50:23Z) - Causal Expectation-Maximisation [70.45873402967297]
We show that causal inference is NP-hard even in models characterised by polytree-shaped graphs.
We introduce the causal EM algorithm to reconstruct the uncertainty about the latent variables from data about categorical manifest variables.
We argue that there appears to be an unnoticed limitation to the trending idea that counterfactual bounds can often be computed without knowledge of the structural equations.
arXiv Detail & Related papers (2020-11-04T10:25:13Z) - Counterfactual Variable Control for Robust and Interpretable Question Answering [57.25261576239862]
Deep neural network based question answering (QA) models are neither robust nor explainable in many cases.
In this paper, we inspect this spurious "capability" of QA models using causal inference.
We propose a novel approach called Counterfactual Variable Control (CVC) that explicitly mitigates any shortcut correlation.
arXiv Detail & Related papers (2020-10-12T10:09:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.