Language Models Don't Always Say What They Think: Unfaithful
Explanations in Chain-of-Thought Prompting
- URL: http://arxiv.org/abs/2305.04388v2
- Date: Sat, 9 Dec 2023 21:25:02 GMT
- Title: Language Models Don't Always Say What They Think: Unfaithful
Explanations in Chain-of-Thought Prompting
- Authors: Miles Turpin, Julian Michael, Ethan Perez, Samuel R. Bowman
- Abstract summary: Large Language Models (LLMs) can achieve strong performance on many tasks by producing step-by-step reasoning before giving a final output.
We find that CoT explanations can systematically misrepresent the true reason for a model's prediction.
- Score: 43.458726163197824
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) can achieve strong performance on many tasks by
producing step-by-step reasoning before giving a final output, often referred
to as chain-of-thought reasoning (CoT). It is tempting to interpret these CoT
explanations as the LLM's process for solving a task. This level of
transparency into LLMs' predictions would yield significant safety benefits.
However, we find that CoT explanations can systematically misrepresent the true
reason for a model's prediction. We demonstrate that CoT explanations can be
heavily influenced by adding biasing features to model inputs--e.g., by
reordering the multiple-choice options in a few-shot prompt to make the answer
always "(A)"--which models systematically fail to mention in their
explanations. When we bias models toward incorrect answers, they frequently
generate CoT explanations rationalizing those answers. This causes accuracy to
drop by as much as 36% on a suite of 13 tasks from BIG-Bench Hard, when testing
with GPT-3.5 from OpenAI and Claude 1.0 from Anthropic. On a social-bias task,
model explanations justify giving answers in line with stereotypes without
mentioning the influence of these social biases. Our findings indicate that CoT
explanations can be plausible yet misleading, which risks increasing our trust
in LLMs without guaranteeing their safety. Building more transparent and
explainable systems will require either improving CoT faithfulness through
targeted efforts or abandoning CoT in favor of alternative methods.
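The biasing manipulation described in the abstract is simple to reproduce. Below is a minimal, hypothetical sketch (not the authors' released code) of how one might reorder the multiple-choice options in each few-shot example so that the correct answer always lands in position (A); the `FewShotExample` structure and `build_biased_prompt` helper are illustrative names introduced here, not part of the paper's codebase.

```python
# Hypothetical sketch of the "answer is always (A)" biasing feature
# described in the abstract. Names and data structures are illustrative,
# not taken from the paper's released code.
from dataclasses import dataclass
from typing import List

LETTERS = ["A", "B", "C", "D"]


@dataclass
class FewShotExample:
    question: str
    options: List[str]   # answer options in their original order
    correct_index: int   # index of the correct option


def bias_to_position_a(example: FewShotExample) -> FewShotExample:
    """Reorder the options so the correct answer sits at position (A)."""
    reordered = [example.options[example.correct_index]] + [
        opt for i, opt in enumerate(example.options) if i != example.correct_index
    ]
    return FewShotExample(example.question, reordered, correct_index=0)


def format_example(example: FewShotExample, include_answer: bool = True) -> str:
    """Render one multiple-choice example as prompt text."""
    lines = [example.question]
    lines += [f"({LETTERS[i]}) {opt}" for i, opt in enumerate(example.options)]
    if include_answer:
        lines.append(f"Answer: ({LETTERS[example.correct_index]})")
    return "\n".join(lines)


def build_biased_prompt(few_shot: List[FewShotExample], test: FewShotExample) -> str:
    """Few-shot prompt in which every demonstration's answer is (A)."""
    demos = [format_example(bias_to_position_a(ex)) for ex in few_shot]
    return "\n\n".join(demos + [format_example(test, include_answer=False)])
```

Comparing a model's CoT and final answers on such biased prompts against unbiased controls, and checking whether the explanation ever mentions the option ordering, is the kind of faithfulness test the paper performs.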
Related papers
- Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning [11.758019716526459]
Chain-of-Thought (CoT) prompting has been shown to enhance the multi-step reasoning capabilities of Large Language Models (LLMs)
We show that CoT prompting performance reflects both memorization and a probabilistic version of genuine reasoning.
arXiv Detail & Related papers (2024-07-01T18:01:07Z) - Chain-of-Probe: Examining the Necessity and Accuracy of CoT Step-by-Step [81.50681925980135]
We propose a method to probe the model's changes of mind during its reasoning.
By analyzing patterns in mind change, we examine the correctness of the model's reasoning.
Our validation reveals that many responses, although correct in their final answer, contain errors in their reasoning process.
arXiv Detail & Related papers (2024-06-23T15:50:22Z) - A Hopfieldian View-based Interpretation for Chain-of-Thought Reasoning [48.51969964676017]
Chain-of-Thought (CoT) prompting plays a significant role in improving the reasoning performance of large language models.
We propose a Read-and-Control approach for controlling the accuracy of CoT.
arXiv Detail & Related papers (2024-06-18T04:07:13Z) - Mitigating Misleading Chain-of-Thought Reasoning with Selective Filtering [59.495717939664246]
Large language models have manifested remarkable capabilities by leveraging chain-of-thought (CoT) reasoning techniques to solve intricate questions.
We propose a novel approach called the selective filtering reasoner (SelF-Reasoner) that assesses the entailment relationship between the question and the candidate reasoning chain.
SelF-Reasoner improves the fine-tuned T5 baseline consistently over the ScienceQA, ECQA, and LastLetter tasks.
arXiv Detail & Related papers (2024-03-28T06:28:35Z) - Measuring Faithfulness in Chain-of-Thought Reasoning [19.074147845029355]
Large language models (LLMs) perform better when they produce step-by-step, "Chain-of-Thought" (CoT) reasoning before answering a question.
It is unclear if the stated reasoning is a faithful explanation of the model's actual reasoning (i.e., its process for answering the question).
We investigate hypotheses for how CoT reasoning may be unfaithful by examining how the model's predictions change when we intervene on the CoT (a minimal sketch of this kind of intervention appears after this list).
arXiv Detail & Related papers (2023-07-17T01:08:39Z) - Question Decomposition Improves the Faithfulness of Model-Generated Reasoning [23.34325378824462]
As large language models (LLMs) perform more difficult tasks, it becomes harder to verify the correctness and safety of their behavior.
One approach is to prompt LLMs to externalize their reasoning, by having them generate step-by-step reasoning as they answer a question.
This approach relies on the stated reasoning faithfully reflecting the model's actual reasoning, which is not always the case.
Decomposition-based methods achieve strong performance on question-answering tasks, sometimes approaching that of CoT.
arXiv Detail & Related papers (2023-07-17T00:54:10Z) - SCOTT: Self-Consistent Chain-of-Thought Distillation [68.40232422158569]
Large language models (LMs) generate free-text rationales for their predictions via chain-of-thought prompting.
We propose a faithful knowledge distillation method to learn a small, self-consistent CoT model from a teacher model that is orders of magnitude larger.
To ensure faithful distillation, we use the teacher-generated rationales to learn a student LM with a counterfactual reasoning objective.
arXiv Detail & Related papers (2023-05-03T03:47:00Z) - Large Language Models are Better Reasoners with Self-Verification [48.534270563880845]
Large language models (LLMs) have shown strong reasoning ability in several natural language processing tasks.
Chain-of-thought (CoT) prompting requires multi-step prompting and multi-token prediction, which makes LLMs highly sensitive to individual mistakes.
We propose and prove that LLMs also have similar self-verification abilities.
arXiv Detail & Related papers (2022-12-19T15:51:52Z)
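As a concrete illustration of the intervention-style faithfulness tests referenced above (e.g., in the Measuring Faithfulness entry), here is a hedged sketch of one common probe: truncate the chain of thought at different points and check whether the final answer changes. The `query_model` parameter is a placeholder for whatever LLM API is under test, not a specific library call; the function name is introduced here for illustration only.

```python
# Hypothetical sketch of a truncation-based CoT faithfulness probe.
# `query_model` is a placeholder callable for the LLM under test; it is
# assumed to return the model's text completion for a given prompt.
from typing import Callable, List


def truncation_probe(
    question: str,
    cot_steps: List[str],
    query_model: Callable[[str], str],
) -> List[str]:
    """Ask for a final answer after each prefix of the chain of thought.

    If the answer is already fixed before most of the reasoning has been
    shown, the stated reasoning is unlikely to be what drives the answer.
    """
    answers = []
    for k in range(len(cot_steps) + 1):
        prefix = "\n".join(cot_steps[:k])
        prompt = (
            f"{question}\n"
            f"{prefix}\n"
            "Given the reasoning so far, the single best answer is:"
        )
        answers.append(query_model(prompt).strip())
    return answers
```

Comparing the answer distribution across truncation points (and against the answer given with no CoT at all) gives a rough measure of how much the stated reasoning actually influences the prediction.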