Do Models Explain Themselves? Counterfactual Simulatability of Natural
Language Explanations
- URL: http://arxiv.org/abs/2307.08678v1
- Date: Mon, 17 Jul 2023 17:41:47 GMT
- Title: Do Models Explain Themselves? Counterfactual Simulatability of Natural
Language Explanations
- Authors: Yanda Chen, Ruiqi Zhong, Narutatsu Ri, Chen Zhao, He He, Jacob
Steinhardt, Zhou Yu, Kathleen McKeown
- Abstract summary: Large language models (LLMs) are trained to imitate humans to explain human decisions.
We evaluate whether an explanation can enable humans to precisely infer the model's outputs on diverse counterfactuals.
We found that LLMs' explanations have low precision and that precision does not correlate with plausibility.
- Score: 62.61495090463084
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) are trained to imitate humans to explain human
decisions. However, do LLMs explain themselves? Can they help humans build
mental models of how LLMs process different inputs? To answer these questions,
we propose to evaluate $\textbf{counterfactual simulatability}$ of natural
language explanations: whether an explanation can enable humans to precisely
infer the model's outputs on diverse counterfactuals of the explained input.
For example, if a model answers "yes" to the input question "Can eagles fly?"
with the explanation "all birds can fly", then humans would infer from the
explanation that it would also answer "yes" to the counterfactual input "Can
penguins fly?". If the explanation is precise, then the model's answer should
match humans' expectations.
We implemented two metrics based on counterfactual simulatability: precision
and generality. We generated diverse counterfactuals automatically using LLMs.
We then used these metrics to evaluate state-of-the-art LLMs (e.g., GPT-4) on
two tasks: multi-hop factual reasoning and reward modeling. We found that LLMs'
explanations have low precision and that precision does not correlate with
plausibility. Therefore, naively optimizing human approvals (e.g., RLHF) may
not be a sufficient solution.
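As a rough illustration of the precision metric described above, the sketch below computes the fraction of generated counterfactuals on which the answer an observer infers from the model's explanation matches the model's actual answer. The helper names (`model_answer`, `simulate_from_explanation`) are hypothetical placeholders rather than the authors' released code, and counterfactuals whose answer cannot be inferred from the explanation are assumed to have been filtered out upstream.

```python
# Minimal sketch of the precision metric for counterfactual simulatability.
# The callables below are hypothetical stand-ins: `model_answer` queries the
# explained model, and `simulate_from_explanation` plays the role of a human
# inferring the model's answer to a counterfactual from its explanation.
from typing import Callable, List


def counterfactual_precision(
    model_answer: Callable[[str], str],
    simulate_from_explanation: Callable[[str, str], str],
    explanation: str,
    counterfactuals: List[str],
) -> float:
    """Fraction of counterfactual inputs on which the answer inferred from
    the explanation matches the model's actual answer."""
    if not counterfactuals:
        return 0.0
    matches = sum(
        simulate_from_explanation(explanation, cf) == model_answer(cf)
        for cf in counterfactuals
    )
    return matches / len(counterfactuals)


# Toy reading of the eagle/penguin example: the explanation "all birds can fly"
# leads an observer to infer "yes" for "Can penguins fly?"; if the model
# actually answers "no", that counterfactual lowers precision.
```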
Related papers
- P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains [97.25943550933829]
We present P-FOLIO, a human-annotated dataset consisting of diverse and complex reasoning chains.
We use P-FOLIO to evaluate and improve large-language-model (LLM) reasoning capabilities.
arXiv Detail & Related papers (2024-10-11T19:22:57Z)
- Explanation sensitivity to the randomness of large language models: the case of journalistic text classification [6.240875403446504]
We study the effect of random elements in the training of large language models on the explainability of their predictions.
Using a fine-tuned CamemBERT model and an explanation method based on relevance propagation, we find that training with different random seeds produces models with similar accuracy but variable explanations.
arXiv Detail & Related papers (2024-10-07T14:39:45Z)
- Comparing zero-shot self-explanations with human rationales in multilingual text classification [5.32539007352208]
Instruction-tuned LLMs can generate self-explanations that require neither additional computation nor the application of possibly complex XAI methods.
We analyse whether this ability results in a good explanation by evaluating self-explanations in the form of input rationales.
Our results show that self-explanations align more closely with human annotations than LRP (layer-wise relevance propagation) attributions, while maintaining a comparable level of faithfulness.
arXiv Detail & Related papers (2024-10-04T10:14:12Z)
- Towards Consistent Natural-Language Explanations via Explanation-Consistency Finetuning [66.87754065127714]
Large language models (LLMs) often generate convincing, fluent explanations, but these explanations are frequently inconsistent across different inputs.
We propose explanation-consistency finetuning (EC-finetuning) to generate consistent natural-language explanations.
arXiv Detail & Related papers (2024-01-25T07:04:30Z)
- Towards a Mechanistic Interpretation of Multi-Step Reasoning Capabilities of Language Models [107.07851578154242]
Language models (LMs) have strong multi-step (i.e., procedural) reasoning capabilities.
It is unclear whether LMs perform these tasks by relying on answers memorized from the pretraining corpus or via a genuine multi-step reasoning mechanism.
We show that the proposed probing approach, MechanisticProbe, is able to recover the reasoning tree from the model's attention patterns for most examples.
arXiv Detail & Related papers (2023-10-23T01:47:29Z)
- Can Large Language Models Explain Themselves? A Study of LLM-Generated Self-Explanations [14.685170467182369]
Large language models (LLMs) such as ChatGPT have demonstrated superior performance on a variety of natural language processing (NLP) tasks.
Since these models are instruction-tuned on human conversations to produce "helpful" responses, they can and often will produce explanations along with the response.
arXiv Detail & Related papers (2023-10-17T12:34:32Z)
- Learning to Scaffold: Optimizing Model Explanations for Teaching [74.25464914078826]
We train models on three natural language processing and computer vision tasks.
We find that students trained with explanations extracted with our framework are able to simulate the teacher significantly more effectively than ones produced with previous methods.
arXiv Detail & Related papers (2022-04-22T16:43:39Z)
- Leakage-Adjusted Simulatability: Can Models Generate Non-Trivial Explanations of Their Behavior in Natural Language? [86.60613602337246]
We introduce a leakage-adjusted simulatability (LAS) metric for evaluating NL explanations.
LAS measures how well explanations help an observer predict a model's output, while controlling for how explanations can directly leak the output.
We frame explanation generation as a multi-agent game and optimize explanations for simulatability while penalizing label leakage.
arXiv Detail & Related papers (2020-10-08T16:59:07Z)
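For contrast with the precision metric sketched earlier, below is a minimal sketch of how a LAS-style score, as summarized in the last entry above, could be computed. It assumes the common formulation: the explanation's benefit to an observer (simulation accuracy with versus without the explanation) is averaged over "leaking" and "non-leaking" subsets, so explanations that merely restate the label do not receive full credit. The function and argument names are illustrative, and the exact weighting may differ from the paper.

```python
# Hedged sketch of a leakage-adjusted simulatability (LAS) style score.
# Inputs are per-example booleans; names are illustrative, not the paper's code.
from statistics import mean
from typing import List


def las_score(
    sim_correct_with_expl: List[bool],   # observer matches model given input + explanation
    sim_correct_input_only: List[bool],  # observer matches model given input alone
    expl_leaks_label: List[bool],        # observer matches model given explanation alone
) -> float:
    def subset_gain(leaking: bool) -> float:
        idx = [i for i, flag in enumerate(expl_leaks_label) if flag == leaking]
        if not idx:
            return 0.0
        with_e = mean(sim_correct_with_expl[i] for i in idx)
        without_e = mean(sim_correct_input_only[i] for i in idx)
        return with_e - without_e

    # Average the explanation's benefit over leaking and non-leaking subsets,
    # so label-leaking explanations cannot dominate the score.
    return mean([subset_gain(True), subset_gain(False)])
```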
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.