Do Models Explain Themselves? Counterfactual Simulatability of Natural
Language Explanations
- URL: http://arxiv.org/abs/2307.08678v1
- Date: Mon, 17 Jul 2023 17:41:47 GMT
- Title: Do Models Explain Themselves? Counterfactual Simulatability of Natural
Language Explanations
- Authors: Yanda Chen, Ruiqi Zhong, Narutatsu Ri, Chen Zhao, He He, Jacob
Steinhardt, Zhou Yu, Kathleen McKeown
- Abstract summary: Large language models (LLMs) are trained to imitate humans to explain human decisions.
We evaluate whether an explanation can enable humans to precisely infer the model's outputs on diverse counterfactuals.
We found that LLMs' explanations have low precision and that precision does not correlate with plausibility.
- Score: 62.61495090463084
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) are trained to imitate humans to explain human
decisions. However, do LLMs explain themselves? Can they help humans build
mental models of how LLMs process different inputs? To answer these questions,
we propose to evaluate $\textbf{counterfactual simulatability}$ of natural
language explanations: whether an explanation can enable humans to precisely
infer the model's outputs on diverse counterfactuals of the explained input.
For example, if a model answers "yes" to the input question "Can eagles fly?"
with the explanation "all birds can fly", then humans would infer from the
explanation that it would also answer "yes" to the counterfactual input "Can
penguins fly?". If the explanation is precise, then the model's answer should
match humans' expectations.
We implemented two metrics based on counterfactual simulatability: precision
and generality. We generated diverse counterfactuals automatically using LLMs.
We then used these metrics to evaluate state-of-the-art LLMs (e.g., GPT-4) on
two tasks: multi-hop factual reasoning and reward modeling. We found that LLMs'
explanations have low precision and that precision does not correlate with
plausibility. Therefore, naively optimizing human approvals (e.g., RLHF) may
not be a sufficient solution.
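As a rough illustration of the precision metric described above, the sketch below computes the fraction of generated counterfactuals on which the answer an observer infers from the model's explanation matches the model's actual answer. The helper names (`model_answer`, `simulate_from_explanation`) are hypothetical placeholders rather than the authors' released code, and counterfactuals whose answer cannot be inferred from the explanation are assumed to have been filtered out upstream.

```python
# Minimal sketch of the precision metric for counterfactual simulatability.
# The callables below are hypothetical stand-ins: `model_answer` queries the
# explained model, and `simulate_from_explanation` plays the role of a human
# inferring the model's answer to a counterfactual from its explanation.
from typing import Callable, List


def counterfactual_precision(
    model_answer: Callable[[str], str],
    simulate_from_explanation: Callable[[str, str], str],
    explanation: str,
    counterfactuals: List[str],
) -> float:
    """Fraction of counterfactual inputs on which the answer inferred from
    the explanation matches the model's actual answer."""
    if not counterfactuals:
        return 0.0
    matches = sum(
        simulate_from_explanation(explanation, cf) == model_answer(cf)
        for cf in counterfactuals
    )
    return matches / len(counterfactuals)


# Toy reading of the eagle/penguin example: the explanation "all birds can fly"
# leads an observer to infer "yes" for "Can penguins fly?"; if the model
# actually answers "no", that counterfactual lowers precision.
```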
Related papers
- P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains [97.25943550933829]
We present P-FOLIO, a human-annotated dataset consisting of diverse and complex reasoning chains.
We use P-FOLIO to evaluate and improve large-language-model (LLM) reasoning capabilities.
arXiv Detail & Related papers (2024-10-11T19:22:57Z)
- Explanation sensitivity to the randomness of large language models: the case of journalistic text classification [6.240875403446504]
We study the effect of random elements in the training of large language models on the explainability of their predictions.
Using a fine-tuned CamemBERT model and an explanation method based on relevance propagation, we find that training with different random seeds produces models with similar accuracy but variable explanations.
arXiv Detail & Related papers (2024-10-07T14:39:45Z)
- Comparing zero-shot self-explanations with human rationales in multilingual text classification [5.32539007352208]
Instruction-tuned LLMs can generate self-explanations that require neither additional computation nor the application of possibly complex XAI methods.
We analyse whether this ability results in a good explanation by evaluating self-explanations in the form of input rationales.
Our results show that self-explanations align more closely with human annotations than LRP (layer-wise relevance propagation) attributions, while maintaining a comparable level of faithfulness.
arXiv Detail & Related papers (2024-10-04T10:14:12Z)
- Towards Consistent Natural-Language Explanations via Explanation-Consistency Finetuning [66.87754065127714]
Large language models (LLMs) often generate convincing, fluent explanations, but these explanations are frequently inconsistent across different inputs.
We propose explanation-consistency finetuning (EC-finetuning) to generate consistent natural-language explanations.
arXiv Detail & Related papers (2024-01-25T07:04:30Z)
- Towards a Mechanistic Interpretation of Multi-Step Reasoning Capabilities of Language Models [107.07851578154242]
Language models (LMs) have strong multi-step (i.e., procedural) reasoning capabilities.
It is unclear whether LMs perform these tasks by relying on answers memorized from the pretraining corpus or via a genuine multi-step reasoning mechanism.
We show that the proposed probing approach, MechanisticProbe, is able to recover the reasoning tree from the model's attention patterns for most examples.
arXiv Detail & Related papers (2023-10-23T01:47:29Z)
- Can Large Language Models Explain Themselves? A Study of LLM-Generated Self-Explanations [14.685170467182369]
Large language models (LLMs) such as ChatGPT have demonstrated superior performance on a variety of natural language processing (NLP) tasks.
Since these models are instruction-tuned on human conversations to produce "helpful" responses, they can and often will produce explanations along with the response.
arXiv Detail & Related papers (2023-10-17T12:34:32Z)
- Learning to Scaffold: Optimizing Model Explanations for Teaching [74.25464914078826]
We train models on three natural language processing and computer vision tasks.
We find that students trained with explanations extracted with our framework are able to simulate the teacher significantly more effectively than ones produced with previous methods.
arXiv Detail & Related papers (2022-04-22T16:43:39Z)
- Leakage-Adjusted Simulatability: Can Models Generate Non-Trivial Explanations of Their Behavior in Natural Language? [86.60613602337246]
We introduce a leakage-adjusted simulatability (LAS) metric for evaluating NL explanations.
LAS measures how well explanations help an observer predict a model's output, while controlling for how explanations can directly leak the output.
We frame explanation generation as a multi-agent game and optimize explanations for simulatability while penalizing label leakage.
arXiv Detail & Related papers (2020-10-08T16:59:07Z)
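For contrast with the precision metric sketched earlier, below is a minimal sketch of how a LAS-style score, as summarized in the last entry above, could be computed. It assumes the common formulation: the explanation's benefit to an observer (simulation accuracy with versus without the explanation) is averaged over "leaking" and "non-leaking" subsets, so explanations that merely restate the label do not receive full credit. The function and argument names are illustrative, and the exact weighting may differ from the paper.

```python
# Hedged sketch of a leakage-adjusted simulatability (LAS) style score.
# Inputs are per-example booleans; names are illustrative, not the paper's code.
from statistics import mean
from typing import List


def las_score(
    sim_correct_with_expl: List[bool],   # observer matches model given input + explanation
    sim_correct_input_only: List[bool],  # observer matches model given input alone
    expl_leaks_label: List[bool],        # observer matches model given explanation alone
) -> float:
    def subset_gain(leaking: bool) -> float:
        idx = [i for i, flag in enumerate(expl_leaks_label) if flag == leaking]
        if not idx:
            return 0.0
        with_e = mean(sim_correct_with_expl[i] for i in idx)
        without_e = mean(sim_correct_input_only[i] for i in idx)
        return with_e - without_e

    # Average the explanation's benefit over leaking and non-leaking subsets,
    # so label-leaking explanations cannot dominate the score.
    return mean([subset_gain(True), subset_gain(False)])
```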
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.