New Faithfulness-Centric Interpretability Paradigms for Natural Language Processing
- URL: http://arxiv.org/abs/2411.17992v1
- Date: Wed, 27 Nov 2024 02:17:34 GMT
- Title: New Faithfulness-Centric Interpretability Paradigms for Natural Language Processing
- Authors: Andreas Madsen
- Abstract summary: This thesis investigates the question "How to provide and ensure faithful explanations for complex general-purpose neural NLP models?"
The two new paradigms explored are faithfulness measurable models (FMMs) and self-explanations.
We find that FMMs yield explanations that are near the theoretical optimum in terms of faithfulness.
- Score: 4.813533076849816
- License:
- Abstract: As machine learning becomes more widespread and is used in more critical applications, it is important to provide explanations for these models to prevent unintended behavior. Unfortunately, many current interpretability methods struggle with faithfulness. Therefore, this Ph.D. thesis investigates the question "How to provide and ensure faithful explanations for complex general-purpose neural NLP models?" The main thesis is that we should develop new paradigms in interpretability. This is achieved by first developing solid faithfulness metrics and then applying the lessons learned from that investigation to develop new paradigms. The two new paradigms explored are faithfulness measurable models (FMMs) and self-explanations. The idea behind self-explanations is to have large language models explain themselves; we identify that current models are not capable of doing this consistently, but we suggest how it could be achieved. The idea behind FMMs is to create models designed such that measuring faithfulness is cheap and precise, which makes it possible to optimize an explanation towards maximum faithfulness; FMMs are thus designed to be explained. We find that FMMs yield explanations that are near the theoretical optimum in terms of faithfulness. Overall, across all investigations of faithfulness, the results show that post-hoc and intrinsic explanations are by default model- and task-dependent. However, this was not the case when using FMMs, even with the same post-hoc explanation methods. This shows that even simple modifications to the model, such as randomly masking the training dataset as is done in FMMs, can drastically change the situation and result in consistently faithful explanations. This answers the question of how to provide and ensure faithful explanations.
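The abstract describes the FMM recipe only at a high level: fine-tune with randomly masked inputs so that masking at evaluation time stays in-distribution, then score an explanation by how much masking its highest-ranked tokens changes the prediction, and optimize the explanation toward that score. Below is a minimal illustrative sketch of such an erasure-style setup, not the thesis's actual implementation; `predict_proba`, the `[MASK]` token, the masking rate, and `top_k` are hypothetical placeholders.

```python
import random

MASK = "[MASK]"

def randomly_mask(tokens, mask_rate=0.15, rng=random):
    """Randomly replace tokens with MASK (intended for training inputs during
    fine-tuning), so that masked inputs stay in-distribution at evaluation time."""
    return [MASK if rng.random() < mask_rate else tok for tok in tokens]

def faithfulness_gap(predict_proba, tokens, importance, label, top_k=3):
    """Erasure-style faithfulness check: mask the top_k tokens the explanation
    ranks as most important and measure how much the probability of the
    originally predicted label drops. A larger drop suggests a more faithful
    explanation, and an explanation can be optimized to maximize this drop."""
    ranked = sorted(range(len(tokens)), key=lambda i: importance[i], reverse=True)
    masked = list(tokens)
    for i in ranked[:top_k]:
        masked[i] = MASK
    return predict_proba(tokens)[label] - predict_proba(masked)[label]
```

In this framing, any post-hoc attribution method can supply `importance`; the masked fine-tuning is what keeps masked inputs in-distribution, so the resulting score is cheap to compute and can be maximized directly.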
Related papers
- Towards Faithful Natural Language Explanations: A Study Using Activation Patching in Large Language Models [29.67884478799914]
Large Language Models (LLMs) are capable of generating persuasive Natural Language Explanations (NLEs) to justify their answers.
Recent studies have proposed various methods to measure the faithfulness of NLEs, typically by inserting perturbations at the explanation or feature level.
We argue that these approaches are neither comprehensive nor correctly designed according to the established definition of faithfulness.
arXiv Detail & Related papers (2024-10-18T03:45:42Z)
- Evaluating the Reliability of Self-Explanations in Large Language Models [2.8894038270224867]
We evaluate two kinds of such self-explanations - extractive and counterfactual.
Our findings reveal that, while these self-explanations can correlate with human judgement, they do not fully and accurately follow the model's decision process.
We show that this gap can be bridged because prompting LLMs for counterfactual explanations can produce faithful, informative, and easy-to-verify results.
arXiv Detail & Related papers (2024-07-19T17:41:08Z)
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
- Interpretability Needs a New Paradigm [49.134097841837715]
Interpretability is the study of explaining models in understandable terms to humans.
At the core of the debate between interpretability paradigms is how each paradigm ensures its explanations are faithful, i.e., true to the model's behavior.
This paper's position is that we should think about new paradigms while staying vigilant regarding faithfulness.
arXiv Detail & Related papers (2024-05-08T19:31:06Z)
- Faithful Model Explanations through Energy-Constrained Conformal Counterfactuals [16.67633872254042]
Counterfactual explanations offer an intuitive and straightforward way to explain black-box models.
Existing work has primarily relied on surrogate models to learn how the input data is distributed.
We propose a novel algorithmic framework for generating Energy-Constrained Conformal Counterfactuals that are only as plausible as the model permits.
arXiv Detail & Related papers (2023-12-17T08:24:44Z)
- Faithfulness Tests for Natural Language Explanations [87.01093277918599]
Explanations of neural models aim to reveal a model's decision-making process for its predictions.
Recent work shows that current explanation methods, such as saliency maps or counterfactuals, can be misleading.
This work explores the challenging question of evaluating the faithfulness of natural language explanations.
arXiv Detail & Related papers (2023-05-29T11:40:37Z)
- Motif-guided Time Series Counterfactual Explanations [1.1510009152620664]
We propose a novel model that generates intuitive post-hoc counterfactual explanations.
We validated our model using five real-world time-series datasets from the UCR repository.
arXiv Detail & Related papers (2022-11-08T17:56:50Z)
- Logical Satisfiability of Counterfactuals for Faithful Explanations in NLI [60.142926537264714]
We introduce the methodology of Faithfulness-through-Counterfactuals.
It generates a counterfactual hypothesis based on the logical predicates expressed in the explanation.
It then evaluates if the model's prediction on the counterfactual is consistent with that expressed logic.
arXiv Detail & Related papers (2022-05-25T03:40:59Z)
- Why Do Pretrained Language Models Help in Downstream Tasks? An Analysis of Head and Prompt Tuning [66.44344616836158]
We propose an analysis framework that links the pretraining and downstream tasks with an underlying latent variable generative model of text.
We show that 1) under certain non-degeneracy conditions on the HMM, simple classification heads can solve the downstream task, 2) prompt tuning obtains downstream guarantees with weaker non-degeneracy conditions, and 3) our recovery guarantees for the memory-augmented HMM are stronger than for the vanilla HMM.
arXiv Detail & Related papers (2021-06-17T03:31:47Z)
- Prompting Contrastive Explanations for Commonsense Reasoning Tasks [74.7346558082693]
Large pretrained language models (PLMs) can achieve near-human performance on commonsense reasoning tasks.
We show how to use these same models to generate human-interpretable evidence.
arXiv Detail & Related papers (2021-06-12T17:06:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.