Interpretability Needs a New Paradigm
- URL: http://arxiv.org/abs/2405.05386v1
- Date: Wed, 8 May 2024 19:31:06 GMT
- Title: Interpretability Needs a New Paradigm
- Authors: Andreas Madsen, Himabindu Lakkaraju, Siva Reddy, Sarath Chandar
- Abstract summary: Interpretability is the study of explaining models in understandable terms to humans.
The field is currently split between an intrinsic paradigm and a post-hoc paradigm, and at the core of this debate is how each paradigm ensures its explanations are faithful, i.e., true to the model's behavior.
This paper's position is that we should think about new paradigms while staying vigilant regarding faithfulness.
- Score: 49.134097841837715
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Interpretability is the study of explaining models in understandable terms to humans. At present, interpretability is divided into two paradigms: the intrinsic paradigm, which believes that only models designed to be explained can be explained, and the post-hoc paradigm, which believes that black-box models can be explained. At the core of this debate is how each paradigm ensures its explanations are faithful, i.e., true to the model's behavior. This is important, as false but convincing explanations lead to unsupported confidence in artificial intelligence (AI), which can be dangerous. This paper's position is that we should think about new paradigms while staying vigilant regarding faithfulness. First, by examining the history of paradigms in science, we see that paradigms are constantly evolving. Then, by examining the current paradigms, we can understand their underlying beliefs, the value they bring, and their limitations. Finally, this paper presents 3 emerging paradigms for interpretability. The first paradigm designs models such that faithfulness can be easily measured. Another optimizes models such that explanations become faithful. The last paradigm proposes to develop models that produce both a prediction and an explanation.
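The third of these emerging paradigms, models that produce both a prediction and an explanation, can be pictured with a small sketch. Everything below (the two-head architecture, the layer sizes, and the PredictAndExplain name) is an illustrative assumption made for this listing, not the paper's concrete proposal:

```python
# Minimal sketch of a "predict-and-explain" model: one head outputs the task
# prediction, a second head outputs per-feature relevance scores as the explanation.
# Architecture and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class PredictAndExplain(nn.Module):
    def __init__(self, n_features: int, n_classes: int, hidden: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU())
        self.predict_head = nn.Linear(hidden, n_classes)   # task prediction
        self.explain_head = nn.Linear(hidden, n_features)  # per-feature relevance

    def forward(self, x):
        h = self.encoder(x)
        logits = self.predict_head(h)
        relevance = torch.sigmoid(self.explain_head(h))    # explanation scores in [0, 1]
        return logits, relevance

model = PredictAndExplain(n_features=10, n_classes=3)
x = torch.randn(4, 10)                # a batch of 4 examples
logits, relevance = model(x)
print(logits.shape, relevance.shape)  # torch.Size([4, 3]) torch.Size([4, 10])
```

Whether such joint outputs are faithful is exactly the concern the paper raises: producing an explanation alongside a prediction does not by itself guarantee that the explanation reflects the computation behind the prediction.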
Related papers
- Faithful Model Explanations through Energy-Constrained Conformal Counterfactuals [16.67633872254042]
Counterfactual explanations offer an intuitive and straightforward way to explain black-box models.
Existing work has primarily relied on surrogate models to learn how the input data is distributed.
We propose a novel algorithmic framework for generating Energy-Constrained Conformal Counterfactuals that are only as plausible as the model permits.
arXiv Detail & Related papers (2023-12-17T08:24:44Z)
- Explainability for Large Language Models: A Survey [59.67574757137078]
Large language models (LLMs) have demonstrated impressive capabilities in natural language processing.
This paper introduces a taxonomy of explainability techniques and provides a structured overview of methods for explaining Transformer-based language models.
arXiv Detail & Related papers (2023-09-02T22:14:26Z)
- Overthinking the Truth: Understanding how Language Models Process False Demonstrations [32.29658741345911]
We study harmful imitation through the lens of a model's internal representations.
We identify two related phenomena: "overthinking" and "false induction heads".
arXiv Detail & Related papers (2023-07-18T17:56:50Z)
- Eight challenges in developing theory of intelligence [3.0349733976070024]
A good theory of mathematical beauty is more practical than any current observation, as new predictions of physical reality can be verified self-consistently.
Here, we shed light on eight challenges in developing a theory of intelligence following this theoretical paradigm.
arXiv Detail & Related papers (2023-06-20T01:45:42Z)
- Beware the Rationalization Trap! When Language Model Explainability Diverges from our Mental Models of Language [9.501243481182351]
Language models learn and represent language differently than humans; they learn the form and not the meaning.
To assess the success of language model explainability, we need to consider the impact of its divergence from a user's mental model of language.
arXiv Detail & Related papers (2022-07-14T13:26:03Z)
- Logical Satisfiability of Counterfactuals for Faithful Explanations in NLI [60.142926537264714]
We introduce the methodology of Faithfulness-through-Counterfactuals.
It generates a counterfactual hypothesis based on the logical predicates expressed in the explanation.
It then evaluates if the model's prediction on the counterfactual is consistent with that expressed logic.
arXiv Detail & Related papers (2022-05-25T03:40:59Z)
- Learning to Scaffold: Optimizing Model Explanations for Teaching [74.25464914078826]
We train models on three natural language processing and computer vision tasks.
We find that students trained with explanations extracted with our framework are able to simulate the teacher significantly more effectively than students trained with explanations produced by previous methods.
arXiv Detail & Related papers (2022-04-22T16:43:39Z)
- Do Language Models Have Beliefs? Methods for Detecting, Updating, and Visualizing Model Beliefs [76.6325846350907]
Dennett (1995) famously argues that even thermostats have beliefs, on the view that a belief is simply an informational state decoupled from any motivational state.
In this paper, we discuss approaches to detecting when models have beliefs about the world, and we improve on methods for updating model beliefs to be more truthful.
arXiv Detail & Related papers (2021-11-26T18:33:59Z)
- Modeling Event Plausibility with Consistent Conceptual Abstraction [29.69958315418181]
We show that Transformer-based plausibility models are markedly inconsistent across the conceptual classes of a lexical hierarchy.
We present a simple post-hoc method of forcing model consistency that improves correlation with human plausibility judgments.
arXiv Detail & Related papers (2021-04-20T21:08:32Z)
- The Struggles of Feature-Based Explanations: Shapley Values vs. Minimal Sufficient Subsets [61.66584140190247]
We show that feature-based explanations pose problems even for explaining trivial models.
We show that two popular classes of explainers, Shapley explainers and minimal sufficient subsets explainers, target fundamentally different types of ground-truth explanations.
arXiv Detail & Related papers (2020-09-23T09:45:23Z)
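As an aside on the last entry, a tiny worked example shows how Shapley values and minimal sufficient subsets can point at different explanations even for a trivial model. The OR-model, the zero baseline, and the helper names below are illustrative assumptions, not the paper's experimental setup:

```python
# Exact Shapley values for a trivial 2-feature OR model, contrasted with a
# minimal sufficient subset. Model, baseline, and names are illustrative assumptions.
from itertools import combinations
from math import factorial

def f(x1, x2):
    return int(x1 or x2)  # trivial "OR" model

def value(subset, x, baseline=(0, 0)):
    # Model output when only features in `subset` keep their true value.
    inp = [x[i] if i in subset else baseline[i] for i in range(len(x))]
    return f(*inp)

def shapley(x):
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for s in combinations(others, size):
                weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
                phi[i] += weight * (value(set(s) | {i}, x) - value(set(s), x))
    return phi

x = (1, 1)
print("Shapley values:", shapley(x))  # [0.5, 0.5] -- credit is shared equally
# A minimal-sufficient-subset explainer would instead return {x1} (or {x2}) alone,
# since either feature by itself already forces the prediction to 1.
```

The two explainers disagree here not because one is wrong but because they answer different questions, which is the sense in which they target fundamentally different types of ground-truth explanations.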
This list is automatically generated from the titles and abstracts of the papers in this site.