Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii
- URL: http://arxiv.org/abs/2505.01372v1
- Date: Fri, 02 May 2025 16:18:40 GMT
- Title: Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii
- Authors: Kola Ayonrinde, Louis Jaburi
- Abstract summary: Mechanistic Interpretability aims to understand neural networks through causal explanations. Progress has been limited by the lack of a universal approach to evaluating explanations. We introduce a pluralist Explanatory Virtues Framework to systematically evaluate and improve explanations in MI.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mechanistic Interpretability (MI) aims to understand neural networks through causal explanations. Though MI has many explanation-generating methods, progress has been limited by the lack of a universal approach to evaluating explanations. Here we analyse the fundamental question "What makes a good explanation?" We introduce a pluralist Explanatory Virtues Framework drawing on four perspectives from the Philosophy of Science - the Bayesian, Kuhnian, Deutschian, and Nomological - to systematically evaluate and improve explanations in MI. We find that Compact Proofs consider many explanatory virtues and are hence a promising approach. Fruitful research directions implied by our framework include (1) clearly defining explanatory simplicity, (2) focusing on unifying explanations and (3) deriving universal principles for neural networks. Improved MI methods enhance our ability to monitor, predict, and steer AI systems.
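The abstract names four philosophical perspectives but gives no scoring procedure here, so the following is only a minimal illustrative sketch (not the paper's method): it assumes a 0-to-1 score per perspective and a weighted average for aggregation, and every class name, field, weight, and example score is a hypothetical placeholder.

```python
# Illustrative only: a toy rubric for comparing explanations along the four
# explanatory-virtue perspectives named in the abstract. The virtues attached
# to each perspective in the comments, and all numbers, are assumptions.
from dataclasses import dataclass


@dataclass
class VirtueScores:
    """Scores in [0, 1] per perspective; higher is better."""
    bayesian: float      # e.g. predictive fit / accuracy of the explanation
    kuhnian: float       # e.g. coherence with the surrounding research paradigm
    deutschian: float    # e.g. hard-to-vary structure with no arbitrary parts
    nomological: float   # e.g. appeal to general, law-like principles


def aggregate(scores: VirtueScores, weights: dict[str, float] | None = None) -> float:
    """Weighted average over the four perspectives (uniform weights by default)."""
    weights = weights or {"bayesian": 1.0, "kuhnian": 1.0, "deutschian": 1.0, "nomological": 1.0}
    return sum(getattr(scores, name) * w for name, w in weights.items()) / sum(weights.values())


if __name__ == "__main__":
    # Hypothetical comparison: a compact-proof-style explanation vs. an ad hoc feature story.
    compact_proof = VirtueScores(bayesian=0.9, kuhnian=0.7, deutschian=0.8, nomological=0.8)
    ad_hoc_story = VirtueScores(bayesian=0.6, kuhnian=0.5, deutschian=0.2, nomological=0.1)
    print(f"compact proof: {aggregate(compact_proof):.2f}")   # ~0.80
    print(f"ad hoc story:  {aggregate(ad_hoc_story):.2f}")    # ~0.35
```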
Related papers
- Mechanistic Interpretability Needs Philosophy [32.28998520468988]
We argue that mechanistic interpretability needs philosophy: not as an afterthought, but as an ongoing partner in clarifying its concepts. This position paper illustrates the value philosophy can add to MI research, and outlines a path toward deeper interdisciplinary dialogue.
arXiv Detail & Related papers (2025-06-23T17:13:30Z) - A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i [0.0]
We argue that Mechanistic Interpretability research is a principled approach to understanding models. We show that Explanatory Faithfulness, an assessment of how well an explanation fits a model, is well-defined.
arXiv Detail & Related papers (2025-05-01T19:08:34Z) - Explainers' Mental Representations of Explainees' Needs in Everyday Explanations [0.0]
In explanations, explainers have mental representations of explainees' developing knowledge and shifting interests regarding the explanandum.
XAI should be able to react to explainees' needs in a similar manner.
This study investigated explainers' mental representations in everyday explanations of technological artifacts.
arXiv Detail & Related papers (2024-11-13T10:53:07Z) - A Mechanistic Explanatory Strategy for XAI [0.0]
This paper outlines a mechanistic strategy for explaining the functional organization of deep learning systems. The findings suggest that pursuing mechanistic explanations can uncover elements that traditional explainability techniques may overlook.
arXiv Detail & Related papers (2024-11-02T18:30:32Z) - Towards a Mechanistic Interpretation of Multi-Step Reasoning Capabilities of Language Models [107.07851578154242]
Language models (LMs) have strong multi-step (i.e., procedural) reasoning capabilities.
It is unclear whether LMs perform these tasks by cheating with answers memorized from the pretraining corpus or via a genuine multi-step reasoning mechanism.
We show that MechanisticProbe, a probing approach, is able to detect information about the reasoning tree from the model's attentions for most examples.
arXiv Detail & Related papers (2023-10-23T01:47:29Z) - Explainability for Large Language Models: A Survey [59.67574757137078]
Large language models (LLMs) have demonstrated impressive capabilities in natural language processing.
This paper introduces a taxonomy of explainability techniques and provides a structured overview of methods for explaining Transformer-based language models.
arXiv Detail & Related papers (2023-09-02T22:14:26Z) - A Theoretical Framework for AI Models Explainability with Application in Biomedicine [3.5742391373143474]
We propose a novel definition of explanation that is a synthesis of what can be found in the literature.
We characterise explanations through the properties of faithfulness (i.e., the explanation being a true description of the model's inner workings and decision-making process) and plausibility (i.e., how convincing the explanation appears to the user).
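As a purely illustrative sketch of that distinction (not code from the cited paper): the toy below evaluates a hypothetical explanation of a three-feature linear model, approximating faithfulness by how much output variance disappears when the claimed features are zero-ablated, and standing in for plausibility with a model-independent "user judgement". The weights, the ablation proxy, and the hard-coded plausibility values are all assumptions.

```python
# Illustrative only: faithfulness tracks the model's actual computation,
# plausibility tracks what a user finds convincing; the two can diverge.
import numpy as np

rng = np.random.default_rng(0)
W = np.array([2.0, 0.0, -1.0])                 # toy linear model: y = W . x


def model(x: np.ndarray) -> float:
    return float(W @ x)


claimed = [0, 1]                               # hypothetical explanation: "features 0 and 1 drive the output"


def faithfulness(claimed: list[int], n: int = 500) -> float:
    """Share of output variance removed when the claimed features are zero-ablated.
    High only if the claimed features really carry the model's computation."""
    xs = rng.normal(size=(n, 3))
    full = np.array([model(x) for x in xs])
    ablated = xs.copy()
    ablated[:, claimed] = 0.0
    reduced = np.array([model(x) for x in ablated])
    return 1.0 - reduced.var() / full.var()


def plausibility(claimed: list[int]) -> float:
    """Stand-in for a human judgement that ignores the model's internals."""
    return 0.9 if claimed == [0, 1] else 0.5


print(f"faithfulness ~ {faithfulness(claimed):.2f}")  # ~0.8: feature 1 is inert and feature 2 is missed
print(f"plausibility = {plausibility(claimed):.2f}")  # 0.90 regardless of the model
```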
arXiv Detail & Related papers (2022-12-29T20:05:26Z) - NELLIE: A Neuro-Symbolic Inference Engine for Grounded, Compositional, and Explainable Reasoning [59.16962123636579]
This paper proposes a new take on Prolog-based inference engines.
We replace handcrafted rules with a combination of neural language modeling, guided generation, and semiparametric dense retrieval.
Our implementation, NELLIE, is the first system to demonstrate fully interpretable, end-to-end grounded QA.
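As a minimal sketch of the general idea (not NELLIE's actual code): a Prolog-style backward chainer in which a handcrafted rule base is replaced by a stubbed decomposition generator and a toy retrieval corpus; every function name, the two-premise decomposition format, and the example facts are hypothetical stand-ins.

```python
# Illustrative only: prove a hypothesis by grounding it directly in a corpus
# (a stand-in for dense retrieval) or by recursively proving the premises of a
# generated decomposition (a stand-in for guided neural generation).
CORPUS = {
    "iron is a metal",
    "metals conduct electricity",
}


def retrieve(hypothesis: str) -> bool:
    """Toy retrieval: exact match against a tiny fact corpus."""
    return hypothesis in CORPUS


def generate_decompositions(hypothesis: str) -> list[tuple[str, str]]:
    """Toy generator of two-premise decompositions (here, a lookup table)."""
    table = {
        "iron conducts electricity": [("iron is a metal", "metals conduct electricity")],
    }
    return table.get(hypothesis, [])


def prove(hypothesis: str, depth: int = 3):
    """Backward chaining: return an entailment tree for the hypothesis, or None."""
    if retrieve(hypothesis):
        return hypothesis                      # leaf: directly grounded fact
    if depth == 0:
        return None
    for p1, p2 in generate_decompositions(hypothesis):
        t1, t2 = prove(p1, depth - 1), prove(p2, depth - 1)
        if t1 is not None and t2 is not None:
            return (hypothesis, t1, t2)        # internal node with two supporting subtrees
    return None


print(prove("iron conducts electricity"))
# ('iron conducts electricity', 'iron is a metal', 'metals conduct electricity')
```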
arXiv Detail & Related papers (2022-09-16T00:54:44Z) - Scientific Explanation and Natural Language: A Unified Epistemological-Linguistic Perspective for Explainable AI [2.7920304852537536]
This paper focuses on the scientific domain, aiming to bridge the gap between theory and practice on the notion of a scientific explanation.
Through a mixture of quantitative and qualitative methodologies, the study derives its main conclusions about the notion of a scientific explanation in XAI.
arXiv Detail & Related papers (2022-05-03T22:31:42Z) - Towards Interpretable Natural Language Understanding with Explanations as Latent Variables [146.83882632854485]
We develop a framework for interpretable natural language understanding that requires only a small set of human-annotated explanations for training.
Our framework treats natural language explanations as latent variables that model the underlying reasoning process of a neural model.
arXiv Detail & Related papers (2020-10-24T02:05:56Z) - The Struggles of Feature-Based Explanations: Shapley Values vs. Minimal Sufficient Subsets [61.66584140190247]
We show that feature-based explanations pose problems even for explaining trivial models.
We show that two popular classes of explainers, Shapley explainers and minimal sufficient subsets explainers, target fundamentally different types of ground-truth explanations.
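A minimal sketch of that divergence on a trivial model (not taken from the paper): for f(x1, x2) = x1 OR x2 at the input (1, 1), exact Shapley values split credit evenly between the two features, while the minimal sufficient subsets are {x1} and {x2}, each alone enough to fix the prediction. The all-zeros baseline for "missing" features is an assumption of this toy setup, not the paper's exact definition.

```python
# Illustrative only: Shapley values and minimal sufficient subsets answer
# different questions even for a two-feature OR model.
from itertools import chain, combinations

FEATURES = (0, 1)
BASELINE = (0, 0)                              # assumed value for "missing" features


def model(x) -> int:
    return int(x[0] or x[1])                   # trivial model: logical OR


def value(subset, x) -> int:
    """Model output when only the features in `subset` take their true values."""
    return model(tuple(x[i] if i in subset else BASELINE[i] for i in FEATURES))


def shapley(x) -> dict[int, float]:
    """Exact Shapley values by averaging marginal contributions over both orderings."""
    phi = {i: 0.0 for i in FEATURES}
    orderings = [(0, 1), (1, 0)]
    for order in orderings:
        present: set[int] = set()
        for i in order:
            before = value(present, x)
            present.add(i)
            phi[i] += (value(present, x) - before) / len(orderings)
    return phi


def minimal_sufficient_subsets(x) -> list[set[int]]:
    """Subsets that alone reproduce the prediction, with no smaller subset doing so."""
    candidates = chain.from_iterable(combinations(FEATURES, r) for r in (1, 2))
    sufficient = [set(s) for s in candidates if value(set(s), x) == model(x)]
    return [s for s in sufficient if not any(t < s for t in sufficient)]


x = (1, 1)
print("Shapley values:", shapley(x))                                 # {0: 0.5, 1: 0.5}
print("Minimal sufficient subsets:", minimal_sufficient_subsets(x))  # [{0}, {1}]
```

Both outputs are sensible answers to different questions, which is the sense in which the two explainer families target different ground-truth explanations.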
arXiv Detail & Related papers (2020-09-23T09:45:23Z) - A general framework for scientifically inspired explanations in AI [76.48625630211943]
We instantiate the concept of the structure of scientific explanation as the theoretical underpinning for a general framework in which explanations for AI systems can be implemented.
This framework aims to provide the tools to build a "mental model" of any AI system, so that interaction with the user can provide information on demand and come closer to the nature of human-made explanations.
arXiv Detail & Related papers (2020-03-02T10:32:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.