A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i
- URL: http://arxiv.org/abs/2505.00808v1
- Date: Thu, 01 May 2025 19:08:34 GMT
- Title: A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i
- Authors: Kola Ayonrinde, Louis Jaburi
- Abstract summary: We argue that Mechanistic Interpretability research is a principled approach to understanding models. We show that Explanatory Faithfulness, an assessment of how well an explanation fits a model, is well-defined.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mechanistic Interpretability aims to understand neural networks through causal explanations. We argue for the Explanatory View Hypothesis: that Mechanistic Interpretability research is a principled approach to understanding models because neural networks contain implicit explanations which can be extracted and understood. We hence show that Explanatory Faithfulness, an assessment of how well an explanation fits a model, is well-defined. We propose a definition of Mechanistic Interpretability (MI) as the practice of producing Model-level, Ontic, Causal-Mechanistic, and Falsifiable explanations of neural networks, allowing us to distinguish MI from other interpretability paradigms and detail MI's inherent limits. We formulate the Principle of Explanatory Optimism, a conjecture which we argue is a necessary precondition for the success of Mechanistic Interpretability.
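As a purely illustrative reading of Explanatory Faithfulness (not the paper's formal definition), one can score a candidate explanation by how well the model's behaviour is reproduced when only the components the explanation names are kept active. The sketch below is an assumed operationalisation on a toy network; the toy weights, the `forward` and `faithfulness` helpers, and the zero-ablation masking are all hypothetical choices.

```python
# Hypothetical sketch of a faithfulness check for a candidate "explanation" of a
# toy network: the explanation claims a small set of hidden units is responsible
# for the model's behaviour, and faithfulness is scored as how well the
# circuit-restricted model matches the full model. Illustration only.
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer network: x -> relu(x @ W1) -> @ W2
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 1))

def forward(x, keep_hidden=None):
    """Run the toy model; optionally zero-ablate hidden units outside the explanation."""
    h = np.maximum(x @ W1, 0.0)
    if keep_hidden is not None:
        mask = np.zeros(h.shape[-1])
        mask[list(keep_hidden)] = 1.0
        h = h * mask
    return h @ W2

def faithfulness(explanation_units, n_samples=1000):
    """Agreement (1 - normalised error) between the full and circuit-only model."""
    x = rng.normal(size=(n_samples, 4))
    full = forward(x)
    restricted = forward(x, keep_hidden=explanation_units)
    err = np.mean((full - restricted) ** 2)
    scale = np.mean(full ** 2) + 1e-9
    return 1.0 - err / scale

print("faithfulness of {0, 1, 2}:", faithfulness({0, 1, 2}))
print("faithfulness of all units:", faithfulness(set(range(8))))
```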
Related papers
- Mechanistic Interpretability Needs Philosophy [32.28998520468988]
We argue that mechanistic interpretability needs philosophy: not as an afterthought, but as an ongoing partner in clarifying its concepts. This position paper illustrates the value philosophy can add to MI research, and outlines a path toward deeper interdisciplinary dialogue.
arXiv Detail & Related papers (2025-06-23T17:13:30Z) - Is the end of Insight in Sight ? [0.0]
A physics-informed neural network (PINN) is trained on a rarefied gas dynamics problem governed by the Boltzmann equation. Despite the system's clear structure and well-understood governing laws, the trained network's weights resemble Gaussian-distributed random matrices. This suggests that deep learning and traditional simulation may follow distinct cognitive paths to the same outcome.
arXiv Detail & Related papers (2025-05-07T19:57:36Z) - Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii [0.0]
Mechanistic Interpretability aims to understand neural networks through causal explanations. Progress has been limited by the lack of a universal approach to evaluating explanations. We introduce a pluralist Explanatory Virtues Framework to systematically evaluate and improve explanations in MI.
arXiv Detail & Related papers (2025-05-02T16:18:40Z) - Validating Mechanistic Interpretations: An Axiomatic Approach [28.336108192282737]
We give a set of axioms that formally characterize a mechanistic interpretation as a description that approximately captures the semantics of the neural network under analysis in a compositional manner. We demonstrate the applicability of these axioms for validating mechanistic interpretations on an existing, well-known interpretability study as well as on a new case study involving a Transformer-based model trained to solve the 2-SAT problem.
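One way to picture the requirement that an interpretation "approximately captures the semantics of the network in a compositional manner" is to assign an abstract function to each layer and check that their composition tracks the network within a tolerance. The following is a rough, assumed sketch of such a check; the toy layers, the crude linear abstractions, and the epsilon threshold are illustrative and not the paper's axioms.

```python
# Assumed sketch of one axiom-style check: a candidate interpretation assigns an
# abstract function to each layer, and the composed abstraction should approximately
# reproduce the network's behaviour (compositional, approximate semantics).
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 5)), rng.normal(size=(5, 1))

def layer1(x): return np.maximum(x @ W1, 0.0)
def layer2(h): return h @ W2
def network(x): return layer2(layer1(x))

# Candidate interpretation: abstract each layer (here, crude stand-in abstractions).
def abstract_layer1(x): return 0.5 * (x @ W1)   # hypothetical toy abstraction
def abstract_layer2(h): return h @ W2

def approximately_captures(eps=0.5, n=500):
    """Does the composed abstraction stay within eps of the network on average?"""
    x = rng.normal(size=(n, 3))
    err = np.mean(np.abs(network(x) - abstract_layer2(abstract_layer1(x))))
    return err <= eps

print("interpretation valid at eps=0.5:", approximately_captures())
```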
arXiv Detail & Related papers (2024-07-18T15:32:44Z) - Explainability for Large Language Models: A Survey [59.67574757137078]
Large language models (LLMs) have demonstrated impressive capabilities in natural language processing.
This paper introduces a taxonomy of explainability techniques and provides a structured overview of methods for explaining Transformer-based language models.
arXiv Detail & Related papers (2023-09-02T22:14:26Z) - A Theoretical Framework for AI Models Explainability with Application in
Biomedicine [3.5742391373143474]
We propose a novel definition of explanation that is a synthesis of what can be found in the literature.
We fit explanations into the properties of faithfulness (i.e., the explanation being a true description of the model's inner workings and decision-making process) and plausibility (i.e., how convincing the explanation looks to the user).
arXiv Detail & Related papers (2022-12-29T20:05:26Z) - MEGAN: Multi-Explanation Graph Attention Network [1.1470070927586016]
We propose a multi-explanation graph attention network (MEGAN).
Unlike existing graph explainability methods, our network can produce node and edge attributional explanations along multiple channels.
Our attention-based network is fully differentiable and explanations can actively be trained in an explanation-supervised manner.
arXiv Detail & Related papers (2022-11-23T16:10:13Z) - Logical Satisfiability of Counterfactuals for Faithful Explanations in
NLI [60.142926537264714]
We introduce the methodology of Faithfulness-through-Counterfactuals.
It generates a counterfactual hypothesis based on the logical predicates expressed in the explanation.
It then evaluates if the model's prediction on the counterfactual is consistent with that expressed logic.
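A minimal, assumed sketch of this kind of consistency check is given below; `predict` is a stand-in stub, and the counterfactual hypothesis and expected label are hand-written from the explanation's logic rather than generated automatically as in the paper.

```python
# Assumed sketch of a Faithfulness-through-Counterfactuals style check.
# In the paper, `predict` would be a trained NLI model and the counterfactual
# would be constructed from the logical predicates in the model's explanation.
def predict(premise: str, hypothesis: str) -> str:
    """Stand-in NLI model; always returns 'entailment' purely for illustration."""
    return "entailment"

premise = "The animal is a dog."
hypothesis = "The animal is a cat."
model_explanation = "dogs are not cats"  # explanation offered for a 'contradiction' prediction

# Counterfactual hypothesis implied by the explanation's logic (hand-written here):
counterfactual_hypothesis = "The animal is not a cat."
expected_label = "entailment"            # what the explanation's logic entails

# The explanation is judged faithful (on this probe) iff the model agrees with its own logic.
is_faithful = predict(premise, counterfactual_hypothesis) == expected_label
print("faithful under this counterfactual:", is_faithful)
```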
arXiv Detail & Related papers (2022-05-25T03:40:59Z) - Scientific Explanation and Natural Language: A Unified
Epistemological-Linguistic Perspective for Explainable AI [2.7920304852537536]
This paper focuses on the scientific domain, aiming to bridge the gap between theory and practice on the notion of a scientific explanation.
Through a mixture of quantitative and qualitative methodologies, the study derives several main conclusions.
arXiv Detail & Related papers (2022-05-03T22:31:42Z) - Quantum realism: axiomatization and quantification [77.34726150561087]
We build an axiomatization for quantum realism -- a notion of realism compatible with quantum theory.
We explicitly construct some classes of entropic quantifiers that are shown to satisfy almost all of the proposed axioms.
arXiv Detail & Related papers (2021-10-10T18:08:42Z) - CX-ToM: Counterfactual Explanations with Theory-of-Mind for Enhancing
Human Trust in Image Recognition Models [84.32751938563426]
We propose a new explainable AI (XAI) framework for explaining decisions made by a deep convolutional neural network (CNN).
In contrast to current XAI methods that generate explanations as a single-shot response, we pose explanation as an iterative communication process.
Our framework generates a sequence of explanations in a dialog by mediating the differences between the minds of the machine and the human user.
arXiv Detail & Related papers (2021-09-03T09:46:20Z) - The Struggles of Feature-Based Explanations: Shapley Values vs. Minimal
Sufficient Subsets [61.66584140190247]
We show that feature-based explanations pose problems even for explaining trivial models.
We show that two popular classes of explainers, Shapley explainers and minimal sufficient subsets explainers, target fundamentally different types of ground-truth explanations.
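The divergence is easy to see on a trivial model such as a two-input OR: Shapley values split credit between the inputs, while minimal sufficient subsets single out either input alone. The sketch below is an illustrative computation (exact two-feature Shapley values with a fixed zero baseline, brute-force sufficient subsets), not the paper's experimental setup.

```python
# Illustrative example: Shapley values vs. minimal sufficient subsets for a
# trivial OR model, f(x1, x2) = x1 or x2, explaining the input (1, 1).
from itertools import combinations

def f(x1, x2):
    return float(x1 or x2)

x = (1, 1)            # instance to explain
baseline = (0, 0)     # value features take when "absent"

def value(subset):
    """Model output when only features in `subset` take their true values."""
    args = [x[i] if i in subset else baseline[i] for i in range(2)]
    return f(*args)

# Exact Shapley values for two features.
phi1 = 0.5 * (value({0}) - value(set())) + 0.5 * (value({0, 1}) - value({1}))
phi2 = 0.5 * (value({1}) - value(set())) + 0.5 * (value({0, 1}) - value({0}))
print("Shapley values:", phi1, phi2)   # 0.5 and 0.5: credit is split

# Minimal sufficient subsets: smallest feature sets that already yield f(x).
target = f(*x)
sufficient = [set(s) for r in range(3) for s in combinations(range(2), r)
              if value(set(s)) == target]
min_size = min(len(s) for s in sufficient)
print("minimal sufficient subsets:", [s for s in sufficient if len(s) == min_size])
# {0} and {1}: either feature alone suffices, unlike the shared Shapley credit.
```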
arXiv Detail & Related papers (2020-09-23T09:45:23Z) - Logical Neural Networks [51.46602187496816]
We propose a novel framework that seamlessly provides key properties of both neural nets (learning) and symbolic logic (knowledge and reasoning).
Every neuron has a meaning as a component of a formula in a weighted real-valued logic, yielding a highly interpretable, disentangled representation.
Inference is omnidirectional rather than focused on predefined target variables, and corresponds to logical reasoning.
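A simplified, assumed illustration of "a neuron as a component of a weighted real-valued logic formula" is a weighted Łukasiewicz-style conjunction, shown below; the exact LNN activation and training rules differ, so treat this only as a flavour of the idea.

```python
# Assumed sketch of a weighted real-valued logical AND neuron, in the spirit of
# Logical Neural Networks: truth values live in [0, 1] and the neuron's output
# can be read as the degree to which the conjunction of its inputs holds.
import numpy as np

def weighted_and(truths, weights, bias=1.0):
    """Weighted Lukasiewicz-style conjunction over truth values in [0, 1]."""
    truths = np.asarray(truths, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Each input's falsity (1 - t) lowers the conjunction in proportion to its weight.
    return float(np.clip(bias - np.sum(weights * (1.0 - truths)), 0.0, 1.0))

# "raining AND cold": both mostly true -> conjunction mostly true.
print(weighted_and([0.9, 0.8], [1.0, 1.0]))   # 0.7
# Down-weighting the second input makes its falsity matter less.
print(weighted_and([0.9, 0.2], [1.0, 0.3]))   # 0.66
```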
arXiv Detail & Related papers (2020-06-23T16:55:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.