Faithful and Stable Neuron Explanations for Trustworthy Mechanistic Interpretability
- URL: http://arxiv.org/abs/2512.18092v1
- Date: Fri, 19 Dec 2025 21:55:17 GMT
- Title: Faithful and Stable Neuron Explanations for Trustworthy Mechanistic Interpretability
- Authors: Ge Yan, Tuomas Oikarinen, Tsui-Wei Weng
- Abstract summary: We argue that neuron identification can be viewed as the inverse process of machine learning. We present the first theoretical analysis of two fundamental challenges: faithfulness and stability. Experiments on both synthetic and real data validate our theoretical results and demonstrate the practicality of our method.
- Score: 2.566497773003048
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neuron identification is a popular tool in mechanistic interpretability, aiming to uncover the human-interpretable concepts represented by individual neurons in deep networks. While algorithms such as Network Dissection and CLIP-Dissect achieve great empirical success, a rigorous theoretical foundation, which is crucial for trustworthy and reliable explanations, remains absent. In this work, we observe that neuron identification can be viewed as the inverse process of machine learning, which allows us to derive guarantees for neuron explanations. Based on this insight, we present the first theoretical analysis of two fundamental challenges: (1) Faithfulness: whether the identified concept faithfully represents the neuron's underlying function, and (2) Stability: whether the identification results are consistent across probing datasets. We derive generalization bounds for widely used similarity metrics (e.g., accuracy, AUROC, IoU) to guarantee faithfulness, and propose a bootstrap ensemble procedure that quantifies stability, along with the BE (Bootstrap Explanation) method to generate concept prediction sets with guaranteed coverage probability. Experiments on both synthetic and real data validate our theoretical results and demonstrate the practicality of our method, providing an important step toward trustworthy neuron identification.
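To make the stability idea in the abstract concrete, here is a minimal sketch of a generic bootstrap ensemble over a probing dataset. It is not the authors' BE method and carries no coverage guarantee: the function names (`iou`, `best_concept`, `bootstrap_prediction_set`), the synthetic "cat"/"dog" concepts, and the rule of keeping the most frequent concepts until their vote share reaches 1 - alpha are all assumptions made for illustration.

```python
import random
from collections import Counter


def iou(a, b):
    """Intersection over union of two equal-length binary sequences."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 0.0


def best_concept(neuron, concepts, idx):
    """Name of the concept with the highest IoU on the samples in idx."""
    fired = [neuron[i] for i in idx]
    scores = {
        name: iou(fired, [labels[i] for i in idx])
        for name, labels in concepts.items()
    }
    # Iterate names in sorted order so ties are broken deterministically.
    return max(sorted(scores), key=scores.get)


def bootstrap_prediction_set(neuron, concepts, n_boot=200, alpha=0.1, seed=0):
    """Resample the probing set with replacement and collect concept votes.

    Returns (chosen, freqs): the most frequent concepts whose combined vote
    share reaches 1 - alpha, and the vote share of every concept that won
    at least one resample. Illustrative heuristic only.
    """
    rng = random.Random(seed)
    size = len(neuron)
    votes = Counter()
    for _ in range(n_boot):
        idx = [rng.randrange(size) for _ in range(size)]
        votes[best_concept(neuron, concepts, idx)] += 1
    freqs = {name: count / n_boot for name, count in votes.items()}
    chosen, mass = [], 0.0
    for name, share in sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0])):
        chosen.append(name)
        mass += share
        if mass >= 1 - alpha - 1e-9:  # small tolerance for float rounding
            break
    return chosen, freqs


# Synthetic probing data (hypothetical): the binarized neuron agrees with
# the "cat" label on roughly 90% of samples; "dog" is independent noise.
data_rng = random.Random(1)
cat = [data_rng.random() < 0.3 for _ in range(300)]
dog = [data_rng.random() < 0.3 for _ in range(300)]
neuron = [c if data_rng.random() < 0.9 else not c for c in cat]

chosen, freqs = bootstrap_prediction_set(neuron, {"cat": cat, "dog": dog})
print(chosen, freqs)
```

A vote share close to 1 for a single concept suggests the identification is stable under resampling of the probing data; a share spread over several concepts suggests the explanation depends on which samples happened to be drawn.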
Related papers
- An Orthogonal Learner for Individualized Outcomes in Markov Decision Processes [55.93922317950527]
We develop a novel meta-learner called DRQ-learner. Our DRQ-learner is applicable to settings with both discrete and continuous state spaces.
arXiv Detail & Related papers (2025-09-30T15:49:29Z) - The Features at Convergence Theorem: a first-principles alternative to the Neural Feature Ansatz for how networks learn representations [16.67524623230699]
A leading approach is the Neural Feature Ansatz (NFA), a conjectured mechanism for how feature learning occurs. Although the NFA is empirically validated, it is an educated guess and lacks a theoretical basis. We take a first-principles approach to understanding why this observation holds, and when it does not.
arXiv Detail & Related papers (2025-07-08T03:52:48Z) - Did I Faithfully Say What I Thought? Bridging the Gap Between Neural Activity and Self-Explanations in Large Language Models [14.636536897933786]
Large Language Models (LLMs) can generate plausible free-text self-explanations to justify their answers. This paper proposes NeuroFaith, a flexible framework that measures the faithfulness of LLM free-text self-explanations.
arXiv Detail & Related papers (2025-06-10T22:30:53Z) - Statistical tuning of artificial neural network [0.0]
This study introduces methods to enhance the understanding of neural networks, focusing specifically on models with a single hidden layer.
We propose statistical tests to assess the significance of input neurons and introduce algorithms for dimensionality reduction.
This research advances the field of Explainable Artificial Intelligence by presenting robust statistical frameworks for interpreting neural networks.
arXiv Detail & Related papers (2024-09-24T19:47:03Z) - Utility-Probability Duality of Neural Networks [4.871730595406078]
We propose an alternative utility-based explanation to the standard supervised learning procedure in deep learning.
The basic idea is to interpret the learned neural network not as a probability model but as an ordinal utility function.
We show that for all neural networks with softmax outputs, the SGD learning dynamic of maximum likelihood estimation can be seen as an iteration process.
arXiv Detail & Related papers (2023-05-24T08:09:07Z) - The Unreasonable Effectiveness of Deep Evidential Regression [72.30888739450343]
A new approach with uncertainty-aware regression-based neural networks (NNs) shows promise over traditional deterministic methods and typical Bayesian NNs.
We detail the theoretical shortcomings and analyze the performance on synthetic and real-world data sets, showing that Deep Evidential Regression is a heuristic rather than an exact uncertainty quantification.
arXiv Detail & Related papers (2022-05-20T10:10:32Z) - NUQ: Nonparametric Uncertainty Quantification for Deterministic Neural Networks [151.03112356092575]
We show the principled way to measure the uncertainty of predictions for a classifier based on Nadaraya-Watson's nonparametric estimate of the conditional label distribution.
We demonstrate the strong performance of the method in uncertainty estimation tasks on a variety of real-world image datasets.
arXiv Detail & Related papers (2022-02-07T12:30:45Z) - The Causal Neural Connection: Expressiveness, Learnability, and Inference [125.57815987218756]
An object called structural causal model (SCM) represents a collection of mechanisms and sources of random variation of the system under investigation.
In this paper, we show that the causal hierarchy theorem (Thm. 1, Bareinboim et al., 2020) still holds for neural models.
We introduce a special type of SCM called a neural causal model (NCM), and formalize a new type of inductive bias to encode structural constraints necessary for performing causal inferences.
arXiv Detail & Related papers (2021-07-02T01:55:18Z) - Evidential Turing Processes [11.021440340896786]
We introduce an original combination of evidential deep learning, neural processes, and neural Turing machines.
We observe our method on three image classification benchmarks and two neural net architectures.
arXiv Detail & Related papers (2021-06-02T15:09:20Z) - Neuro-symbolic Neurodegenerative Disease Modeling as Probabilistic Programmed Deep Kernels [93.58854458951431]
We present a probabilistic programmed deep kernel learning approach to personalized, predictive modeling of neurodegenerative diseases.
Our analysis considers a spectrum of neural and symbolic machine learning approaches.
We run evaluations on the problem of Alzheimer's disease prediction, yielding results that surpass deep learning.
arXiv Detail & Related papers (2020-09-16T15:16:03Z) - Neuro-symbolic Architectures for Context Understanding [59.899606495602406]
We propose the use of hybrid AI methodology as a framework for combining the strengths of data-driven and knowledge-driven approaches.
Specifically, we inherit the concept of neuro-symbolism as a way of using knowledge-bases to guide the learning progress of deep neural networks.
arXiv Detail & Related papers (2020-03-09T15:04:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.