Mechanistic Interpretability Needs Philosophy
- URL: http://arxiv.org/abs/2506.18852v1
- Date: Mon, 23 Jun 2025 17:13:30 GMT
- Title: Mechanistic Interpretability Needs Philosophy
- Authors: Iwan Williams, Ninell Oldenburg, Ruchira Dhar, Joshua Hatherley, Constanza Fierro, Nina Rajcic, Sandrine R. Schiller, Filippos Stamatiou, Anders Søgaard
- Abstract summary: We argue that mechanistic interpretability needs philosophy: not as an afterthought, but as an ongoing partner in clarifying its concepts. This position paper illustrates the value philosophy can add to MI research, and outlines a path toward deeper interdisciplinary dialogue.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mechanistic interpretability (MI) aims to explain how neural networks work by uncovering their underlying causal mechanisms. As the field grows in influence, it is increasingly important to examine not just models themselves, but the assumptions, concepts and explanatory strategies implicit in MI research. We argue that mechanistic interpretability needs philosophy: not as an afterthought, but as an ongoing partner in clarifying its concepts, refining its methods, and assessing the epistemic and ethical stakes of interpreting AI systems. Taking three open problems from the MI literature as examples, this position paper illustrates the value philosophy can add to MI research, and outlines a path toward deeper interdisciplinary dialogue.
Related papers
- Evaluating Explanations: An Explanatory Virtues Framework for Mechanistic Interpretability -- The Strange Science Part I.ii [0.0]
Mechanistic Interpretability aims to understand neural networks through causal explanations. Progress has been limited by the lack of a universal approach to evaluating explanations. We introduce a pluralist Explanatory Virtues Framework to systematically evaluate and improve explanations in MI.
arXiv Detail & Related papers (2025-05-02T16:18:40Z) - A Mathematical Philosophy of Explanations in Mechanistic Interpretability -- The Strange Science Part I.i [0.0]
We argue that Mechanistic Interpretability research is a principled approach to understanding models. We show that Explanatory Faithfulness, an assessment of how well an explanation fits a model, is well-defined.
arXiv Detail & Related papers (2025-05-01T19:08:34Z) - Open Problems in Mechanistic Interpretability [61.44773053835185]
Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities. Despite recent progress toward these goals, there are many open problems in the field that require solutions.
arXiv Detail & Related papers (2025-01-27T20:57:18Z) - A Mechanistic Explanatory Strategy for XAI [0.0]
This paper outlines a mechanistic strategy for explaining the functional organization of deep learning systems. The findings suggest that pursuing mechanistic explanations can uncover elements that traditional explainability techniques may overlook.
arXiv Detail & Related papers (2024-11-02T18:30:32Z) - A Theoretical Survey on Foundation Models [48.2313835471321]
This survey reviews interpretable methods that follow established theoretical principles and have been successfully applied to black-box foundation models.
The methods are deeply rooted in machine learning theory, covering the analysis of generalization performance, expressive capability, and dynamic behavior.
They provide a thorough interpretation of the entire workflow of FMs, ranging from the inference capability and training dynamics to their ethical implications.
arXiv Detail & Related papers (2024-10-15T09:48:03Z) - Reasoning Circuits in Language Models: A Mechanistic Interpretation of Syllogistic Inference [13.59675117792588]
Recent studies on language models (LMs) have sparked a debate on whether they can learn systematic inferential principles. This paper presents a mechanistic interpretation of syllogistic inference.
arXiv Detail & Related papers (2024-08-16T07:47:39Z) - The Cognitive Revolution in Interpretability: From Explaining Behavior to Interpreting Representations and Algorithms [3.3653074379567096]
Mechanistic interpretability (MI) has emerged as a distinct research area studying the features and implicit algorithms learned by foundation models such as large language models.
We argue that current methods are ripe to facilitate a transition in deep learning interpretation echoing the "cognitive revolution" in 20th-century psychology.
We propose a taxonomy mirroring key parallels in computational neuroscience to describe two broad categories of MI research.
arXiv Detail & Related papers (2024-08-11T20:50:16Z) - Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals [82.68757839524677]
Interpretability research aims to bridge the gap between empirical success and our scientific understanding of large language models (LLMs).
We propose a formulation of competition of mechanisms, which focuses on the interplay of multiple mechanisms instead of individual mechanisms.
Our findings show traces of the mechanisms and their competition across various model components and reveal attention positions that effectively control the strength of certain mechanisms.
arXiv Detail & Related papers (2024-02-18T17:26:51Z) - Interpreting Pretrained Language Models via Concept Bottlenecks [55.47515772358389]
Pretrained language models (PLMs) have made significant strides in various natural language processing tasks.
The lack of interpretability due to their "black-box" nature poses challenges for responsible implementation.
We propose a novel approach to interpreting PLMs by employing high-level, meaningful concepts that are easily understandable for humans.
arXiv Detail & Related papers (2023-11-08T20:41:18Z) - DiPlomat: A Dialogue Dataset for Situated Pragmatic Reasoning [89.92601337474954]
Pragmatic reasoning plays a pivotal role in deciphering implicit meanings that frequently arise in real-life conversations.
We introduce a novel challenge, DiPlomat, aiming at benchmarking machines' capabilities on pragmatic reasoning and situated conversational understanding.
arXiv Detail & Related papers (2023-06-15T10:41:23Z) - Rethinking Explainability as a Dialogue: A Practitioner's Perspective [57.87089539718344]
We ask doctors, healthcare professionals, and policymakers about their needs and desires for explanations.
Our study indicates that decision-makers would strongly prefer interactive explanations in the form of natural language dialogues.
Considering these needs, we outline a set of five principles researchers should follow when designing interactive explanations.
arXiv Detail & Related papers (2022-02-03T22:17:21Z) - A general framework for scientifically inspired explanations in AI [76.48625630211943]
We instantiate the concept of structure of scientific explanation as the theoretical underpinning for a general framework in which explanations for AI systems can be implemented.
This framework aims to provide the tools to build a "mental-model" of any AI system so that the interaction with the user can provide information on demand and be closer to the nature of human-made explanations.
arXiv Detail & Related papers (2020-03-02T10:32:21Z)