The Quest for the Right Mediator: Surveying Mechanistic Interpretability Through the Lens of Causal Mediation Analysis
- URL: http://arxiv.org/abs/2408.01416v3
- Date: Mon, 29 Sep 2025 21:25:09 GMT
- Title: The Quest for the Right Mediator: Surveying Mechanistic Interpretability Through the Lens of Causal Mediation Analysis
- Authors: Aaron Mueller, Jannik Brinkmann, Millicent Li, Samuel Marks, Koyena Pal, Nikhil Prakash, Can Rager, Aruna Sankaranarayanan, Arnab Sen Sharma, Jiuding Sun, Eric Todd, David Bau, Yonatan Belinkov
- Abstract summary: We propose a perspective on interpretability research grounded in causal mediation analysis. We describe the history and current state of interpretability, taxonomized according to the types of causal units (mediators) employed. We discuss the pros and cons of each mediator, providing insights as to when particular kinds of mediators and search methods are most appropriate.
- Score: 51.046457649151336
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Interpretability provides a toolset for understanding how and why neural networks behave in certain ways. However, there is little unity in the field: most studies employ ad-hoc evaluations and do not share theoretical foundations, making it difficult to measure progress and compare the pros and cons of different techniques. Furthermore, while mechanistic understanding is frequently discussed, the basic causal units underlying these mechanisms are often not explicitly defined. In this article, we propose a perspective on interpretability research grounded in causal mediation analysis. Specifically, we describe the history and current state of interpretability taxonomized according to the types of causal units (mediators) employed, as well as methods used to search over mediators. We discuss the pros and cons of each mediator, providing insights as to when particular kinds of mediators and search methods are most appropriate. We argue that this framing yields a more cohesive narrative of the field and helps researchers select appropriate methods based on their research objective. Our analysis yields actionable recommendations for future work, including the discovery of new mediators and the development of standardized evaluations tailored to these goals.
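In causal mediation terms, an interpretability experiment fixes a mediator (a neuron, attention head, layer, or sparse feature), intervenes on it, and measures the effect on the model's output. A minimal activation-patching sketch of that intervention follows; the toy model, the choice of mediator, and the effect metric are illustrative assumptions, not code from the paper.

```python
# A minimal activation-patching sketch of causal mediation analysis.
# Assumptions (not from the paper): a toy MLP, one hidden layer as the
# mediator, and a normalized output difference as the effect metric.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(8, 16), nn.ReLU(),
    nn.Linear(16, 16), nn.ReLU(),  # <- mediator: output of this ReLU
    nn.Linear(16, 2),
)
mediator = model[3]

clean = torch.randn(1, 8)                  # input eliciting the behavior
corrupt = clean + 0.5 * torch.randn(1, 8)  # counterfactual input

# Step 1: cache the mediator's activation on the clean run.
# (dict.update returns None, so the hook leaves the output unchanged.)
cache = {}
handle = mediator.register_forward_hook(
    lambda mod, inp, out: cache.update(act=out.detach()))
clean_out = model(clean)
handle.remove()

# Step 2: run the corrupted input, patching in the clean activation.
# (Returning a tensor from a forward hook replaces the module's output.)
handle = mediator.register_forward_hook(lambda mod, inp, out: cache["act"])
patched_out = model(corrupt)
handle.remove()

corrupt_out = model(corrupt)

# Indirect effect: how much of the clean behavior the mediator restores.
ie = ((patched_out - corrupt_out).norm()
      / (clean_out - corrupt_out).norm()).item()
print(f"normalized indirect effect of mediator: {ie:.3f}")
```

Searching over mediators then amounts to repeating this intervention for each candidate unit (every neuron, head, or feature) and ranking candidates by indirect effect, which is where the survey's taxonomy of mediators and search methods applies.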
Related papers
- METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark [48.78602579128459]
We introduce METER, a unified benchmark for interpretable forgery detection spanning images, videos, audio, and audio-visual content. Our dataset comprises four tracks, each requiring not only real-vs-fake classification but also evidence-chain-based explanations.
arXiv Detail & Related papers (2025-07-22T03:42:51Z)
- Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning (v1) [66.51642638034822]
Reasoning is central to human intelligence, enabling structured problem-solving across diverse tasks.
Recent advances in large language models (LLMs) have greatly enhanced their reasoning abilities in arithmetic, commonsense, and symbolic domains.
This paper offers a concise yet insightful overview of reasoning techniques in both textual and multimodal LLMs.
arXiv Detail & Related papers (2025-04-04T04:04:56Z)
- A Study on Leveraging Search and Self-Feedback for Agent Reasoning [16.256600534996686]
We investigate how search and a model's self-feedback can be leveraged for reasoning tasks. First, we explore differences between ground-truth feedback and self-feedback during search for math reasoning.
arXiv Detail & Related papers (2025-02-17T18:12:36Z)
- Objective Metrics for Human-Subjects Evaluation in Explainable Reinforcement Learning [0.47355466227925036]
Explanation is a fundamentally human process, and understanding the goal and audience of the explanation is vital. Existing work on explainable reinforcement learning (XRL) routinely does not consult humans in its evaluations. This paper calls on researchers to use objective human metrics for explanation evaluations, based on observable and actionable behaviour.
arXiv Detail & Related papers (2025-01-31T16:12:23Z)
- CausalEval: Towards Better Causal Reasoning in Language Models [16.55801836321059]
Causal reasoning (CR) is a crucial aspect of intelligence, essential for problem-solving, decision-making, and understanding the world. While language models (LMs) can generate rationales for their outputs, their ability to reliably perform causal reasoning remains uncertain. We introduce CausalEval, a review of research aimed at enhancing LMs for causal reasoning.
arXiv Detail & Related papers (2024-10-22T04:18:19Z)
- Leveraging Ontologies to Document Bias in Data [1.0635248457021496]
Doc-BiasO is a resource that aims to create an integrated vocabulary of biases defined in the fair-ML literature and their measures.
Our main objective is to contribute towards clarifying existing terminology on bias research as it rapidly expands to all areas of AI.
arXiv Detail & Related papers (2024-06-29T18:41:07Z)
- When is an Embedding Model More Promising than Another? [33.540506562970776]
Embedders play a central role in machine learning, projecting any object into numerical representations that can be leveraged to perform various downstream tasks.
The evaluation of embedding models typically depends on domain-specific empirical approaches.
We present a unified approach to evaluate embedders, drawing upon the concepts of sufficiency and informativeness.
arXiv Detail & Related papers (2024-06-11T18:13:46Z)
- Distributional Associations vs In-Context Reasoning: A Study of Feed-forward and Attention Layers [49.80959223722325]
We study the distinction between feed-forward and attention layers in large language models.
We find that feed-forward layers tend to learn simple distributional associations such as bigrams, while attention layers focus on in-context reasoning.
arXiv Detail & Related papers (2024-06-05T08:51:08Z)
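That division of labor suggests a simple ablation probe one could run: zero out all feed-forward (MLP) sublayers or all attention sublayers and compare the damage on a prompt that can only be completed by copying from context. The sketch below is a hedged illustration assuming a GPT-2 checkpoint loaded via the Hugging Face transformers library; the prompt and the all-or-nothing ablation are illustrative choices, not that paper's experimental setup.

```python
# Illustrative sublayer-ablation probe (assumes the Hugging Face
# transformers library and the public "gpt2" checkpoint; this is a
# sketch of the idea, not the cited paper's setup).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def lm_loss(text, ablate=None):
    """Next-token loss on text, optionally zeroing MLP or attention sublayers."""
    handles = []
    if ablate == "mlp":
        # GPT2MLP returns a tensor; returning zeros replaces its output.
        handles = [blk.mlp.register_forward_hook(
                       lambda m, i, o: torch.zeros_like(o))
                   for blk in model.transformer.h]
    elif ablate == "attn":
        # GPT2Attention returns a tuple; zero only the hidden states.
        handles = [blk.attn.register_forward_hook(
                       lambda m, i, o: (torch.zeros_like(o[0]),) + o[1:])
                   for blk in model.transformer.h]
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss.item()
    for h in handles:
        h.remove()
    return loss

# The repeated name can only be predicted by copying from context,
# so attention ablation should hurt it far more than MLP ablation.
prompt = "Mr. Dursley met Mrs. Dursley. Later, Mr. Dursley greeted Mrs. Dursley"
for mode in (None, "mlp", "attn"):
    print(f"{mode or 'intact':>6}: loss = {lm_loss(prompt, mode):.2f}")
```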
- Towards a Unified Framework for Evaluating Explanations [0.6138671548064356]
We argue that explanations serve as mediators between models and stakeholders, whether for intrinsically interpretable models or opaque black-box models.
We illustrate these criteria, as well as specific evaluation methods, using examples from an ongoing study of an interpretable neural network for predicting a particular learner behavior.
arXiv Detail & Related papers (2024-05-22T21:49:28Z)
- A review on data-driven constitutive laws for solids [0.0]
This review article highlights state-of-the-art data-driven techniques to discover, encode, surrogate, or emulate constitutive laws.
Our objective is to provide an organized taxonomy to a large spectrum of methodologies developed in the past decades.
arXiv Detail & Related papers (2024-05-06T17:33:58Z)
- Towards Non-Adversarial Algorithmic Recourse [20.819764720587646]
It has been argued that adversarial examples, unlike counterfactual explanations, are uniquely characterized by inducing a misclassification relative to the ground truth.
We introduce non-adversarial algorithmic recourse and outline why in high-stakes situations, it is imperative to obtain counterfactual explanations that do not exhibit adversarial characteristics.
arXiv Detail & Related papers (2024-03-15T14:18:21Z)
- A Survey on Interpretable Cross-modal Reasoning [64.37362731950843]
Cross-modal reasoning (CMR) has emerged as a pivotal area with applications spanning from multimedia analysis to healthcare diagnostics.
This survey delves into the realm of interpretable cross-modal reasoning (I-CMR), presenting a comprehensive overview of the typical methods with a three-level taxonomy.
arXiv Detail & Related papers (2023-09-05T05:06:48Z)
- Fairness meets Cross-Domain Learning: a new perspective on Models and Metrics [80.07271410743806]
We study the relationship between cross-domain learning (CD) and model fairness.
We introduce a benchmark on face and medical images spanning several demographic groups as well as classification and localization tasks.
Our study covers 14 CD approaches alongside three state-of-the-art fairness algorithms and shows how the former can outperform the latter.
arXiv Detail & Related papers (2023-03-25T09:34:05Z)
- Investigating the Role of Centering Theory in the Context of Neural Coreference Resolution Systems [71.57556446474486]
We investigate the connection between centering theory and modern coreference resolution systems.
We show that high-quality neural coreference resolvers may not benefit much from explicitly modeling centering ideas.
We formulate a version of CT that also models recency and show that it captures coreference information better than vanilla CT.
arXiv Detail & Related papers (2022-10-26T12:55:26Z)
- Descriptive vs. inferential community detection in networks: pitfalls, myths, and half-truths [0.0]
We argue that inferential methods are typically aligned with clearer scientific questions, yield more robust results, and should in many cases be preferred.
We attempt to dispel some myths and half-truths often believed when community detection is employed in practice, in an effort to improve both the use of such methods as well as the interpretation of their results.
arXiv Detail & Related papers (2021-11-30T23:57:51Z)
- Author Clustering and Topic Estimation for Short Texts [69.54017251622211]
We propose a novel model that expands on Latent Dirichlet Allocation by modeling strong dependence among the words in the same document.
We also simultaneously cluster users, removing the need for post-hoc cluster estimation.
Our method performs as well as or better than traditional approaches to problems arising in short texts.
arXiv Detail & Related papers (2021-06-15T20:55:55Z)
- Prompting Contrastive Explanations for Commonsense Reasoning Tasks [74.7346558082693]
Large pretrained language models (PLMs) can achieve near-human performance on commonsense reasoning tasks.
We show how to use these same models to generate human-interpretable evidence.
arXiv Detail & Related papers (2021-06-12T17:06:13Z)
- Prediction or Comparison: Toward Interpretable Qualitative Reasoning [16.02199526395448]
Current approaches either use semantics to transform natural language inputs into logical expressions or use a "black-box" model to solve them in one step.
In this work, we categorize qualitative reasoning tasks into two types: prediction and comparison.
In particular, we adopt neural network modules trained in an end-to-end manner to simulate the two reasoning processes.
arXiv Detail & Related papers (2021-06-04T10:27:55Z)
- Individual Explanations in Machine Learning Models: A Survey for Practitioners [69.02688684221265]
The use of sophisticated statistical models that influence decisions in domains of high societal relevance is on the rise.
Many governments, institutions, and companies are reluctant to adopt them, as their outputs are often difficult to explain in human-interpretable ways.
Recently, the academic literature has proposed a substantial amount of methods for providing interpretable explanations to machine learning models.
arXiv Detail & Related papers (2021-04-09T01:46:34Z)
- A Survey on Causal Inference [64.45536158710014]
Causal inference is a critical research topic across many domains, such as statistics, computer science, education, public policy and economics.
A variety of causal effect estimation methods for observational data have emerged.
arXiv Detail & Related papers (2020-02-05T21:35:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.