Mechanistic Interpretability as Statistical Estimation: A Variance Analysis of EAP-IG
- URL: http://arxiv.org/abs/2510.00845v2
- Date: Thu, 02 Oct 2025 11:16:27 GMT
- Title: Mechanistic Interpretability as Statistical Estimation: A Variance Analysis of EAP-IG
- Authors: Maxime Méloux, François Portet, Maxime Peyrard
- Abstract summary: We argue that interpretability methods, such as circuit discovery, should be viewed as statistical estimators. We present a systematic stability analysis of a state-of-the-art circuit discovery method: EAP-IG.
- Score: 10.620784202716404
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The development of trustworthy artificial intelligence requires moving beyond black-box performance metrics toward an understanding of models' internal computations. Mechanistic Interpretability (MI) aims to meet this need by identifying the algorithmic mechanisms underlying model behaviors. Yet, the scientific rigor of MI critically depends on the reliability of its findings. In this work, we argue that interpretability methods, such as circuit discovery, should be viewed as statistical estimators, subject to questions of variance and robustness. To illustrate this statistical framing, we present a systematic stability analysis of a state-of-the-art circuit discovery method: EAP-IG. We evaluate its variance and robustness through a comprehensive suite of controlled perturbations, including input resampling, prompt paraphrasing, hyperparameter variation, and injected noise within the causal analysis itself. Across a diverse set of models and tasks, our results demonstrate that EAP-IG exhibits high structural variance and sensitivity to hyperparameters, questioning the stability of its findings. Based on these results, we offer a set of best-practice recommendations for the field, advocating for the routine reporting of stability metrics to promote a more rigorous and statistically grounded science of interpretability.
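The "structural variance" the abstract refers to can be made concrete: run circuit discovery several times under perturbations (resampled inputs, paraphrased prompts, varied hyperparameters) and measure how much the discovered edge sets disagree. The sketch below illustrates one such stability metric, mean pairwise Jaccard distance between edge sets; the edge names and the aggregation choice are illustrative assumptions, not the paper's exact protocol.

```python
import itertools

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b| between two edge sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def structural_variance(circuits: list) -> float:
    """Mean pairwise Jaccard *distance* across circuits discovered
    in perturbed runs: 0.0 means perfectly stable, 1.0 means disjoint."""
    pairs = list(itertools.combinations(circuits, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - jaccard(a, b) for a, b in pairs) / len(pairs)

# Toy example: edge sets from three hypothetical discovery runs
# (component names are made up for illustration).
runs = [
    {("mlp0", "attn1"), ("attn1", "logits")},
    {("mlp0", "attn1"), ("mlp2", "logits")},
    {("attn1", "logits"), ("mlp2", "logits")},
]
print(round(structural_variance(runs), 3))  # high disagreement despite pairwise overlap
```

Reporting a number like this alongside a discovered circuit is exactly the kind of routine stability reporting the abstract advocates.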
Related papers
- Learning Complex Physical Regimes via Coverage-oriented Uncertainty Quantification: An application to the Critical Heat Flux [0.0]
Uncertainty quantification (UQ) should not be viewed as a safety assessment, but as a support to the learning task itself. We focus on the Critical Heat Flux benchmark and dataset presented by the OECD/NEA Expert Group on Reactor Systems Multi-Physics. We show that while post-hoc methods ensure statistical calibration, coverage-oriented learning effectively reshapes the model's representation to match the complex physical regimes.
arXiv Detail & Related papers (2026-02-25T09:04:15Z) - STAR : Bridging Statistical and Agentic Reasoning for Large Model Performance Prediction [78.0692157478247]
We propose STAR, a framework that bridges data-driven STatistical expectations with knowledge-driven Agentic Reasoning. We show that STAR consistently outperforms all baselines on both score-based and rank-based metrics.
arXiv Detail & Related papers (2026-02-12T16:30:07Z) - Equivariant Evidential Deep Learning for Interatomic Potentials [55.6997213490859]
Uncertainty quantification is critical for assessing the reliability of machine learning interatomic potentials in molecular dynamics simulations. Existing UQ approaches for MLIPs are often limited by high computational cost or suboptimal performance. We propose Equivariant Evidential Deep Learning for Interatomic Potentials (e²IP), a backbone-agnostic framework that models atomic forces and their uncertainty jointly.
arXiv Detail & Related papers (2026-02-11T02:00:25Z) - Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors [61.92704516732144]
We show that the most robust features for correctness prediction are those that play a distinctive causal role in the model's behavior. We propose two methods that leverage causal mechanisms to predict the correctness of model outputs.
arXiv Detail & Related papers (2025-05-17T00:31:39Z) - Breach in the Shield: Unveiling the Vulnerabilities of Large Language Models [13.216398753024182]
Large Language Models (LLMs) and Vision-Language Models (VLMs) have achieved impressive performance across a wide range of tasks. In this study, we seek to pinpoint the sources of this fragility by identifying parameters and input dimensions that are susceptible to such perturbations. We propose a stability measure called FI (First-order local Influence), which is rooted in information geometry and quantifies the sensitivity of individual parameter and input dimensions.
arXiv Detail & Related papers (2025-03-28T16:23:59Z) - Predictability Analysis of Regression Problems via Conditional Entropy Estimations [1.8913544072080544]
Conditional entropy estimators are developed to assess predictability in regression problems.
Experiments on synthesized and real-world datasets demonstrate the robustness and utility of these estimators.
arXiv Detail & Related papers (2024-06-06T07:59:19Z) - FUSE: Fast Unified Simulation and Estimation for PDEs [11.991297011923004]
We argue that solving both problems within the same framework can lead to consistent gains in accuracy and robustness.
We present the capabilities of the proposed methodology for predicting continuous and discrete biomarkers in full-body haemodynamics simulations.
arXiv Detail & Related papers (2024-05-23T13:37:26Z) - The Risk of Federated Learning to Skew Fine-Tuning Features and Underperform Out-of-Distribution Robustness [50.52507648690234]
Federated learning has the risk of skewing fine-tuning features and compromising the robustness of the model.
We introduce three robustness indicators and conduct experiments across diverse robust datasets.
Our approach markedly enhances the robustness across diverse scenarios, encompassing various parameter-efficient fine-tuning methods.
arXiv Detail & Related papers (2024-01-25T09:18:51Z) - Simulation-based Inference for Cardiovascular Models [43.55219268578912]
We use simulation-based inference to solve the inverse problem of mapping waveforms back to plausible physiological parameters. We perform an in-silico uncertainty analysis of five biomarkers of clinical interest. We study the gap between in-vivo and in-silico with the MIMIC-III waveform database.
arXiv Detail & Related papers (2023-07-26T02:34:57Z) - MAntRA: A framework for model agnostic reliability analysis [0.0]
We propose a novel model-agnostic, data-driven framework for time-dependent reliability analysis.
The proposed approach combines interpretable machine learning, Bayesian statistics, and dynamic equation identification.
Results indicate the possible application of the proposed approach for reliability analysis of in-situ and heritage structures from on-site measurements.
arXiv Detail & Related papers (2022-12-13T00:57:09Z) - Differential privacy and robust statistics in high dimensions [49.50869296871643]
High-dimensional Propose-Test-Release (HPTR) builds upon three crucial components: the exponential mechanism, robust statistics, and the Propose-Test-Release mechanism.
We show that HPTR nearly achieves the optimal sample complexity under several scenarios studied in the literature.
arXiv Detail & Related papers (2021-11-12T06:36:40Z) - Latent Causal Invariant Model [128.7508609492542]
Current supervised learning can learn spurious correlation during the data-fitting process.
We propose a Latent Causal Invariance Model (LaCIM) which pursues causal prediction.
arXiv Detail & Related papers (2020-11-04T10:00:27Z) - Uncertainty Quantification in Extreme Learning Machine: Analytical
Developments, Variance Estimates and Confidence Intervals [0.0]
Uncertainty quantification is crucial to assess prediction quality of a machine learning model.
Most methods proposed in the literature make strong assumptions on the data, ignore the randomness of input weights or neglect the bias contribution in confidence interval estimations.
This paper presents novel estimations that overcome these constraints and improve the understanding of ELM variability.
arXiv Detail & Related papers (2020-11-03T13:45:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.