Evaluating Neuron Explanations: A Unified Framework with Sanity Checks
- URL: http://arxiv.org/abs/2506.05774v1
- Date: Fri, 06 Jun 2025 06:09:47 GMT
- Title: Evaluating Neuron Explanations: A Unified Framework with Sanity Checks
- Authors: Tuomas Oikarinen, Ge Yan, Tsui-Wei Weng
- Abstract summary: In this work we unify many existing explanation evaluation methods under one mathematical framework. We show that many commonly used metrics fail sanity checks and do not change their score after massive changes to the concept labels. Based on our results, we propose guidelines that future evaluations should follow and identify a set of reliable evaluation metrics.
- Score: 15.838061203274897
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding the function of individual units in a neural network is an important building block for mechanistic interpretability. This is often done by generating a simple text explanation of the behavior of individual neurons or units. For these explanations to be useful, we must understand how reliable and truthful they are. In this work we unify many existing explanation evaluation methods under one mathematical framework. This allows us to compare existing evaluation metrics, understand the evaluation pipeline with increased clarity and apply existing statistical methods on the evaluation. In addition, we propose two simple sanity checks on the evaluation metrics and show that many commonly used metrics fail these tests and do not change their score after massive changes to the concept labels. Based on our experimental and theoretical results, we propose guidelines that future evaluations should follow and identify a set of reliable evaluation metrics.
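As a concrete illustration of the proposed sanity checks, the sketch below corrupts a large fraction of the concept labels and compares how two toy metrics react. The metrics, the synthetic neuron, and the corruption scheme are illustrative assumptions, not the paper's exact protocol.

```python
# Sanity-check sketch: a trustworthy evaluation metric should change
# substantially after massive corruption of the concept labels.
import numpy as np

rng = np.random.default_rng(0)

n = 10_000
concept = rng.random(n) < 0.1                       # binary concept labels
acts = concept + rng.normal(0, 0.3, n)              # concept-aligned neuron

def recall_like(acts, labels, top_k=1000):
    """Fraction of top-activating inputs that carry the concept label.
    Metrics of this shape only look at the neuron's top inputs."""
    return labels[np.argsort(acts)[-top_k:]].mean()

def correlation(acts, labels):
    return np.corrcoef(acts, labels.astype(float))[0, 1]

# Massive label change: flip half of the negative labels to positive.
corrupted = concept.copy()
neg = np.flatnonzero(~concept)
corrupted[rng.choice(neg, size=len(neg) // 2, replace=False)] = True

for name, metric in [("recall-like", recall_like), ("correlation", correlation)]:
    print(f"{name}: clean={metric(acts, concept):.3f} "
          f"corrupted={metric(acts, corrupted):.3f}")
# A metric whose score barely moves after corrupting nearly half of all
# labels fails this sanity check.
```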
Related papers
- Evaluating SAE interpretability without explanations [0.7234862895932991]
We adapt existing methods to assess the interpretability of sparse coders, with the advantage that they do not require generating natural language explanations as an intermediate step. We compare the scores produced by our interpretability metrics with human evaluations across similar tasks and varying setups, offering suggestions for the community on improving the evaluation of these techniques.
arXiv Detail & Related papers (2025-07-11T10:31:53Z)
- Classification Metrics for Image Explanations: Towards Building Reliable XAI-Evaluations [0.24578723416255752]
Saliency methods provide (super-)pixelwise feature attribution scores for input images.
New evaluation metrics for saliency methods are developed and common saliency methods are benchmarked on ImageNet.
A scheme for reliability evaluation of such metrics is proposed that is based on concepts from psychometric testing.
arXiv Detail & Related papers (2024-06-07T16:37:50Z)
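A minimal sketch of the classification-metric view from the entry above: score a saliency map as a pixelwise classifier against a ground-truth relevance mask. The rank-based AUC and the synthetic maps are assumptions for illustration, not the paper's benchmark setup.

```python
# Treat a saliency map as a pixelwise binary classifier against a
# ground-truth object mask and score it with a threshold-free AUC.
import numpy as np

def saliency_auc(saliency, mask):
    """ROC-AUC of pixel attribution scores vs. a binary relevance mask,
    via the rank-sum (Mann-Whitney) formulation."""
    s, m = saliency.ravel(), mask.ravel().astype(bool)
    ranks = s.argsort().argsort() + 1               # 1-based ranks
    n_pos, n_neg = m.sum(), (~m).sum()
    return (ranks[m].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(0)
mask = np.zeros((32, 32), dtype=bool)
mask[8:24, 8:24] = True                             # toy "object" region
good = mask + rng.normal(0, 0.2, mask.shape)        # object-aligned attribution
bad = rng.normal(0, 1.0, mask.shape)                # uninformative attribution
print(f"aligned map AUC: {saliency_auc(good, mask):.3f}")   # close to 1.0
print(f"random  map AUC: {saliency_auc(bad, mask):.3f}")    # close to 0.5
```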
- The Generalizability of Explanations [0.0]
This work proposes a novel evaluation methodology from the perspective of generalizability.
We employ an Autoencoder to learn the distributions of the generated explanations and observe their learnability as well as the plausibility of the learned distributional features.
arXiv Detail & Related papers (2023-02-23T12:25:59Z)
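The autoencoder-based idea above can be sketched with a linear autoencoder (equivalently, PCA), using held-out reconstruction error as the learnability signal; the data and dimensions below are illustrative assumptions.

```python
# Fit a low-dimensional autoencoder to explanation vectors; low held-out
# reconstruction error suggests the explanation distribution is learnable.
import numpy as np

def heldout_reconstruction_error(X_train, X_test, k=8):
    """Fit a k-dim linear autoencoder (PCA) on X_train and return the
    mean squared reconstruction error on X_test."""
    mu = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - mu, full_matrices=False)
    W = Vt[:k]                                     # encoder/decoder weights
    Z = (X_test - mu) @ W.T                        # encode
    X_hat = Z @ W + mu                             # decode
    return np.mean((X_test - X_hat) ** 2)

rng = np.random.default_rng(0)
# "Structured" explanations: 100-dim vectors near an 8-dim subspace.
basis = rng.normal(size=(8, 100))
structured = rng.normal(size=(600, 8)) @ basis + 0.05 * rng.normal(size=(600, 100))
noise = rng.normal(size=(600, 100))                # unstructured explanations

for name, X in [("structured", structured), ("noise", noise)]:
    err = heldout_reconstruction_error(X[:500], X[500:])
    print(f"{name}: held-out MSE = {err:.3f}")
```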
- On The Coherence of Quantitative Evaluation of Visual Explanations [0.7212939068975619]
Evaluation methods have been proposed to assess the "goodness" of visual explanations.
We study a subset of the ImageNet-1k validation set where we evaluate a number of different commonly-used explanation methods.
Results of our study suggest that there is a lack of coherency on the grading provided by some of the considered evaluation methods.
arXiv Detail & Related papers (2023-02-14T13:41:57Z)
- On the Blind Spots of Model-Based Evaluation Metrics for Text Generation [79.01422521024834]
We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics.
We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores.
Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics.
arXiv Detail & Related papers (2022-12-20T06:24:25Z)
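A toy version of the stress-test methodology above: synthesize controlled errors and check whether the metric's score drops commensurately. The unigram-overlap metric and the perturbations are illustrative stand-ins for the model-based metrics studied in the paper.

```python
# Inject meaning-flipping and fluency-destroying errors and watch the
# metric: an insensitive metric reveals a blind spot.
import random

def overlap_metric(candidate: str, reference: str) -> float:
    """Toy metric: fraction of candidate unigrams found in the reference."""
    c, r = candidate.lower().split(), set(reference.lower().split())
    return sum(w in r for w in c) / max(len(c), 1)

def negate(text: str) -> str:
    # Meaning-flipping error: insert a negation (crude, illustrative).
    return text.replace(" is ", " is not ", 1)

def shuffle_words(text: str, seed=0) -> str:
    # Fluency-destroying error: permute word order.
    words = text.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

reference = "the treaty is signed by both countries in 1990"
candidate = "the treaty is signed by both countries in 1990"
base = overlap_metric(candidate, reference)

for name, perturbed in [("negated", negate(candidate)),
                        ("shuffled", shuffle_words(candidate))]:
    print(f"{name}: {base:.2f} -> {overlap_metric(perturbed, reference):.2f}")
# Negation flips the meaning but barely moves an overlap metric, and
# shuffling does not move it at all: blind spots of this metric.
```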
- Logical Satisfiability of Counterfactuals for Faithful Explanations in NLI [60.142926537264714]
We introduce the methodology of Faithfulness-through-Counterfactuals.
It generates a counterfactual hypothesis based on the logical predicates expressed in the explanation.
It then evaluates if the model's prediction on the counterfactual is consistent with that expressed logic.
arXiv Detail & Related papers (2022-05-25T03:40:59Z)
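A minimal sketch of the consistency check above; the nli_predict stub and the naive predicate negation are hypothetical placeholders for a real NLI model and the paper's counterfactual generation step.

```python
# Faithfulness-through-counterfactuals sketch: negate a predicate cited
# by the explanation and test whether the model's prediction flips in
# the logically expected way.
def nli_predict(premise: str, hypothesis: str) -> str:
    """Hypothetical stand-in for an NLI model; returns one of
    'entailment', 'contradiction', 'neutral'."""
    return "contradiction" if " not " in f" {hypothesis} " else "entailment"

def counterfactual(hypothesis: str, predicate: str) -> str:
    # Crude counterfactual: negate the predicate the explanation relies on.
    return hypothesis.replace(predicate, f"not {predicate}", 1)

premise = "A dog is running in the park."
hypothesis = "An animal is running."
predicate = "running"            # logical predicate cited by the explanation

original = nli_predict(premise, hypothesis)
flipped = nli_predict(premise, counterfactual(hypothesis, predicate))

# If the explanation truly drives the prediction, negating its predicate
# should flip entailment to contradiction.
consistent = original == "entailment" and flipped == "contradiction"
print(f"original={original}, counterfactual={flipped}, consistent={consistent}")
```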
- On Quantitative Evaluations of Counterfactuals [88.42660013773647]
This paper consolidates work on evaluating visual counterfactual examples through an analysis and experiments.
We find that while most metrics behave as intended for sufficiently simple datasets, some fail to tell the difference between good and bad counterfactuals when the complexity increases.
We propose two new metrics, the Label Variation Score and the Oracle score, which are both less vulnerable to such tiny changes.
arXiv Detail & Related papers (2021-10-30T05:00:36Z)
- A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
arXiv Detail & Related papers (2021-03-31T18:28:14Z)
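The resampling analysis above can be sketched as a bootstrap confidence interval for a metric's correlation with human judgments; the synthetic scores below are assumptions, not data from the paper.

```python
# Bootstrap a 95% CI for metric-human correlation by resampling the
# evaluated summaries with replacement.
import numpy as np

rng = np.random.default_rng(0)
n = 50                                     # summaries with human judgments
human = rng.normal(size=n)
metric = 0.5 * human + rng.normal(size=n)  # noisy automatic metric scores

def pearson(x, y):
    return np.corrcoef(x, y)[0, 1]

boot = []
for _ in range(10_000):
    idx = rng.integers(0, n, size=n)       # resample summaries
    boot.append(pearson(human[idx], metric[idx]))

lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"r = {pearson(human, metric):.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
# With only ~50 judged summaries the interval is wide, which is the
# paper's point about uncertainty in metric comparisons.
```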
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
- Evaluations and Methods for Explanation through Robustness Analysis [117.7235152610957]
We establish a novel set of evaluation criteria for such feature-based explanations via robustness analysis.
We obtain new explanations that are loosely necessary and sufficient for a prediction.
We extend the explanation to extract the set of features that would move the current prediction to a target class.
arXiv Detail & Related papers (2020-05-31T05:52:05Z)
- Towards GAN Benchmarks Which Require Generalization [48.075521136623564]
We argue that, for a benchmark to require generalization, estimating the evaluation function must require a large sample from the model.
We turn to neural network divergences (NNDs) which are defined in terms of a neural network trained to distinguish between distributions.
The resulting benchmarks cannot be "won" by training set memorization, while still being perceptually correlated and computable only from samples.
arXiv Detail & Related papers (2020-01-10T20:18:47Z)
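A simplified sketch of an NND-style benchmark from the entry above: a discriminator's held-out accuracy at separating real from generated samples serves as the score. The logistic-regression discriminator and the 2-D toy distributions are stand-ins for the trained neural network divergence.

```python
# NND-style score: train a discriminator on real vs. generated samples
# and report held-out accuracy (~0.5 means indistinguishable).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def nnd_score(real, fake):
    """Held-out accuracy of a real-vs-fake discriminator."""
    X = np.vstack([real, fake])
    y = np.r_[np.ones(len(real)), np.zeros(len(fake))]
    idx = rng.permutation(len(X))
    X, y = X[idx], y[idx]
    half = len(X) // 2
    clf = LogisticRegression().fit(X[:half], y[:half])
    return clf.score(X[half:], y[half:])

real = rng.normal(loc=0.0, size=(2000, 2))
close_model = rng.normal(loc=0.1, size=(2000, 2))   # near the real dist.
far_model = rng.normal(loc=1.0, size=(2000, 2))     # visibly off

print(f"close model: {nnd_score(real, close_model):.2f}")  # around 0.5
print(f"far model:   {nnd_score(real, far_model):.2f}")    # clearly above 0.5
```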