Right for the Wrong Reason: Can Interpretable ML Techniques Detect Spurious Correlations?
- URL: http://arxiv.org/abs/2307.12344v2
- Date: Tue, 8 Aug 2023 14:52:39 GMT
- Title: Right for the Wrong Reason: Can Interpretable ML Techniques Detect Spurious Correlations?
- Authors: Susu Sun, Lisa M. Koch, Christian F. Baumgartner
- Abstract summary: We propose a rigorous evaluation strategy to assess an explanation technique's ability to correctly identify spurious correlations.
We find that the post-hoc technique SHAP, as well as the inherently interpretable Attri-Net provide the best performance.
- Score: 2.7558542803110244
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: While deep neural network models offer unmatched classification performance,
they are prone to learning spurious correlations in the data. Such dependencies
on confounding information can be difficult to detect using performance metrics
if the test data comes from the same distribution as the training data.
Interpretable ML methods such as post-hoc explanations or inherently
interpretable classifiers promise to identify faulty model reasoning. However,
there is mixed evidence as to whether many of these techniques are actually able to
do so. In this paper, we propose a rigorous evaluation strategy to assess an
explanation technique's ability to correctly identify spurious correlations.
Using this strategy, we evaluate five post-hoc explanation techniques and one
inherently interpretable method for their ability to detect three types of
artificially added confounders in a chest x-ray diagnosis task. We find that
the post-hoc technique SHAP, as well as the inherently interpretable Attri-Net
provide the best performance and can be used to reliably identify faulty model
behavior.
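As a concrete illustration of this evaluation strategy, the sketch below is a minimal, hypothetical rendering and not the authors' code: it injects an artificial tag confounder into images, computes a plain input-gradient saliency map as a stand-in for the SHAP and Attri-Net explanations evaluated in the paper, and measures the fraction of attribution mass that falls inside the confounder region. The function names (`add_tag_confounder`, `confounder_attribution_ratio`) are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): quantify how much of a saliency map
# falls on an artificially injected confounder. Plain input gradients are used
# as a stand-in explainer; all names are illustrative.
import torch

def add_tag_confounder(images: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Paste a bright square 'tag' into the corner of each image (a synthetic
    confounder) and return the modified images plus a mask of the tag region."""
    images = images.clone()
    mask = torch.zeros_like(images)
    images[..., :16, :16] = images.max()  # the spurious artifact
    mask[..., :16, :16] = 1.0
    return images, mask

def saliency_map(model: torch.nn.Module, x: torch.Tensor, target: int) -> torch.Tensor:
    """Absolute input-gradient saliency for one class (a simple stand-in explainer)."""
    x = x.clone().requires_grad_(True)
    score = model(x)[:, target].sum()
    (grad,) = torch.autograd.grad(score, x)
    return grad.abs()

def confounder_attribution_ratio(saliency: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Fraction of total attribution mass inside the confounder region, per image.
    A high ratio means the explanation exposes the spurious correlation."""
    inside = (saliency * mask).flatten(1).sum(dim=1)
    total = saliency.flatten(1).sum(dim=1) + 1e-8
    return inside / total
```

Under a protocol of this kind, an explanation technique is judged by how reliably a high ratio flags models trained on the confounded data relative to models trained on clean data.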
Related papers
- DISCO: DISCovering Overfittings as Causal Rules for Text Classification Models [6.369258625916601]
Post-hoc interpretability methods fail to capture the models' decision-making process fully.
Our paper introduces DISCO, a novel method for discovering global, rule-based explanations.
DISCO supports interactive explanations, enabling human inspectors to distinguish spurious causes in the rule-based output.
arXiv Detail & Related papers (2024-11-07T12:12:44Z)
- Pretraining Data Detection for Large Language Models: A Divergence-based Calibration Method [108.56493934296687]
We introduce a divergence-based calibration method, inspired by the divergence-from-randomness concept, to calibrate token probabilities for pretraining data detection.
We have developed a Chinese-language benchmark, PatentMIA, to assess the performance of detection approaches for LLMs on Chinese text.
arXiv Detail & Related papers (2024-09-23T07:55:35Z)
- Measuring and Improving Attentiveness to Partial Inputs with Counterfactuals [91.59906995214209]
We propose a new evaluation method, the Counterfactual Attentiveness Test (CAT).
CAT uses counterfactuals by replacing part of the input with its counterpart from a different example, expecting an attentive model to change its prediction; a minimal illustrative sketch of this swap appears after this list.
We show that GPT3 becomes less attentive with an increased number of demonstrations, while its accuracy on the test data improves.
arXiv Detail & Related papers (2023-11-16T06:27:35Z)
- XAL: EXplainable Active Learning Makes Classifiers Better Low-resource Learners [71.8257151788923]
We propose a novel Explainable Active Learning framework (XAL) for low-resource text classification.
XAL encourages classifiers to justify their inferences and delve into unlabeled data for which they cannot provide reasonable explanations.
Experiments on six datasets show that XAL achieves consistent improvement over 9 strong baselines.
arXiv Detail & Related papers (2023-10-09T08:07:04Z)
- CLIMAX: An exploration of Classifier-Based Contrastive Explanations [5.381004207943597]
We propose a novel post-hoc model XAI technique that provides contrastive explanations justifying the classification of a black box.
Our method, which we refer to as CLIMAX, is based on local classifiers.
We show that we achieve better consistency as compared to baselines such as LIME, BayLIME, and SLIME.
arXiv Detail & Related papers (2023-07-02T22:52:58Z)
- Post hoc Explanations may be Ineffective for Detecting Unknown Spurious Correlation [12.185584875925906]
We investigate whether three types of post hoc model explanations are effective for detecting a model's reliance on spurious signals in the training data.
We design an empirical methodology that uses semi-synthetic datasets along with pre-specified spurious artifacts.
We find that the post hoc explanation methods tested are ineffective when the spurious artifact is unknown at test-time.
arXiv Detail & Related papers (2022-12-09T02:05:39Z)
- Explainer Divergence Scores (EDS): Some Post-Hoc Explanations May be Effective for Detecting Unknown Spurious Correlations [4.223964614888875]
Post-hoc explainers might be ineffective for detecting spurious correlations in Deep Neural Networks (DNNs).
We show there are serious weaknesses with the existing evaluation frameworks for this setting.
We propose a new evaluation methodology, Explainer Divergence Scores (EDS), grounded in an information theory approach to evaluate explainers.
arXiv Detail & Related papers (2022-11-14T15:52:21Z)
- Discriminative Attribution from Counterfactuals [64.94009515033984]
We present a method for neural network interpretability by combining feature attribution with counterfactual explanations.
We show that this method can be used to quantitatively evaluate the performance of feature attribution methods in an objective manner.
arXiv Detail & Related papers (2021-09-28T00:53:34Z)
- Combining Feature and Instance Attribution to Detect Artifacts [62.63504976810927]
We propose methods to facilitate identification of training data artifacts.
We show that this proposed training-feature attribution approach can be used to uncover artifacts in training data.
We execute a small user study to evaluate whether these methods are useful to NLP researchers in practice.
arXiv Detail & Related papers (2021-07-01T09:26:13Z)
- Have We Learned to Explain?: How Interpretability Methods Can Learn to Encode Predictions in their Interpretations [20.441578071446212]
We introduce EVAL-X as a method to quantitatively evaluate interpretations and REAL-X as an amortized explanation method.
We show EVAL-X can detect when predictions are encoded in interpretations and show the advantages of REAL-X through quantitative and radiologist evaluation.
arXiv Detail & Related papers (2021-03-02T17:42:33Z)
- Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior? [97.77183117452235]
We carry out human subject tests to isolate the effect of algorithmic explanations on model interpretability.
Clear evidence of method effectiveness is found in very few cases.
Our results provide the first reliable and comprehensive estimates of how explanations influence simulatability.
arXiv Detail & Related papers (2020-05-04T20:35:17Z)
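The Counterfactual Attentiveness Test summarised above lends itself to a short illustration. The sketch below is a hypothetical rendering under stated assumptions, not the CAT authors' implementation: examples are assumed to be dicts with named fields (e.g. a premise/hypothesis pair), `model_predict` is a placeholder prediction function, and the test swaps one field with the counterpart from a randomly chosen other example and reports how often the prediction changes.

```python
# Minimal sketch (hypothetical, not the CAT authors' code): swap part of each
# input with the counterpart from another example and measure how often the
# model's prediction changes. An attentive model should change its prediction.
import random
from typing import Callable

def counterfactual_attentiveness(
    examples: list[dict],                  # e.g. {"premise": ..., "hypothesis": ...}
    model_predict: Callable[[dict], int],  # placeholder prediction function
    swapped_field: str = "premise",
    seed: int = 0,
) -> float:
    """Return the fraction of examples whose prediction changes when
    swapped_field is replaced by the same field from a different example."""
    rng = random.Random(seed)
    changed = 0
    for ex in examples:
        other = rng.choice(examples)
        counterfactual = dict(ex)
        counterfactual[swapped_field] = other[swapped_field]
        if model_predict(counterfactual) != model_predict(ex):
            changed += 1
    return changed / max(len(examples), 1)
```

A model that is attentive to the swapped field should change its prediction for most examples; a low change rate suggests the model is largely ignoring that part of the input.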