Fixing confirmation bias in feature attribution methods via semantic match
- URL: http://arxiv.org/abs/2307.00897v3
- Date: Mon, 26 Feb 2024 10:34:10 GMT
- Title: Fixing confirmation bias in feature attribution methods via semantic match
- Authors: Giovanni Cinà, Daniel Fernandez-Llaneza, Ludovico Deponte, Nishant Mishra, Tabea E. Röber, Sandro Pezzelle, Iacer Calixto, Rob Goedhart, Ş. İlker Birbil
- Abstract summary: We argue that a structured approach is required to test whether our hypotheses about the model are confirmed by the feature attributions.
This is what we call the "semantic match" between human concepts and (sub-symbolic) explanations.
- Score: 4.733072355085082
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Feature attribution methods have become a staple for disentangling the complex behavior of black-box models. Despite their success, some scholars have argued that such methods suffer from a serious flaw: they do not allow a reliable interpretation in terms of human concepts. Simply put, visualizing an array of feature contributions is not enough for humans to conclude something about a model's internal representations, and confirmation bias can trick users into false beliefs about model behavior. We argue that a structured approach is required to test whether our hypotheses about the model are confirmed by the feature attributions. This is what we call the "semantic match" between human concepts and (sub-symbolic) explanations. Building on the conceptual framework put forward in Cinà et al. [2023], we propose a structured approach to evaluate semantic match in practice. We showcase the procedure in a suite of experiments spanning tabular and image data, and show how the assessment of semantic match can give insight into both desirable (e.g., focusing on an object relevant for prediction) and undesirable model behaviors (e.g., focusing on a spurious correlation). We couple our experimental results with an analysis of the metrics used to measure semantic match, and argue that this approach constitutes the first step towards resolving the issue of confirmation bias in XAI.
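As an illustration of the general idea (not the authors' implementation), the sketch below shows one way semantic match could be scored in practice: compare feature attributions against a human-annotated binary concept mask and quantify their agreement. The function name, the choice of ROC-AUC and Spearman rank correlation as metrics, and the toy data are assumptions made for this example.

```python
# Minimal sketch (assumed, not the paper's code): quantify semantic match
# between feature attributions and a human-annotated concept mask.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import roc_auc_score


def semantic_match_scores(attributions, concept_mask):
    """attributions: array of attribution scores (e.g., per pixel or feature).
    concept_mask: binary array of the same shape marking where the human
    concept (e.g., the object hypothesized to drive the prediction) lies."""
    a = np.asarray(attributions, dtype=float).ravel()
    m = np.asarray(concept_mask).ravel().astype(int)
    # ROC-AUC: do large attributions concentrate inside the concept region?
    auc = roc_auc_score(m, a)
    # Spearman rho: does the ranking of attributions track the concept mask?
    rho, _ = spearmanr(a, m)
    return {"auc": auc, "spearman_rho": rho}


# Toy usage: an attribution map that mostly highlights the concept region.
rng = np.random.default_rng(0)
mask = np.zeros((8, 8))
mask[2:5, 2:5] = 1
attr = 0.8 * mask + rng.normal(0.0, 0.1, size=(8, 8))
print(semantic_match_scores(attr, mask))  # high AUC/rho -> hypothesis supported
```

In this framing, a score near chance would warn the user that "seeing" the concept in the attribution map is likely confirmation bias rather than evidence about the model.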
Related papers
- Counterfactual Generation from Language Models [64.55296662926919]
We show that counterfactual reasoning is conceptually distinct from interventions.
We propose a framework for generating true string counterfactuals.
Our experiments demonstrate that the approach produces meaningful counterfactuals.
arXiv Detail & Related papers (2024-11-11T17:57:30Z) - Evaluating the Robustness of Interpretability Methods through
Explanation Invariance and Equivariance [72.50214227616728]
Interpretability methods are valuable only if their explanations faithfully describe the explained model.
We consider neural networks whose predictions are invariant under a specific symmetry group.
arXiv Detail & Related papers (2023-04-13T17:59:03Z) - ContraFeat: Contrasting Deep Features for Semantic Discovery [102.4163768995288]
StyleGAN has shown strong potential for disentangled semantic control.
Existing semantic discovery methods on StyleGAN rely on manual selection of modified latent layers to obtain satisfactory manipulation results.
We propose a model that automates this process and achieves state-of-the-art semantic discovery performance.
arXiv Detail & Related papers (2022-12-14T15:22:13Z) - Logical Satisfiability of Counterfactuals for Faithful Explanations in
NLI [60.142926537264714]
We introduce the methodology of Faithfulness-through-Counterfactuals.
It generates a counterfactual hypothesis based on the logical predicates expressed in the explanation.
It then evaluates whether the model's prediction on the counterfactual is consistent with that expressed logic.
arXiv Detail & Related papers (2022-05-25T03:40:59Z) - Explain, Edit, and Understand: Rethinking User Study Design for
Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z) - Counterfactual Evaluation for Explainable AI [21.055319253405603]
We propose a new methodology to evaluate the faithfulness of explanations from the counterfactual reasoning perspective.
We introduce two algorithms to find the proper counterfactuals in both discrete and continuous scenarios and then use the acquired counterfactuals to measure faithfulness (a generic sketch of this kind of check appears after the list).
arXiv Detail & Related papers (2021-09-05T01:38:49Z) - On the Lack of Robust Interpretability of Neural Text Classifiers [14.685352584216757]
We assess the robustness of interpretations of neural text classifiers based on pretrained Transformer encoders.
Both tests show surprising deviations from expected behavior, raising questions about the extent of insights that practitioners may draw from interpretations.
arXiv Detail & Related papers (2021-06-08T18:31:02Z) - Robust Semantic Interpretability: Revisiting Concept Activation Vectors [0.0]
Interpretability methods for image classification attempt to expose whether the model is systematically biased or attending to the same cues as a human would.
Our proposed Robust Concept Activation Vectors (RCAV) quantifies the effects of semantic concepts on individual model predictions and on model behavior as a whole.
arXiv Detail & Related papers (2021-04-06T20:14:59Z) - Pair the Dots: Jointly Examining Training History and Test Stimuli for
Model Interpretability [44.60486560836836]
Any prediction from a model is made by a combination of learning history and test stimuli.
Existing methods to interpret a model's predictions are only able to capture a single aspect of either test stimuli or learning history.
We propose an efficient and differentiable approach to make it feasible to interpret a model's prediction by jointly examining training history and test stimuli.
arXiv Detail & Related papers (2020-10-14T10:45:01Z) - Evaluating the Disentanglement of Deep Generative Models through
Manifold Topology [66.06153115971732]
We present a method for quantifying disentanglement that only uses the generative model.
We empirically evaluate several state-of-the-art models across multiple datasets.
arXiv Detail & Related papers (2020-06-05T20:54:11Z)
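For the counterfactual-based faithfulness checks mentioned in several entries above, the following is a generic, hypothetical sketch (not any listed paper's algorithm): replace the features an explanation rates as most important with baseline values and compare the resulting confidence drop against a random-feature baseline. The function name, the top-k/baseline perturbation scheme, and the scikit-learn-style predict_proba interface are assumptions.

```python
# Hypothetical sketch of a counterfactual-style faithfulness check for a
# tabular classifier; not taken from any of the papers listed above.
import numpy as np


def faithfulness_drop(model, x, attributions, baseline, k=3, seed=0):
    """Replace the k features rated most important by the explanation with
    baseline values and measure the drop in predicted probability. A faithful
    explanation should cause a larger drop than perturbing k random features."""
    x = np.asarray(x, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    top_k = np.argsort(-np.abs(attributions))[:k]

    x_top = x.copy()
    x_top[top_k] = baseline[top_k]

    rng = np.random.default_rng(seed)
    rand_k = rng.choice(len(x), size=k, replace=False)
    x_rand = x.copy()
    x_rand[rand_k] = baseline[rand_k]

    p_orig = model.predict_proba(x[None, :])[0, 1]
    return {
        "drop_top_k": p_orig - model.predict_proba(x_top[None, :])[0, 1],
        "drop_random_k": p_orig - model.predict_proba(x_rand[None, :])[0, 1],
    }
```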