Towards a Theory of Faithfulness: Faithful Explanations of
Differentiable Classifiers over Continuous Data
- URL: http://arxiv.org/abs/2205.09620v1
- Date: Thu, 19 May 2022 15:34:21 GMT
- Title: Towards a Theory of Faithfulness: Faithful Explanations of
Differentiable Classifiers over Continuous Data
- Authors: Nico Potyka, Xiang Yin, Francesca Toni
- Abstract summary: We propose two formal definitions of faithfulness for feature attribution methods.
Qualitative faithfulness demands that scores reflect the true qualitative effect (positive vs. negative) of the feature on the model.
We experimentally demonstrate that popular attribution methods can fail to give faithful explanations in the setting where the data is continuous.
- Score: 17.9926469947157
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There is broad agreement in the literature that explanation methods should be
faithful to the model that they explain, but faithfulness remains a rather
vague term. We revisit faithfulness in the context of continuous data and
propose two formal definitions of faithfulness for feature attribution methods.
Qualitative faithfulness demands that scores reflect the true qualitative
effect (positive vs. negative) of the feature on the model, and quantitative
faithfulness that the magnitudes of scores reflect the true quantitative effect.
We discuss under which conditions, and to which extent (local vs. global), these
requirements can be satisfied. As an application of the conceptual idea, we look at
differentiable classifiers over continuous data and characterize
Gradient-scores as follows: every qualitatively faithful feature attribution
method is qualitatively equivalent to Gradient-scores. Furthermore, if an
attribution method is quantitatively faithful in the sense that changes of the
output of the classifier are proportional to the scores of features, then it is
either equivalent to gradient-scoring or it is based on an inferior
approximation of the classifier. To illustrate the practical relevance of the
theory, we experimentally demonstrate that popular attribution methods can fail
to give faithful explanations in the setting where the data is continuous and
the classifier differentiable.
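To make the Gradient-scores characterization concrete, the sketch below computes gradient scores for a simple differentiable classifier over continuous data. The logistic-regression classifier, its weights, the input point, and the perturbation size are illustrative assumptions rather than the paper's setup; the sketch only checks numerically that the sign of each gradient score matches the direction of the output change (qualitative faithfulness) and that small output changes are approximately proportional to the scores (local quantitative faithfulness).
```python
import numpy as np

# A minimal differentiable classifier over continuous data:
# logistic regression p(x) = sigmoid(w . x + b).
# The weights, bias, input point, and step size below are illustrative.
w = np.array([1.5, -2.0, 0.5])
b = 0.1

def classifier(x):
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def gradient_scores(x):
    # Exact gradient of the sigmoid output with respect to the input features.
    p = classifier(x)
    return p * (1.0 - p) * w

x = np.array([0.2, 0.4, -0.3])
scores = gradient_scores(x)
eps = 1e-4

for i in range(len(x)):
    x_shift = x.copy()
    x_shift[i] += eps
    delta = classifier(x_shift) - classifier(x)
    # Qualitative check: the sign of the score matches the direction of the output change.
    assert np.sign(delta) == np.sign(scores[i])
    # Local quantitative check: the output change is approximately score * eps.
    print(f"feature {i}: delta={delta:.3e}, score*eps={scores[i] * eps:.3e}")
```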
Related papers
- Unlearning-based Neural Interpretations [51.99182464831169]
We show that current baselines defined using static functions are biased, fragile and manipulable.
We propose UNI to compute an (un)learnable, debiased and adaptive baseline by perturbing the input towards an unlearning direction of steepest ascent.
arXiv Detail & Related papers (2024-10-10T16:02:39Z)
- Enriching Disentanglement: From Logical Definitions to Quantitative Metrics [59.12308034729482]
Disentangling the explanatory factors in complex data is a promising approach for data-efficient representation learning.
We establish relationships between logical definitions and quantitative metrics to derive theoretically grounded disentanglement metrics.
We empirically demonstrate the effectiveness of the proposed metrics by isolating different aspects of disentangled representations.
arXiv Detail & Related papers (2023-05-19T08:22:23Z)
- Measuring Implicit Bias Using SHAP Feature Importance and Fuzzy Cognitive Maps [1.9739269019020032]
In this paper, we integrate the concepts of feature importance with implicit bias in the context of pattern classification.
The amount of bias towards protected features might differ depending on whether the features are numerically or categorically encoded.
arXiv Detail & Related papers (2023-05-16T12:31:36Z)
- Comparing Explanation Methods for Traditional Machine Learning Models Part 2: Quantifying Model Explainability Faithfulness and Improvements with Dimensionality Reduction [0.0]
"faithfulness" or "fidelity" refer to the correspondence between the assigned feature importance and the contribution of the feature to model performance.
This study is one of the first to quantify the improvement in explainability from limiting correlated features and knowing the relative fidelity of different explainability methods.
arXiv Detail & Related papers (2022-11-18T17:15:59Z)
- Logical Satisfiability of Counterfactuals for Faithful Explanations in NLI [60.142926537264714]
We introduce the methodology of Faithfulness-through-Counterfactuals.
It generates a counterfactual hypothesis based on the logical predicates expressed in the explanation.
It then evaluates if the model's prediction on the counterfactual is consistent with that expressed logic.
arXiv Detail & Related papers (2022-05-25T03:40:59Z)
- Discriminative Attribution from Counterfactuals [64.94009515033984]
We present a method for neural network interpretability by combining feature attribution with counterfactual explanations.
We show that this method can be used to quantitatively evaluate the performance of feature attribution methods in an objective manner.
arXiv Detail & Related papers (2021-09-28T00:53:34Z)
- Counterfactual Evaluation for Explainable AI [21.055319253405603]
We propose a new methodology to evaluate the faithfulness of explanations from the counterfactual reasoning perspective.
We introduce two algorithms to find the proper counterfactuals in both discrete and continuous scenarios and then use the acquired counterfactuals to measure faithfulness.
arXiv Detail & Related papers (2021-09-05T01:38:49Z)
- Towards Unifying Feature Attribution and Counterfactual Explanations: Different Means to the Same End [17.226134854746267]
We present a method to generate feature attribution explanations from a set of counterfactual examples.
We show how counterfactual examples can be used to evaluate the goodness of an attribution-based explanation in terms of its necessity and sufficiency.
arXiv Detail & Related papers (2020-11-10T05:41:43Z)
- Learning Disentangled Representations with Latent Variation Predictability [102.4163768995288]
This paper defines the variation predictability of latent disentangled representations.
Within an adversarial generation process, we encourage variation predictability by maximizing the mutual information between latent variations and corresponding image pairs.
We develop an evaluation metric that does not rely on the ground-truth generative factors to measure the disentanglement of latent representations.
arXiv Detail & Related papers (2020-07-25T08:54:26Z)
- Achieving Equalized Odds by Resampling Sensitive Attributes [13.114114427206678]
We present a flexible framework for learning predictive models that approximately satisfy the equalized odds notion of fairness.
This differentiable functional is used as a penalty driving the model parameters towards equalized odds.
We develop a formal hypothesis test to detect whether a prediction rule violates this property, the first such test in the literature.
arXiv Detail & Related papers (2020-06-08T00:18:34Z)
- Towards Faithfully Interpretable NLP Systems: How should we define and evaluate faithfulness? [58.13152510843004]
With the growing popularity of deep-learning-based NLP models comes a need for interpretable systems.
What is interpretability, and what constitutes a high-quality interpretation?
We call for more clearly differentiating between the different desired criteria an interpretation should satisfy, and focus on the faithfulness criterion.
arXiv Detail & Related papers (2020-04-07T20:15:28Z)