Metrics for saliency map evaluation of deep learning explanation methods
- URL: http://arxiv.org/abs/2201.13291v1
- Date: Mon, 31 Jan 2022 14:59:36 GMT
- Title: Metrics for saliency map evaluation of deep learning explanation methods
- Authors: Tristan Gomez, Thomas Fréour, Harold Mouchère
- Abstract summary: We critically analyze the Deletion Area Under Curve (DAUC) and Insertion Area Under Curve (IAUC) metrics proposed by Petsiuk et al.
These metrics were designed to evaluate the faithfulness of saliency maps generated by generic methods such as Grad-CAM or RISE.
We show that the actual saliency score values given by the saliency map are ignored, as only the ranking of the scores is taken into account.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Due to the black-box nature of deep learning models, methods for
visually explaining CNNs have recently been developed. Given the high cost
of user studies, metrics are necessary to compare and evaluate these different
methods. In this paper, we critically analyze the Deletion Area Under Curve
(DAUC) and Insertion Area Under Curve (IAUC) metrics proposed by Petsiuk et al.
(2018). These metrics were designed to evaluate the faithfulness of saliency
maps generated by generic methods such as Grad-CAM or RISE. First, we show that
the actual saliency score values given by the saliency map are ignored as only
the ranking of the scores is taken into account. This shows that these metrics
are insufficient by themselves, as the visual appearance of a saliency map can
change significantly without the ranking of the scores being modified.
Secondly, we argue that during the computation of DAUC and IAUC, the model is
presented with images that are out of the training distribution, which might
lead to unreliable behavior of the model being explained. To complement
DAUC/IAUC, we propose new metrics
that quantify the sparsity and the calibration of explanation methods, two
previously unstudied properties. Finally, we give general remarks about the
metrics studied in this paper and discuss how to evaluate them in a user study.
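To make these points concrete, below is a minimal sketch of how a deletion-style metric such as DAUC is typically computed; the insertion variant (IAUC) mirrors it by starting from a blank or blurred image and progressively revealing the most salient pixels. The `deletion_auc` helper, the PyTorch classifier, and the erasure baseline are illustrative assumptions, not the reference implementation of Petsiuk et al. The sketch makes both criticisms visible: only the argsort of the saliency scores is used, never their values, and the partially erased images fed to the model lie outside its training distribution.

```python
import numpy as np
import torch

def deletion_auc(model, image, saliency, class_idx, step=0.05, baseline=0.0):
    """Sketch of a deletion-style faithfulness metric (DAUC-like).

    Assumes `model` is a PyTorch classifier returning logits, `image` is a
    (C, H, W) tensor and `saliency` is an (H, W) NumPy array of scores for
    the class `class_idx` (usually the predicted class).
    """
    c, h, w = image.shape
    n_pixels = h * w
    # Only the *ranking* of the saliency scores is used from here on:
    # rescaling the map without reordering pixels cannot change the result.
    order = np.argsort(-saliency.reshape(-1))
    pixels_per_step = max(1, int(step * n_pixels))
    n_steps = int(np.ceil(n_pixels / pixels_per_step))
    erased = image.clone()
    probs = []
    for s in range(n_steps + 1):
        with torch.no_grad():
            logits = model(erased.unsqueeze(0))
            probs.append(torch.softmax(logits, dim=1)[0, class_idx].item())
        # Erase the next batch of most-salient pixels; these partially erased
        # images drift further and further from the training distribution.
        idx = torch.as_tensor(order[s * pixels_per_step:(s + 1) * pixels_per_step].copy())
        erased.view(c, -1)[:, idx] = baseline
    # Area under the class-probability vs. fraction-of-pixels-removed curve.
    return np.trapz(probs, dx=1.0 / n_steps)
```

Because only `np.argsort` of the map is used, any rescaling of the saliency values that preserves the pixel ordering leaves the returned score unchanged, which is exactly why the paper argues DAUC/IAUC cannot by themselves distinguish visually very different maps.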
Related papers
- Schroedinger's Threshold: When the AUC doesn't predict Accuracy [6.091702876917282]
The Area Under Curve measure (AUC) seems apt for evaluating and comparing diverse models.
We show that the AUC yields an academic and optimistic notion of accuracy that can misalign with the actual accuracy observed in application.
arXiv Detail & Related papers (2024-04-04T10:18:03Z)
- Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study whether reference-free metrics have any deficiencies.
We employ GPT-4V as an evaluative tool to assess generated sentences and the result reveals that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
- Evaluation of FEM and MLFEM AI-explainers in Image Classification tasks with reference-based and no-reference metrics [0.0]
We revisit the recently proposed post-hoc explainers FEM and MLFEM, which were designed to explain CNNs in image and video classification tasks.
We propose their evaluation with reference-based and no-reference metrics.
As a no-reference metric we use the "stability" metric proposed by Alvarez-Melis and Jaakkola.
arXiv Detail & Related papers (2022-12-02T14:55:31Z)
- ATCON: Attention Consistency for Vision Models [0.8312466807725921]
We propose an unsupervised fine-tuning method that improves the consistency of attention maps.
We show results on Grad-CAM and Integrated Gradients in an ablation study.
Those improved attention maps may help clinicians better understand vision model predictions.
arXiv Detail & Related papers (2022-10-18T09:30:20Z)
- Attributing AUC-ROC to Analyze Binary Classifier Performance [13.192005156790302]
We discuss techniques to segment the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) along human-interpretable dimensions.
AUC-ROC is not an additive/linear function over the data samples, so segmenting the overall AUC-ROC is different from tabulating the AUC-ROC of individual data segments (a small numeric sketch of this non-additivity appears after this list).
arXiv Detail & Related papers (2022-05-24T04:42:52Z)
- CIM: Class-Irrelevant Mapping for Few-Shot Classification [58.02773394658623]
Few-shot classification (FSC) has been one of the most actively studied problems in recent years.
How to appraise the pre-trained FEM is the most crucial question in the FSC community.
We propose a simple, flexible method, dubbed as Class-Irrelevant Mapping (CIM)
arXiv Detail & Related papers (2021-09-07T03:26:24Z)
- CAMERAS: Enhanced Resolution And Sanity preserving Class Activation Mapping for image saliency [61.40511574314069]
Backpropagation image saliency aims at explaining model predictions by estimating model-centric importance of individual pixels in the input.
We propose CAMERAS, a technique to compute high-fidelity backpropagation saliency maps without requiring any external priors.
arXiv Detail & Related papers (2021-06-20T08:20:56Z)
- A Sober Look at the Unsupervised Learning of Disentangled Representations and their Evaluation [63.042651834453544]
We show that the unsupervised learning of disentangled representations is impossible without inductive biases on both the models and the data.
We observe that while the different methods successfully enforce properties "encouraged" by the corresponding losses, well-disentangled models seemingly cannot be identified without supervision.
Our results suggest that future work on disentanglement learning should be explicit about the role of inductive biases and (implicit) supervision.
arXiv Detail & Related papers (2020-10-27T10:17:15Z)
- Evaluation Metrics for Conditional Image Generation [100.69766435176557]
We present two new metrics for evaluating generative models in the class-conditional image generation setting.
A theoretical analysis shows the motivation behind each proposed metric and links the novel metrics to their unconditional counterparts.
We provide an extensive empirical evaluation, comparing the metrics to their unconditional variants and to other metrics, and utilize them to analyze existing generative models.
arXiv Detail & Related papers (2020-04-26T12:15:16Z)
- Uncertainty based Class Activation Maps for Visual Question Answering [30.859101872119517]
We propose a method that obtains gradient-based certainty estimates that also provide visual attention maps.
We incorporate modern probabilistic deep learning methods that we further improve by using the gradients for these estimates.
The proposed technique can be thought of as a recipe for obtaining improved certainty estimates and explanations for deep learning models.
arXiv Detail & Related papers (2020-01-23T19:54:19Z)
- Towards GAN Benchmarks Which Require Generalization [48.075521136623564]
We argue that estimating the function must require a large sample from the model.
We turn to neural network divergences (NNDs) which are defined in terms of a neural network trained to distinguish between distributions.
The resulting benchmarks cannot be "won" by training set memorization, while still being perceptually correlated and computable only from samples.
arXiv Detail & Related papers (2020-01-10T20:18:47Z)
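As noted in the AUC-ROC segmentation entry above, here is a small numeric sketch (with made-up labels and scores) of why the pooled AUC-ROC over a dataset is generally not a weighted average of per-segment AUC-ROCs: sample pairs that straddle two segments contribute to the pooled ranking but to neither per-segment value.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Two hypothetical data segments (e.g. two user groups); labels and scores are made up.
y_a = np.array([0, 0, 1, 1]); s_a = np.array([0.10, 0.40, 0.35, 0.80])
y_b = np.array([0, 1, 1, 0]); s_b = np.array([0.05, 0.30, 0.25, 0.02])

auc_a = roc_auc_score(y_a, s_a)                       # 0.75   within segment A
auc_b = roc_auc_score(y_b, s_b)                       # 1.0    within segment B
auc_all = roc_auc_score(np.concatenate([y_a, y_b]),   # 0.8125 over the pooled data
                        np.concatenate([s_a, s_b]))

# 0.8125 != (0.75 + 1.0) / 2: cross-segment pairs only exist in the pooled ranking.
print(auc_a, auc_b, auc_all)
```

With these made-up numbers the per-segment AUCs are 0.75 and 1.0 while the pooled AUC is 0.8125, not their average of 0.875, illustrating the non-additivity that the attribution paper above sets out to handle.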