Evaluation of FEM and MLFEM AI-explainers in Image Classification tasks
with reference-based and no-reference metrics
- URL: http://arxiv.org/abs/2212.01222v1
- Date: Fri, 2 Dec 2022 14:55:31 GMT
- Title: Evaluation of FEM and MLFEM AI-explainers in Image Classification tasks
with reference-based and no-reference metrics
- Authors: A. Zhukov, J. Benois-Pineau, R. Giot
- Abstract summary: We review the recently proposed post-hoc explainers FEM and MLFEM, which were designed to explain CNNs in image and video classification tasks.
We propose their evaluation with reference-based and no-reference metrics.
As a no-reference metric we use the "stability" metric proposed by Alvarez-Melis and Jaakkola.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The most popular AI methods and algorithms are, for the vast majority,
black boxes. Black boxes can be an acceptable solution for problems of low impact,
but they have a fatal flaw for the rest. Explanation tools for them have therefore
been developed quickly, yet the evaluation of their quality remains an open research
question. In this technical report, we review the recently proposed post-hoc
explainers FEM and MLFEM, which were designed to explain CNNs in image and video
classification tasks. We also propose their evaluation with reference-based and
no-reference metrics. The reference-based metrics are the Pearson Correlation
Coefficient and Similarity, computed between the explanation maps and the ground
truth, which is represented by Gaze Fixation Density Maps obtained from a
psycho-visual experiment. As a no-reference metric we use the "stability" metric
proposed by Alvarez-Melis and Jaakkola. We study its behaviour and its consensus
with the reference-based metrics, and show that for several kinds of degradations
of the input images this metric agrees with the reference-based ones. It can
therefore be used to evaluate the quality of explainers when the ground truth is
not available.
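To make the evaluation protocol concrete, below is a minimal NumPy sketch (not the authors' code) of the two reference-based metrics, the Pearson Correlation Coefficient and Similarity (histogram intersection) between an explanation map and a Gaze Fixation Density Map, together with a local-Lipschitz estimate in the spirit of the Alvarez-Melis and Jaakkola "stability" metric. The `explainer` callable, the Gaussian perturbation scheme and the constants are illustrative assumptions.

```python
import numpy as np

def pcc(expl_map, gfdm):
    """Pearson correlation between an explanation map and a gaze fixation density map."""
    x = expl_map.ravel().astype(float)
    y = gfdm.ravel().astype(float)
    x = (x - x.mean()) / (x.std() + 1e-12)
    y = (y - y.mean()) / (y.std() + 1e-12)
    return float(np.mean(x * y))

def sim(expl_map, gfdm):
    """Similarity (histogram intersection): both maps are normalized to sum to 1."""
    p = expl_map.ravel().astype(float)
    q = gfdm.ravel().astype(float)
    p = p / (p.sum() + 1e-12)
    q = q / (q.sum() + 1e-12)
    return float(np.minimum(p, q).sum())

def stability(explainer, image, n_samples=16, sigma=0.01, rng=None):
    """No-reference 'stability': largest ratio ||E(x) - E(x')|| / ||x - x'||
    over random perturbations x' of x (a local Lipschitz estimate)."""
    rng = np.random.default_rng() if rng is None else rng
    base = explainer(image).ravel()
    worst = 0.0
    for _ in range(n_samples):
        noisy = image + rng.normal(0.0, sigma, size=image.shape)
        num = np.linalg.norm(explainer(noisy).ravel() - base)
        den = np.linalg.norm((noisy - image).ravel()) + 1e-12
        worst = max(worst, num / den)
    return worst  # lower values indicate more stable explanations
```

Higher PCC and Similarity values indicate closer agreement with the gaze maps; for the stability estimate, a lower value indicates an explainer that is more robust to small input perturbations.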
Related papers
- Classification Metrics for Image Explanations: Towards Building Reliable XAI-Evaluations [0.24578723416255752]
Saliency methods provide (super-)pixelwise feature attribution scores for input images.
New evaluation metrics for saliency methods are developed and common saliency methods are benchmarked on ImageNet.
A scheme for reliability evaluation of such metrics is proposed that is based on concepts from psychometric testing.
arXiv Detail & Related papers (2024-06-07T16:37:50Z)
- A Closer Look at Classification Evaluation Metrics and a Critical Reflection of Common Evaluation Practice [6.091702876917282]
Classification systems are evaluated in countless papers.
However, we find that evaluation practice is often nebulous.
Many works use so-called 'macro' metrics to rank systems but do not clearly specify what they expect from such a metric (see the sketch after this entry).
arXiv Detail & Related papers (2024-04-25T18:12:43Z)
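To illustrate why 'macro' metrics need to be spelled out, here is a small scikit-learn example on an invented, imbalanced three-class problem; the label counts are made up for illustration.

```python
from sklearn.metrics import f1_score

# Imbalanced toy data: class 2 is rare and never predicted correctly.
y_true = [0] * 50 + [1] * 45 + [2] * 5
y_pred = [0] * 50 + [1] * 45 + [0] * 5   # the rare class is always mapped to class 0

# Micro-averaging is dominated by the frequent classes (~0.95 here),
# while macro-averaging weights each class equally (~0.65 here).
print("micro F1:", f1_score(y_true, y_pred, average="micro", zero_division=0))
print("macro F1:", f1_score(y_true, y_pred, average="macro", zero_division=0))
```

Note also that "macro F1" itself is ambiguous: averaging per-class F1 scores differs from computing F1 from macro-averaged precision and recall, which is exactly the kind of under-specification the paper warns about.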
- Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study if there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences and the result reveals that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
- Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations (a multi-reference scoring sketch follows this entry).
arXiv Detail & Related papers (2023-08-06T14:49:26Z)
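A hedged sketch of multi-reference scoring, assuming the standard corpus-level API of the sacrebleu package; the sentences are invented. Each extra reference stream enlarges the set of n-grams a hypothesis may legitimately match, which is how additional references can raise agreement with human judgements.

```python
import sacrebleu

hypotheses = ["the cat sits on the mat"]

# Each reference stream is parallel to the hypotheses; extra streams add diversity.
refs_single = [["a cat is sitting on the mat"]]
refs_multi = [["a cat is sitting on the mat"],
              ["the cat sits on the mat"]]

print("BLEU, 1 ref :", sacrebleu.corpus_bleu(hypotheses, refs_single).score)
print("BLEU, 2 refs:", sacrebleu.corpus_bleu(hypotheses, refs_multi).score)
print("chrF, 2 refs:", sacrebleu.corpus_chrf(hypotheses, refs_multi).score)
```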
- DCID: Deep Canonical Information Decomposition [84.59396326810085]
We consider the problem of identifying the signal shared between two one-dimensional target variables.
We propose ICM, an evaluation metric which can be used in the presence of ground-truth labels.
We also propose Deep Canonical Information Decomposition (DCID) - a simple, yet effective approach for learning the shared variables.
arXiv Detail & Related papers (2023-06-27T16:59:06Z)
- On the Blind Spots of Model-Based Evaluation Metrics for Text Generation [79.01422521024834]
We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics.
We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores (a stress-test sketch follows this entry).
Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics.
arXiv Detail & Related papers (2022-12-20T06:24:25Z)
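The methodology can be sketched as a simple stress test: inject a controlled error into a candidate text, rescore it, and check whether the metric drops by a commensurate amount. In the sketch below, `metric_score`, the negation-based corruption, and the toy lexical-overlap metric are all invented placeholders used to expose a blind spot.

```python
def stress_test(metric_score, source, reference, candidate, corrupt):
    """Apply a synthetic corruption to a candidate and report the metric's reaction.
    `metric_score(source, reference, candidate) -> float` is a placeholder for any
    model-based evaluation metric."""
    clean_score = metric_score(source, reference, candidate)
    corrupted_score = metric_score(source, reference, corrupt(candidate))
    return {"clean": clean_score,
            "corrupted": corrupted_score,
            "drop": clean_score - corrupted_score}

# Invented corruption: flip the polarity of the sentence.
negate = lambda text: text.replace(" is ", " is not ", 1)

# A dummy lexical-overlap "metric" that will expose the blind spot.
overlap = lambda src, ref, cand: len(set(ref.split()) & set(cand.split())) / len(set(ref.split()))

report = stress_test(overlap, source="...",
                     reference="the service is excellent",
                     candidate="the service is excellent",
                     corrupt=negate)
print(report)  # the negation does not lower the overlap score at all: an insensitivity
```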
- Rethinking Knowledge Graph Evaluation Under the Open-World Assumption [65.20527611711697]
Most knowledge graphs (KGs) are incomplete, which motivates an important line of research on automatic knowledge graph completion (KGC).
Treating all unknown triplets as false is called the closed-world assumption (a ranking sketch under this assumption follows this entry).
In this paper, we study KGC evaluation under a more realistic setting, namely the open-world assumption.
arXiv Detail & Related papers (2022-09-19T09:01:29Z)
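To make the closed-world assumption concrete, here is a minimal sketch (with invented triplets and a dummy scoring function) of filtered link-prediction ranking: every candidate tail not present in the known triplet set is treated as false, even if it is merely unknown, which is exactly the assumption the paper questions.

```python
def mean_reciprocal_rank(test_triplets, entities, known_triplets, score):
    """Closed-world filtered MRR for tail prediction: any (h, r, t') missing from
    `known_triplets` counts as a false triplet, even if it is merely unknown."""
    total = 0.0
    for h, r, t in test_triplets:
        # Filtered setting: keep the target tail and all tails not known to be true.
        candidates = [e for e in entities
                      if e == t or (h, r, e) not in known_triplets]
        ranked = sorted(candidates, key=lambda e: score(h, r, e), reverse=True)
        total += 1.0 / (ranked.index(t) + 1)
    return total / len(test_triplets)

# Invented toy KG and a dummy scoring function standing in for a KGC model.
known = {("alice", "knows", "bob"), ("bob", "knows", "carol")}
test = [("alice", "knows", "bob")]
entities = ["alice", "bob", "carol", "dave"]
score = lambda h, r, t: {"bob": 0.9, "carol": 0.8, "dave": 0.1, "alice": 0.0}[t]
print(mean_reciprocal_rank(test, entities, known, score))  # 1.0 under the closed world
```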
- The Solvability of Interpretability Evaluation Metrics [7.3709604810699085]
Feature attribution methods are often evaluated on metrics such as comprehensiveness and sufficiency.
In this paper, we highlight an intriguing property of these metrics: their solvability, i.e., they can be optimized directly by a beam search over the input features.
We present a series of investigations showing that the resulting beam-search explainer is generally comparable or favorable to current choices (a greedy variant is sketched after this entry).
arXiv Detail & Related papers (2022-05-18T02:52:03Z)
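A hedged sketch of what "solvability" means in practice: if comprehensiveness is the confidence drop observed when the attributed tokens are removed, a plain beam search over token subsets can optimize the metric directly, querying the model only as a black box. The `predict_proba` callable, the beam width and the subset size are illustrative assumptions, not the paper's exact procedure.

```python
def comprehensiveness(predict_proba, tokens, label, selected):
    """Confidence drop when the selected token indices are removed from the input.
    `predict_proba(tokens)` returns class probabilities indexable by `label`."""
    reduced = [t for i, t in enumerate(tokens) if i not in selected]
    return predict_proba(tokens)[label] - predict_proba(reduced)[label]

def beam_search_explainer(predict_proba, tokens, label, k=3, beam_width=5):
    """Pick k tokens whose removal maximizes comprehensiveness via beam search."""
    beams = [frozenset()]
    for _ in range(k):
        # Expand every beam by one additional token index.
        candidates = {b | {i} for b in beams for i in range(len(tokens)) if i not in b}
        beams = sorted(candidates,
                       key=lambda s: comprehensiveness(predict_proba, tokens, label, s),
                       reverse=True)[:beam_width]
    return max(beams, key=lambda s: comprehensiveness(predict_proba, tokens, label, s))
```

By construction the selected subset scores well on comprehensiveness, which is why an explainer built this way can compete with gradient- or perturbation-based attribution methods on that metric.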
- Metrics for saliency map evaluation of deep learning explanation methods [0.0]
We critically analyze the Deletion Area Under Curve (DAUC) and Insertion Area Under Curve (IAUC) metrics proposed by Petsiuk et al.
These metrics were designed to evaluate the faithfulness of saliency maps generated by generic methods such as Grad-CAM or RISE.
We show that the actual saliency score values given by the saliency map are ignored, as only the ranking of the scores is taken into account (see the deletion-curve sketch after this entry).
arXiv Detail & Related papers (2022-01-31T14:59:36Z)
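A minimal sketch of the deletion curve behind DAUC, assuming a single-channel image, a saliency map of the same shape, and a placeholder `model` that returns class probabilities. It makes the summary's point explicit: pixels are processed purely in rank order of saliency, so the magnitudes of the saliency scores never enter the computation.

```python
import numpy as np

def deletion_auc(model, image, saliency, label, steps=20, baseline=0.0):
    """Deletion curve: zero out pixels in decreasing order of saliency *rank*
    and average the class probability along the curve (lower is better)."""
    flat = image.reshape(-1).copy()
    order = np.argsort(saliency.reshape(-1))[::-1]   # only the ranking is used
    chunk = max(1, len(order) // steps)
    probs = [model(flat.reshape(image.shape))[label]]
    for start in range(0, len(order), chunk):
        flat[order[start:start + chunk]] = baseline  # remove the most salient pixels first
        probs.append(model(flat.reshape(image.shape))[label])
    return float(np.mean(probs))  # uniform-step approximation of the area under the curve
```

Replacing `saliency` by any monotone transformation of it (e.g. its ranks) leaves the result unchanged, which is the insensitivity analyzed in the paper.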
- Evaluation Metrics for Conditional Image Generation [100.69766435176557]
We present two new metrics for evaluating generative models in the class-conditional image generation setting.
A theoretical analysis shows the motivation behind each proposed metric and links the novel metrics to their unconditional counterparts.
We provide an extensive empirical evaluation, comparing the metrics to their unconditional variants and to other metrics, and utilize them to analyze existing generative models.
arXiv Detail & Related papers (2020-04-26T12:15:16Z)