An Experimental Investigation into the Evaluation of Explainability
Methods
- URL: http://arxiv.org/abs/2305.16361v1
- Date: Thu, 25 May 2023 08:07:07 GMT
- Title: An Experimental Investigation into the Evaluation of Explainability
Methods
- Authors: Sédrick Stassin, Alexandre Englebert, Géraldin Nanfack, Julien
Albert, Nassim Versbraegen, Gilles Peiffer, Miriam Doh, Nicolas Riche,
Benoît Frenay, Christophe De Vleeschouwer
- Abstract summary: This work compares 14 different metrics when applied to nine state-of-the-art XAI methods and three dummy methods (e.g., random saliency maps) used as references.
Experimental results show which of these metrics produces highly correlated results, indicating potential redundancy.
- Score: 60.54170260771932
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: EXplainable Artificial Intelligence (XAI) aims to help users grasp
the reasoning behind the predictions of an Artificial Intelligence (AI) system.
Many XAI approaches have emerged in recent years. Consequently, a subfield
related to the evaluation of XAI methods has gained considerable attention,
with the aim to determine which methods provide the best explanation using
various approaches and criteria. However, the literature lacks a comparison of
the evaluation metrics themselves that are used to assess XAI methods.
This work aims to fill this gap by comparing 14 different metrics when applied
to nine state-of-the-art XAI methods and three dummy methods (e.g., random
saliency maps) used as references. Experimental results show which of these
metrics produces highly correlated results, indicating potential redundancy. We
also demonstrate the significant impact of varying the baseline hyperparameter
on the evaluation metric values. Finally, we use dummy methods to assess the
reliability of metrics in terms of ranking, pointing out their limitations.
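The redundancy analysis described in the abstract amounts to correlating the scores that different metrics assign to the same set of XAI methods. A minimal sketch of that idea, assuming illustrative metric names and placeholder scores (this is not the paper's code): rank-correlate every pair of metric score vectors and flag highly correlated pairs as potentially redundant.

```python
from itertools import combinations

def rank(xs):
    """Return the rank (0-based position in sorted order) of each value."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    for pos, i in enumerate(order):
        ranks[i] = float(pos)
    return ranks

def spearman(a, b):
    """Spearman rank correlation between two score vectors (no tie handling)."""
    ra, rb = rank(a), rank(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5

# Placeholder scores: each vector holds one metric's score for each XAI
# method under study (including a dummy such as a random saliency map).
scores = {
    "metric_A": [0.9, 0.7, 0.1],  # e.g. a faithfulness-style metric
    "metric_B": [0.8, 0.6, 0.2],  # ranks methods identically to metric_A
    "metric_C": [0.1, 0.9, 0.5],  # ranks methods differently
}

# Flag metric pairs whose rankings nearly coincide as potentially redundant.
for m1, m2 in combinations(scores, 2):
    rho = spearman(scores[m1], scores[m2])
    if abs(rho) > 0.9:
        print(f"{m1} vs {m2}: rho={rho:.2f} (potentially redundant)")
```

With the placeholder scores above, `metric_A` and `metric_B` rank the methods identically, so the sketch would flag that pair; a production analysis would use a tie-aware implementation such as `scipy.stats.spearmanr`.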
Related papers
- Navigating the Maze of Explainable AI: A Systematic Approach to Evaluating Methods and Metrics [10.045644410833402]
We introduce LATEC, a large-scale benchmark that critically evaluates 17 prominent XAI methods using 20 distinct metrics.
We showcase the high risk of conflicting metrics leading to unreliable rankings and consequently propose a more robust evaluation scheme.
LATEC reinforces its role in future XAI research by publicly releasing all 326k saliency maps and 378k metric scores as a (meta-evaluation) dataset.
arXiv Detail & Related papers (2024-09-25T09:07:46Z) - BEExAI: Benchmark to Evaluate Explainable AI [0.9176056742068812]
We propose BEExAI, a benchmark tool that allows large-scale comparison of different post-hoc XAI methods.
We argue that the need for a reliable way of measuring the quality and correctness of explanations is becoming critical.
arXiv Detail & Related papers (2024-07-29T11:21:17Z) - Are Objective Explanatory Evaluation metrics Trustworthy? An Adversarial Analysis [12.921307214813357]
The aim of the paper is to come up with a novel explanatory technique called SHAPE (SHifted Adversaries using Pixel Elimination).
We show that SHAPE is, in fact, an adversarial explanation that fools the causal metrics employed to measure the robustness and reliability of popular importance-based visual XAI methods.
arXiv Detail & Related papers (2024-06-12T02:39:46Z) - How much informative is your XAI? A decision-making assessment task to
objectively measure the goodness of explanations [53.01494092422942]
The number and complexity of personalised and user-centred approaches to XAI have rapidly grown in recent years.
It emerged that user-centred approaches to XAI positively affect the interaction between users and systems.
We propose an assessment task to objectively and quantitatively measure the goodness of XAI systems.
arXiv Detail & Related papers (2023-12-07T15:49:39Z) - Precise Benchmarking of Explainable AI Attribution Methods [0.0]
We propose a novel evaluation approach for benchmarking state-of-the-art XAI attribution methods.
Our proposal consists of a synthetic classification model accompanied by its derived ground truth explanations.
Our experimental results provide novel insights into the performance of Guided-Backprop and Smoothgrad XAI methods.
arXiv Detail & Related papers (2023-08-06T17:03:32Z) - Better Understanding Differences in Attribution Methods via Systematic Evaluations [57.35035463793008]
Post-hoc attribution methods have been proposed to identify image regions most influential to the models' decisions.
We propose three novel evaluation schemes to more reliably measure the faithfulness of those methods.
We use these evaluation schemes to study strengths and shortcomings of some widely used attribution methods over a wide range of models.
arXiv Detail & Related papers (2023-03-21T14:24:58Z) - Connecting Algorithmic Research and Usage Contexts: A Perspective of
Contextualized Evaluation for Explainable AI [65.44737844681256]
A lack of consensus on how to evaluate explainable AI (XAI) hinders the advancement of the field.
We argue that one way to close the gap is to develop evaluation methods that account for different user requirements.
arXiv Detail & Related papers (2022-06-22T05:17:33Z) - Towards Better Understanding Attribution Methods [77.1487219861185]
Post-hoc attribution methods have been proposed to identify image regions most influential to the models' decisions.
We propose three novel evaluation schemes to more reliably measure the faithfulness of those methods.
We also propose a post-processing smoothing step that significantly improves the performance of some attribution methods.
arXiv Detail & Related papers (2022-05-20T20:50:17Z) - Data Representing Ground-Truth Explanations to Evaluate XAI Methods [0.0]
Explainable artificial intelligence (XAI) methods are currently evaluated with approaches that mostly originated in interpretable machine learning (IML) research.
We propose to represent explanations with canonical equations that can be used to evaluate the accuracy of XAI methods.
arXiv Detail & Related papers (2020-11-18T16:54:53Z) - PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative
Dialogue Systems [48.99561874529323]
There are three kinds of automatic methods to evaluate open-domain generative dialogue systems.
Due to the lack of systematic comparison, it is not clear which kind of metrics are more effective.
We propose a novel and feasible learning-based metric that can significantly improve the correlation with human judgments.
arXiv Detail & Related papers (2020-04-06T04:36:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.