From Anecdotal Evidence to Quantitative Evaluation Methods: A Systematic
Review on Evaluating Explainable AI
- URL: http://arxiv.org/abs/2201.08164v1
- Date: Thu, 20 Jan 2022 13:23:20 GMT
- Title: From Anecdotal Evidence to Quantitative Evaluation Methods: A Systematic
Review on Evaluating Explainable AI
- Authors: Meike Nauta, Jan Trienes, Shreyasi Pathak, Elisa Nguyen, Michelle
Peters, Yasmin Schmitt, Jörg Schlötterer, Maurice van Keulen, Christin
Seifert
- Abstract summary: We identify 12 conceptual properties, such as Compactness and Correctness, that should be evaluated for comprehensively assessing the quality of an explanation.
We find that 1 in 3 papers evaluate exclusively with anecdotal evidence, and 1 in 5 papers evaluate with users.
This systematic collection of evaluation methods provides researchers and practitioners with concrete tools to thoroughly validate, benchmark and compare new and existing XAI methods.
- Score: 3.7592122147132776
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rising popularity of explainable artificial intelligence (XAI) for
understanding high-performing black boxes has also raised the question of how to
evaluate explanations of machine learning (ML) models. While interpretability and
explainability are often presented as a subjectively validated binary property, we
consider explainability a multi-faceted concept. We identify 12 conceptual
properties, such as Compactness and Correctness, that should be evaluated for
comprehensively assessing the quality of an explanation. Our so-called Co-12
properties serve as a categorization scheme for systematically reviewing the
evaluation practice of more than 300 papers published in the last 7 years at
major AI and ML conferences that introduce an XAI method. We find that 1 in 3
papers evaluate exclusively with anecdotal evidence, and 1 in 5 papers evaluate
with users. We also contribute to the call for objective, quantifiable
evaluation methods by presenting an extensive overview of quantitative XAI
evaluation methods. This systematic collection of evaluation methods provides
researchers and practitioners with concrete tools to thoroughly validate,
benchmark and compare new and existing XAI methods. This also opens up
opportunities to include quantitative metrics as optimization criteria during
model training in order to optimize for accuracy and interpretability
simultaneously.
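As a rough illustration of what quantitative explanation evaluation can look like, the sketch below computes two toy scores in the spirit of the Co-12 properties named above: Compactness (how concentrated an attribution vector is on its top-k features) and a single-deletion flavour of Correctness (whether removing the most-attributed feature changes the model output more than removing a random one). The `model`, `x`, and `attribution` objects are hypothetical placeholders; the paper surveys many concrete metrics rather than prescribing these particular ones.

```python
import numpy as np

def compactness(attribution, k=5):
    """Fraction of total attribution mass carried by the top-k features.
    Higher values mean a more compact (sparser) explanation."""
    a = np.abs(np.asarray(attribution, dtype=float))
    if a.sum() == 0:
        return 0.0
    return float(np.sort(a)[-k:].sum() / a.sum())

def single_deletion_correctness(model, x, attribution, baseline=0.0, seed=None):
    """Compare the output change from deleting the top-attributed feature
    with the change from deleting a randomly chosen feature. Positive values
    suggest the attribution points at a feature the model actually uses."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    a = np.abs(np.asarray(attribution, dtype=float))

    def deletion_effect(idx):
        x_del = x.copy()
        x_del[idx] = baseline          # "remove" the feature
        return abs(model(x) - model(x_del))

    top_idx = int(np.argmax(a))
    rand_idx = int(rng.integers(len(x)))
    return deletion_effect(top_idx) - deletion_effect(rand_idx)

# Toy usage with a linear "black box" whose weights double as its attribution:
model = lambda x: float(np.dot(x, [3.0, 0.1, -0.2, 0.05]))
x = np.array([1.0, 1.0, 1.0, 1.0])
attribution = np.array([3.0, 0.1, -0.2, 0.05])
print(compactness(attribution, k=1))
print(single_deletion_correctness(model, x, attribution, seed=0))
```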
Related papers
- BEExAI: Benchmark to Evaluate Explainable AI [0.9176056742068812]
We propose BEExAI, a benchmark tool that allows large-scale comparison of different post-hoc XAI methods.
We argue that the need for a reliable way of measuring the quality and correctness of explanations is becoming critical.
arXiv Detail & Related papers (2024-07-29T11:21:17Z)
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that fine-grained evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z)
- Explaining Explainability: Towards Deeper Actionable Insights into Deep Learning through Second-order Explainability [70.60433013657693]
Second-order explainable AI (SOXAI) was recently proposed to extend explainable AI (XAI) from the instance level to the dataset level.
We demonstrate for the first time, via example classification and segmentation cases, that eliminating irrelevant concepts from the training set based on actionable insights from SOXAI can enhance a model's performance.
arXiv Detail & Related papers (2023-06-14T23:24:01Z)
- An Experimental Investigation into the Evaluation of Explainability Methods [60.54170260771932]
This work compares 14 different metrics when applied to nine state-of-the-art XAI methods and three dummy methods (e.g., random saliency maps) used as references.
Experimental results show which of these metrics produces highly correlated results, indicating potential redundancy.
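A minimal sketch of how such redundancy between metrics can be probed, assuming a score matrix with one row per XAI method and one column per evaluation metric (all names and numbers below are made up for illustration, not taken from the paper): strongly correlated columns suggest the metrics capture overlapping aspects.

```python
import numpy as np
from scipy.stats import spearmanr

# Made-up scores: rows = XAI methods (including a dummy baseline), columns = metrics.
scores = np.array([
    [0.82, 0.79, 0.31, 0.65],
    [0.75, 0.72, 0.40, 0.60],
    [0.91, 0.88, 0.28, 0.70],
    [0.20, 0.25, 0.85, 0.30],   # e.g., a random-saliency dummy method
    [0.55, 0.58, 0.50, 0.52],
])
metric_names = ["del_auc", "ins_auc", "sparsity", "pointing_game"]

# Pairwise rank correlations between metric columns.
rho, _ = spearmanr(scores)
for i in range(len(metric_names)):
    for j in range(i + 1, len(metric_names)):
        print(f"{metric_names[i]} vs {metric_names[j]}: rho = {rho[i, j]:+.2f}")
```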
arXiv Detail & Related papers (2023-05-25T08:07:07Z)
- Better Understanding Differences in Attribution Methods via Systematic Evaluations [57.35035463793008]
Post-hoc attribution methods have been proposed to identify image regions most influential to the models' decisions.
We propose three novel evaluation schemes to more reliably measure the faithfulness of those methods.
We use these evaluation schemes to study strengths and shortcomings of some widely used attribution methods over a wide range of models.
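The paper's evaluation schemes are more elaborate, but a widely used way to quantify faithfulness of an attribution map is a deletion curve: progressively mask the most-attributed pixels and track how quickly the model's confidence drops. The sketch below assumes a hypothetical `predict_prob` wrapper around the model and is only meant to convey the idea, not the paper's exact procedure.

```python
import numpy as np

def deletion_curve_score(predict_prob, image, attribution, steps=20, baseline=0.0):
    """Mean confidence along a deletion curve (a simple stand-in for its AUC).

    predict_prob: hypothetical callable mapping an image array to the model's
                  probability for the originally predicted class.
    attribution:  per-pixel relevance scores with the same spatial shape.
    Lower scores mean confidence collapses quickly once the most-attributed
    pixels are masked, i.e. the attribution is more faithful to the model.
    """
    order = np.argsort(attribution.ravel())[::-1]          # most relevant first
    n = order.size
    probs = [predict_prob(image.astype(float))]            # 0% deleted
    for step in range(1, steps + 1):
        upto = int(n * step / steps)
        masked = image.astype(float)
        masked[np.unravel_index(order[:upto], attribution.shape)] = baseline
        probs.append(predict_prob(masked))
    return float(np.mean(probs))

# Toy usage: the "model" confidence is the mean of the top-left quadrant,
# and the attribution is (conveniently) the image itself.
rng = np.random.default_rng(0)
image = rng.random((8, 8))
predict_prob = lambda img: float(img[:4, :4].mean())
print(deletion_curve_score(predict_prob, image, attribution=image))
```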
arXiv Detail & Related papers (2023-03-21T14:24:58Z)
- The Meta-Evaluation Problem in Explainable AI: Identifying Reliable Estimators with MetaQuantus [10.135749005469686]
One of the unsolved challenges in the field of Explainable AI (XAI) is determining how to most reliably estimate the quality of an explanation method.
We address this issue through a meta-evaluation of different quality estimators in XAI.
Our novel framework, MetaQuantus, analyses two complementary performance characteristics of a quality estimator.
arXiv Detail & Related papers (2023-02-14T18:59:02Z)
- Crowdsourcing Evaluation of Saliency-based XAI Methods [18.18238526746074]
We propose a new human-based evaluation scheme using crowdsourcing to evaluate XAI methods.
Our method is inspired by a human computation game, "Peek-a-boom".
We evaluate the saliency maps of various XAI methods on two datasets with automated and crowd-based evaluation schemes.
arXiv Detail & Related papers (2021-06-27T17:37:53Z)
- Developing a Fidelity Evaluation Approach for Interpretable Machine Learning [2.2448567386846916]
Explainable AI (XAI) methods are used to improve the interpretability of complex models.
In particular, methods to evaluate the fidelity of the explanation to the underlying black box require further development.
Our evaluations suggest that the internal mechanism of the underlying predictive model, the internal mechanism of the explainability method used, and the complexity of the model and data all affect explanation fidelity.
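One common way to operationalise fidelity (sketched below under simplified assumptions; this is not the paper's exact procedure) is to train an interpretable surrogate on the black box's predictions and measure how often the two agree on held-out data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Black box" to be explained.
black_box = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Global surrogate trained to mimic the black box's predictions, not the true labels.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
surrogate.fit(X_train, black_box.predict(X_train))

# Fidelity: agreement between surrogate and black box on unseen data.
fidelity = np.mean(surrogate.predict(X_test) == black_box.predict(X_test))
print(f"surrogate-to-black-box fidelity: {fidelity:.3f}")
```

Deeper surrogates typically raise this agreement at the cost of interpretability, which is exactly the trade-off such fidelity evaluations are meant to expose.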
arXiv Detail & Related papers (2021-06-16T00:21:16Z)
- A Diagnostic Study of Explainability Techniques for Text Classification [52.879658637466605]
We develop a list of diagnostic properties for evaluating existing explainability techniques.
We compare the saliency scores assigned by the explainability techniques with human annotations of salient input regions to find relations between a model's performance and the agreement of its rationales with human ones.
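A minimal sketch of the agreement part of such an analysis, assuming hypothetical per-token saliency scores and a binary human rationale mask (1 = salient): token-level F1 between the technique's top-ranked tokens and the human annotation.

```python
def rationale_f1(saliency_scores, human_mask, k=None):
    """Token-level F1 between the technique's top-k salient tokens and a
    human rationale mask (both hypothetical inputs, one value per token)."""
    if k is None:
        k = sum(human_mask)                       # match the human rationale length
    ranked = sorted(range(len(saliency_scores)),
                    key=lambda i: saliency_scores[i], reverse=True)
    predicted = set(ranked[:k])
    gold = {i for i, m in enumerate(human_mask) if m == 1}
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Example: saliency over 6 tokens vs. a human annotation marking tokens 1 and 2.
print(rationale_f1([0.1, 0.9, 0.7, 0.05, 0.2, 0.0], [0, 1, 1, 0, 0, 0]))
```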
arXiv Detail & Related papers (2020-09-25T12:01:53Z)
- PONE: A Novel Automatic Evaluation Metric for Open-Domain Generative Dialogue Systems [48.99561874529323]
There are three kinds of automatic methods for evaluating open-domain generative dialogue systems.
Due to the lack of systematic comparison, it is not clear which kind of metric is more effective.
We propose a novel and feasible learning-based metric that can significantly improve the correlation with human judgments.
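Such correlation claims are typically checked by scoring the same set of generated responses with both the automatic metric and human raters and then correlating the two; the sketch below uses made-up numbers and is not PONE's actual evaluation code.

```python
from scipy.stats import pearsonr, spearmanr

# Made-up scores for ten generated dialogue responses.
metric_scores = [0.12, 0.45, 0.30, 0.80, 0.55, 0.20, 0.67, 0.90, 0.40, 0.75]
human_ratings = [1, 3, 2, 5, 3, 1, 4, 5, 2, 4]

pearson_r, _ = pearsonr(metric_scores, human_ratings)
spearman_rho, _ = spearmanr(metric_scores, human_ratings)
print(f"Pearson r = {pearson_r:.2f}, Spearman rho = {spearman_rho:.2f}")
```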
arXiv Detail & Related papers (2020-04-06T04:36:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.