Crowdsourcing Evaluation of Saliency-based XAI Methods
- URL: http://arxiv.org/abs/2107.00456v1
- Date: Sun, 27 Jun 2021 17:37:53 GMT
- Title: Crowdsourcing Evaluation of Saliency-based XAI Methods
- Authors: Xiaotian Lu, Arseny Tolmachev, Tatsuya Yamamoto, Koh Takeuchi, Seiji
Okajima, Tomoyoshi Takebayashi, Koji Maruhashi, Hisashi Kashima
- Abstract summary: We propose a new human-based evaluation scheme using crowdsourcing to evaluate XAI methods.
Our method is inspired by a human computation game, "Peek-a-boom".
We evaluate the saliency maps of various XAI methods on two datasets with automated and crowd-based evaluation schemes.
- Score: 18.18238526746074
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding the reasons behind the predictions made by deep neural networks
is critical for gaining human trust in many important applications, which is
reflected in the increasing demand for explainability in AI (XAI) in recent
years. Saliency-based feature attribution methods, which highlight important
parts of images that contribute to decisions by classifiers, are often used as
XAI methods, especially in the field of computer vision. In order to compare
various saliency-based XAI methods quantitatively, several approaches for
automated evaluation schemes have been proposed; however, there is no guarantee
that such automated evaluation metrics correctly evaluate explainability, and a
high rating by an automated evaluation scheme does not necessarily mean a high
explainability for humans. In this study, instead of automated evaluation,
we propose a new human-based evaluation scheme using crowdsourcing to evaluate
XAI methods. Our method is inspired by a human computation game, "Peek-a-boom",
and can efficiently compare different XAI methods by exploiting the power of
crowds. We evaluate the saliency maps of various XAI methods on two datasets
with automated and crowd-based evaluation schemes. Our experiments show that
the results of our crowd-based evaluation scheme differ from those of
automated evaluation schemes. In addition, we regard the crowd-based evaluation
results as ground truths and provide a quantitative performance measure to
compare different automated evaluation schemes. We also discuss the impact of
crowd workers on the results and show that the varying ability of crowd workers
does not significantly impact the results.
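The last comparison in the abstract can be illustrated with a short sketch. The snippet below is not from the paper: the XAI method names, the two automated scheme names ("deletion_auc", "insertion_auc"), and all scores are hypothetical placeholders. It simply treats the crowd-based scores as ground truth and rates each automated evaluation scheme by the Spearman rank correlation between its ranking of XAI methods and the crowd ranking, assuming SciPy is available.

```python
# Minimal sketch (not the authors' code): treat crowd-based evaluation results
# as ground truth and score automated evaluation schemes by how well their
# rankings of XAI methods agree with the crowd ranking.
# All method names and scores below are hypothetical placeholders.
from scipy.stats import spearmanr

# Hypothetical per-method scores (higher = better explanation quality).
crowd_scores = {  # crowd-based evaluation, used as ground truth
    "GradCAM": 0.81, "LIME": 0.74, "IntegratedGradients": 0.69, "Saliency": 0.55,
}
automated_scores = {  # two hypothetical automated evaluation schemes
    "deletion_auc": {"GradCAM": 0.62, "LIME": 0.70, "IntegratedGradients": 0.58, "Saliency": 0.40},
    "insertion_auc": {"GradCAM": 0.77, "LIME": 0.66, "IntegratedGradients": 0.64, "Saliency": 0.51},
}

methods = sorted(crowd_scores)                 # fixed ordering of XAI methods
crowd = [crowd_scores[m] for m in methods]

for scheme, scores in automated_scores.items():
    auto = [scores[m] for m in methods]
    rho, p = spearmanr(crowd, auto)            # rank agreement with the crowd ranking
    print(f"{scheme}: Spearman rho = {rho:.2f} (p = {p:.2f})")
```

Under this reading, a higher rank correlation would indicate that an automated scheme better reflects human-perceived explainability, which is the kind of quantitative performance measure for automated schemes that the abstract describes.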
Related papers
- Navigating the Maze of Explainable AI: A Systematic Approach to Evaluating Methods and Metrics [10.045644410833402]
We introduce LATEC, a large-scale benchmark that critically evaluates 17 prominent XAI methods using 20 distinct metrics.
We showcase the high risk of conflicting metrics leading to unreliable rankings and consequently propose a more robust evaluation scheme.
LATEC reinforces its role in future XAI research by publicly releasing all 326k saliency maps and 378k metric scores as a (meta-evaluation) dataset.
arXiv Detail & Related papers (2024-09-25T09:07:46Z) - How much informative is your XAI? A decision-making assessment task to
objectively measure the goodness of explanations [53.01494092422942]
The number and complexity of personalised and user-centred approaches to XAI have rapidly grown in recent years.
It emerged that user-centred approaches to XAI positively affect the interaction between users and systems.
We propose an assessment task to objectively and quantitatively measure the goodness of XAI systems.
arXiv Detail & Related papers (2023-12-07T15:49:39Z) - From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing.
This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time.
We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
arXiv Detail & Related papers (2023-06-18T09:54:33Z) - An Experimental Investigation into the Evaluation of Explainability
Methods [60.54170260771932]
This work compares 14 different metrics when applied to nine state-of-the-art XAI methods and three dummy methods (e.g., random saliency maps) used as references.
Experimental results show which of these metrics produce highly correlated results, indicating potential redundancy.
arXiv Detail & Related papers (2023-05-25T08:07:07Z) - Better Understanding Differences in Attribution Methods via Systematic Evaluations [57.35035463793008]
Post-hoc attribution methods have been proposed to identify image regions most influential to the models' decisions.
We propose three novel evaluation schemes to more reliably measure the faithfulness of those methods.
We use these evaluation schemes to study strengths and shortcomings of some widely used attribution methods over a wide range of models.
arXiv Detail & Related papers (2023-03-21T14:24:58Z) - Towards Better Understanding Attribution Methods [77.1487219861185]
Post-hoc attribution methods have been proposed to identify image regions most influential to the models' decisions.
We propose three novel evaluation schemes to more reliably measure the faithfulness of those methods.
We also propose a post-processing smoothing step that significantly improves the performance of some attribution methods.
arXiv Detail & Related papers (2022-05-20T20:50:17Z) - From Anecdotal Evidence to Quantitative Evaluation Methods: A Systematic
Review on Evaluating Explainable AI [3.7592122147132776]
We identify 12 conceptual properties, such as Compactness and Correctness, that should be evaluated for comprehensively assessing the quality of an explanation.
We find that 1 in 3 papers evaluate exclusively with anecdotal evidence, and 1 in 5 papers evaluate with users.
This systematic collection of evaluation methods provides researchers and practitioners with concrete tools to thoroughly validate, benchmark and compare new and existing XAI methods.
arXiv Detail & Related papers (2022-01-20T13:23:20Z) - Dynamic Human Evaluation for Relative Model Comparisons [8.843915018287476]
We present a dynamic approach to measure the required number of human annotations when evaluating generated outputs in relative comparison settings.
We propose an agent-based framework of human evaluation to assess multiple labelling strategies and methods to decide the better model in a simulation and a crowdsourcing case study.
arXiv Detail & Related papers (2021-12-15T11:32:13Z) - Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy
Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z) - Proxy Tasks and Subjective Measures Can Be Misleading in Evaluating
Explainable AI Systems [14.940404609343432]
We evaluate two currently common techniques for evaluating XAI systems.
We show that evaluations with proxy tasks did not predict the results of the evaluations with the actual decision-making tasks.
Our results suggest that by employing misleading evaluation methods, our field may be inadvertently slowing its progress toward developing human+AI teams that can reliably perform better than humans or AIs alone.
arXiv Detail & Related papers (2020-01-22T22:14:28Z)