Do We Need Another Explainable AI Method? Toward Unifying Post-hoc XAI
Evaluation Methods into an Interactive and Multi-dimensional Benchmark
- URL: http://arxiv.org/abs/2207.14160v2
- Date: Tue, 4 Oct 2022 10:45:23 GMT
- Title: Do We Need Another Explainable AI Method? Toward Unifying Post-hoc XAI
Evaluation Methods into an Interactive and Multi-dimensional Benchmark
- Authors: Mohamed Karim Belaid, Eyke Hüllermeier, Maximilian Rabus, Ralf Krestel
- Abstract summary: We propose Compare-xAI, a benchmark that unifies all exclusive functional testing methods applied to xAI algorithms.
The benchmark encapsulates the complexity of evaluating xAI methods into a hierarchical scoring of three levels.
The interactive user interface helps mitigate errors in interpreting xAI results.
- Score: 6.511859672210113
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In recent years, Explainable AI (xAI) has attracted considerable
attention as various countries turned explanations into a legal right. xAI
allows for improving models beyond the accuracy metric by, e.g., debugging the
learned patterns and demystifying the AI's behavior. The widespread use of xAI
brought new challenges. On the one hand, the number of published xAI
algorithms boomed, making it difficult for practitioners to select the right
tool. On the other hand, experiments have highlighted how easily data
scientists can misuse xAI algorithms and misinterpret their results. To tackle
the issue of comparing and correctly using feature importance xAI algorithms,
we propose Compare-xAI, a benchmark that unifies all exclusive functional
testing methods applied to xAI algorithms. We propose a selection protocol to
shortlist non-redundant functional tests from the literature, i.e., each
targeting a specific end-user requirement in explaining a model. The benchmark
encapsulates the complexity of evaluating xAI methods into a three-level
hierarchical scoring, targeting three end-user groups: researchers,
practitioners, and laymen in xAI. The most detailed level provides one score
per test. The second level regroups the tests into five categories (fidelity,
fragility, stability, simplicity, and stress tests). The last level is the
aggregated comprehensibility score, which encapsulates the ease of correctly
interpreting the algorithm's output in one easy-to-compare value.
Compare-xAI's interactive user interface helps mitigate errors in interpreting
xAI results by quickly listing the recommended xAI solutions for each ML task
and their current limitations. The benchmark is available at
https://karim-53.github.io/cxai/
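To make the three-level scoring concrete, here is a minimal Python sketch of how per-test scores could roll up into the five category scores and a single comprehensibility value. The test names, score values, and the unweighted averaging are illustrative assumptions; the actual tests and aggregation rules are defined by the Compare-xAI benchmark itself.

```python
from statistics import mean

# Level 1: one score per test for a single xAI method.
# Test names and values are hypothetical, not taken from Compare-xAI.
test_scores = {
    "fidelity":   {"cough_and_fever": 1.0, "linear_additivity": 0.8},
    "fragility":  {"adversarial_perturbation": 0.9},
    "stability":  {"feature_reordering": 0.7},
    "simplicity": {"duplicate_features": 0.6},
    "stress":     {"high_dim_inputs": 0.5},
}

# Level 2: aggregate the per-test scores into the five category scores.
category_scores = {cat: mean(scores.values()) for cat, scores in test_scores.items()}

# Level 3: collapse the categories into one comprehensibility score.
# A plain average is assumed here; the benchmark may weight categories differently.
comprehensibility = mean(category_scores.values())

print(category_scores)
print(f"comprehensibility: {comprehensibility:.2f}")
```

Each level trades detail for ease of comparison: researchers can inspect the per-test scores, practitioners the category scores, and laymen the single aggregated value.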
Related papers
- Precise Benchmarking of Explainable AI Attribution Methods [0.0]
We propose a novel evaluation approach for benchmarking state-of-the-art XAI attribution methods.
Our proposal consists of a synthetic classification model accompanied by its derived ground truth explanations.
Our experimental results provide novel insights into the performance of the Guided-Backprop and SmoothGrad XAI methods.
arXiv Detail & Related papers (2023-08-06T17:03:32Z) - Strategies to exploit XAI to improve classification systems [0.0]
XAI aims to provide insights into the decision-making process of AI models, allowing users to understand the results beyond the decisions themselves.
Most XAI literature focuses on how to explain an AI system, while less attention has been given to how XAI methods can be exploited to improve an AI system.
arXiv Detail & Related papers (2023-06-09T10:38:26Z) - An Experimental Investigation into the Evaluation of Explainability
Methods [60.54170260771932]
This work compares 14 different metrics when applied to nine state-of-the-art XAI methods and three dummy methods (e.g., random saliency maps) used as references.
Experimental results show which of these metrics produces highly correlated results, indicating potential redundancy.
arXiv Detail & Related papers (2023-05-25T08:07:07Z) - A Gold Standard Dataset for the Reviewer Assignment Problem [117.59690218507565]
"Similarity score" is a numerical estimate of the expertise of a reviewer in reviewing a paper.
Our dataset consists of 477 self-reported expertise scores provided by 58 researchers.
For the task of ordering two papers in terms of their relevance for a reviewer, the error rates range from 12%-30% in easy cases to 36%-43% in hard cases.
arXiv Detail & Related papers (2023-03-23T16:15:03Z) - Understanding User Preferences in Explainable Artificial Intelligence: A Survey and a Mapping Function Proposal [0.0]
This study conducts a thorough review of extant research in Explainable Machine Learning (XML).
Our main objective is to offer a classification of XAI methods within the realm of XML.
We propose a mapping function that takes into account users and their desired properties and suggests an XAI method to them.
arXiv Detail & Related papers (2023-02-07T01:06:38Z) - Optimizing Explanations by Network Canonization and Hyperparameter
Search [74.76732413972005]
Rule-based and modified backpropagation XAI approaches often face challenges when being applied to modern model architectures.
Model canonization is the process of re-structuring the model to disregard problematic components without changing the underlying function.
In this work, we propose canonizations for currently relevant model blocks applicable to popular deep neural network architectures.
arXiv Detail & Related papers (2022-11-30T17:17:55Z) - Towards Better Out-of-Distribution Generalization of Neural Algorithmic
Reasoning Tasks [51.8723187709964]
We study the OOD generalization of neural algorithmic reasoning tasks.
The goal is to learn an algorithm from input-output pairs using deep neural networks.
arXiv Detail & Related papers (2022-11-01T18:33:20Z) - Responsibility: An Example-based Explainable AI approach via Training
Process Inspection [1.4610038284393165]
We present a novel XAI approach that identifies the most responsible training example for a particular decision.
This example can then be shown as an explanation: "this is what I (the AI) learned that led me to do that".
Our results demonstrate that responsibility can help improve accuracy for both human end users and secondary ML models.
arXiv Detail & Related papers (2022-09-07T19:30:01Z) - Explaining Any ML Model? -- On Goals and Capabilities of XAI [2.236663830879273]
We argue that the goals and capabilities of XAI algorithms are far from being well understood.
We show that users can ask diverse questions, but that only one of them can be answered by current XAI algorithms.
arXiv Detail & Related papers (2022-06-28T11:09:33Z) - Connecting Algorithmic Research and Usage Contexts: A Perspective of
Contextualized Evaluation for Explainable AI [65.44737844681256]
A lack of consensus on how to evaluate explainable AI (XAI) hinders the advancement of the field.
We argue that one way to close the gap is to develop evaluation methods that account for different user requirements.
arXiv Detail & Related papers (2022-06-22T05:17:33Z) - A User-Centred Framework for Explainable Artificial Intelligence in
Human-Robot Interaction [70.11080854486953]
We propose a user-centred framework for XAI that focuses on its social-interactive aspect.
The framework aims to provide a structure for interactive XAI solutions thought for non-expert users.
arXiv Detail & Related papers (2021-09-27T09:56:23Z)