Towards a Comprehensive Human-Centred Evaluation Framework for
Explainable AI
- URL: http://arxiv.org/abs/2308.06274v1
- Date: Mon, 31 Jul 2023 09:20:16 GMT
- Title: Towards a Comprehensive Human-Centred Evaluation Framework for
Explainable AI
- Authors: Ivania Donoso-Guzm\'an, Jeroen Ooge, Denis Parra, Katrien Verbert
- Abstract summary: We propose to adapt the User-Centric Evaluation Framework used in recommender systems.
We integrate explanation aspects, summarise explanation properties, indicate relations between them, and categorise metrics that measure these properties.
- Score: 1.7222662622390634
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While research on explainable AI (XAI) is booming and explanation techniques
have proven promising in many application domains, standardised human-centred
evaluation procedures are still missing. In addition, current evaluation
procedures do not assess XAI methods holistically in the sense that they do not
treat explanations' effects on humans as a complex user experience. To tackle
this challenge, we propose to adapt the User-Centric Evaluation Framework used
in recommender systems: we integrate explanation aspects, summarise explanation
properties, indicate relations between them, and categorise metrics that
measure these properties. With this comprehensive evaluation framework, we hope
to contribute to the human-centred standardisation of XAI evaluation.
Related papers
- Dimensions of Generative AI Evaluation Design [51.541816010127256]
We propose a set of general dimensions that capture critical choices involved in GenAI evaluation design.
These dimensions include the evaluation setting, the task type, the input source, the interaction style, the duration, the metric type, and the scoring method.
arXiv Detail & Related papers (2024-11-19T18:25:30Z) - Hierarchical Evaluation Framework: Best Practices for Human Evaluation [17.91641890651225]
The absence of widely accepted human evaluation metrics in NLP hampers fair comparisons among different systems and the establishment of universal assessment standards.
We develop our own hierarchical evaluation framework to provide a more comprehensive representation of the NLP system's performance.
In future work, we will investigate the potential time-saving benefits of our proposed framework for evaluators assessing NLP systems.
arXiv Detail & Related papers (2023-10-03T09:46:02Z) - An Experimental Investigation into the Evaluation of Explainability
Methods [60.54170260771932]
This work compares 14 different metrics when applied to nine state-of-the-art XAI methods and three dummy methods (e.g., random saliency maps) used as references.
Experimental results show which of these metrics produces highly correlated results, indicating potential redundancy.
arXiv Detail & Related papers (2023-05-25T08:07:07Z) - A System's Approach Taxonomy for User-Centred XAI: A Survey [0.6882042556551609]
We propose a unified, inclusive and user-centred taxonomy for XAI based on the principles of General System's Theory.
This provides a basis for evaluating the appropriateness of XAI approaches for all user types, including both developers and end users.
arXiv Detail & Related papers (2023-03-06T00:50:23Z) - The Meta-Evaluation Problem in Explainable AI: Identifying Reliable
Estimators with MetaQuantus [10.135749005469686]
One of the unsolved challenges in the field of Explainable AI (XAI) is determining how to most reliably estimate the quality of an explanation method.
We address this issue through a meta-evaluation of different quality estimators in XAI.
Our novel framework, MetaQuantus, analyses two complementary performance characteristics of a quality estimator.
arXiv Detail & Related papers (2023-02-14T18:59:02Z) - Revisiting the Gold Standard: Grounding Summarization Evaluation with
Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z) - Towards Human Cognition Level-based Experiment Design for Counterfactual
Explanations (XAI) [68.8204255655161]
The emphasis of XAI research appears to have turned to a more pragmatic explanation approach for better understanding.
An extensive area where cognitive science research may substantially influence XAI advancements is evaluating user knowledge and feedback.
We propose a framework to experiment with generating and evaluating the explanations on the grounds of different cognitive levels of understanding.
arXiv Detail & Related papers (2022-10-31T19:20:22Z) - Connecting Algorithmic Research and Usage Contexts: A Perspective of
Contextualized Evaluation for Explainable AI [65.44737844681256]
A lack of consensus on how to evaluate explainable AI (XAI) hinders the advancement of the field.
We argue that one way to close the gap is to develop evaluation methods that account for different user requirements.
arXiv Detail & Related papers (2022-06-22T05:17:33Z) - From Anecdotal Evidence to Quantitative Evaluation Methods: A Systematic
Review on Evaluating Explainable AI [3.7592122147132776]
We identify 12 conceptual properties, such as Compactness and Correctness, that should be evaluated for comprehensively assessing the quality of an explanation.
We find that 1 in 3 papers evaluate exclusively with anecdotal evidence, and 1 in 5 papers evaluate with users.
This systematic collection of evaluation methods provides researchers and practitioners with concrete tools to thoroughly validate, benchmark and compare new and existing XAI methods.
arXiv Detail & Related papers (2022-01-20T13:23:20Z) - Crowdsourcing Evaluation of Saliency-based XAI Methods [18.18238526746074]
We propose a new human-based evaluation scheme using crowdsourcing to evaluate XAI methods.
Our method is inspired by a human computation game, "Peek-a-boom"
We evaluate the saliency maps of various XAI methods on two datasets with automated and crowd-based evaluation schemes.
arXiv Detail & Related papers (2021-06-27T17:37:53Z) - Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy
Evaluation Approach [84.02388020258141]
We propose a new framework named ENIGMA for estimating human evaluation scores based on off-policy evaluation in reinforcement learning.
ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation.
Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores.
arXiv Detail & Related papers (2021-02-20T03:29:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.