A Systematic Review of User-Centred Evaluation of Explainable AI in Healthcare
- URL: http://arxiv.org/abs/2506.13904v1
- Date: Mon, 16 Jun 2025 18:30:00 GMT
- Title: A Systematic Review of User-Centred Evaluation of Explainable AI in Healthcare
- Authors: Ivania Donoso-Guzmán, Kristýna Sirka Kacafírková, Maxwell Szymanski, An Jacobs, Denis Parra, Katrien Verbert
- Abstract summary: This study aims to develop a framework of well-defined, atomic properties that characterise the user experience of XAI in healthcare. We also provide context-sensitive guidelines for defining evaluation strategies based on system characteristics.
- Score: 1.57531613028502
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite promising developments in Explainable Artificial Intelligence, the practical value of XAI methods remains under-explored and insufficiently validated in real-world settings. Robust and context-aware evaluation is essential, not only to produce understandable explanations but also to ensure their trustworthiness and usability for intended users; yet such evaluation tends to be overlooked because there are no clear guidelines on how to design studies with users. This study addresses this gap with two main goals: (1) to develop a framework of well-defined, atomic properties that characterise the user experience of XAI in healthcare; and (2) to provide clear, context-sensitive guidelines for defining evaluation strategies based on system characteristics. We conducted a systematic review of 82 user studies, sourced from five databases, all situated within healthcare settings and focused on evaluating AI-generated explanations. The analysis was guided by a predefined coding scheme informed by an existing evaluation framework, complemented by inductive codes developed iteratively. The review yields three key contributions: (1) a synthesis of current evaluation practices, highlighting a growing focus on human-centred approaches in healthcare XAI; (2) insights into the interrelations among explanation properties; and (3) an updated framework and a set of actionable guidelines to support interdisciplinary teams in designing and implementing effective evaluation strategies for XAI systems tailored to specific application contexts.
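To make the idea of coding user studies against atomic explanation properties concrete, here is a minimal, hypothetical Python sketch of such a coding scheme. The property names, the `ExplanationProperty` and `EvaluationMethod` enums, and the `UserStudyRecord` class are illustrative assumptions, not the property set or coding scheme defined in the paper.

```python
# Hypothetical sketch: representing coded user studies from a systematic
# review as a small data model. Property/method names are illustrative only.
from dataclasses import dataclass, field
from enum import Enum


class ExplanationProperty(Enum):
    UNDERSTANDABILITY = "understandability"
    TRUST = "trust"
    USEFULNESS = "usefulness"
    SATISFACTION = "satisfaction"


class EvaluationMethod(Enum):
    QUESTIONNAIRE = "questionnaire"
    INTERVIEW = "interview"
    TASK_PERFORMANCE = "task performance"
    LOG_ANALYSIS = "log analysis"


@dataclass
class UserStudyRecord:
    """One coded user study: domain, participants, and what was evaluated how."""
    study_id: str
    clinical_domain: str                      # e.g. radiology, oncology
    participants: str                         # e.g. clinicians, patients
    properties: list[ExplanationProperty] = field(default_factory=list)
    methods: list[EvaluationMethod] = field(default_factory=list)


def studies_measuring(records: list[UserStudyRecord],
                      prop: ExplanationProperty) -> list[UserStudyRecord]:
    """Filter coded studies by the explanation property they evaluate."""
    return [r for r in records if prop in r.properties]


if __name__ == "__main__":
    example = UserStudyRecord(
        study_id="S01",
        clinical_domain="radiology",
        participants="clinicians",
        properties=[ExplanationProperty.TRUST, ExplanationProperty.UNDERSTANDABILITY],
        methods=[EvaluationMethod.QUESTIONNAIRE, EvaluationMethod.INTERVIEW],
    )
    print(studies_measuring([example], ExplanationProperty.TRUST))
```

A structure like this makes it straightforward to aggregate which properties are measured, with which methods, across studies; the review itself defines the actual properties and guidelines.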
Related papers
- Medical Reasoning in the Era of LLMs: A Systematic Review of Enhancement Techniques and Applications [59.721265428780946]
Large Language Models (LLMs) in medicine have enabled impressive capabilities, yet a critical gap remains in their ability to perform systematic, transparent, and verifiable reasoning. This paper provides the first systematic review of this emerging field. We propose a taxonomy of reasoning enhancement techniques, categorized into training-time strategies and test-time mechanisms.
arXiv Detail & Related papers (2025-08-01T14:41:31Z) - Developing and Maintaining an Open-Source Repository of AI Evaluations: Challenges and Insights [44.99833362998488]
This paper presents practical insights from eight months of maintaining $_evals$, an open-source repository of 70+ community-contributed AI evaluations. We identify key challenges in implementing and maintaining AI evaluations and develop solutions.
arXiv Detail & Related papers (2025-07-09T14:30:45Z) - A Conceptual Framework for AI Capability Evaluations [0.0]
We propose a conceptual framework for analyzing AI capability evaluations. It offers a structured, descriptive approach that systematizes the analysis of widely used methods and terminology. It also enables researchers to identify methodological weaknesses, assists practitioners in designing evaluations, and provides policymakers with a tool to scrutinize, compare, and navigate complex evaluation landscapes.
arXiv Detail & Related papers (2025-06-23T00:19:27Z) - Evaluating Explainability: A Framework for Systematic Assessment and Reporting of Explainable AI Features [2.4458403938995064]
We propose a framework to assess and report explainable AI features. Our evaluation framework is based on four criteria: 1) Consistency quantifies the variability of explanations to similar inputs, 2) Plausibility estimates how close the explanation is to the ground truth, 3) Fidelity assesses the alignment between the explanation and the model's internal mechanisms, and 4) Usefulness evaluates the impact on task performance. (A minimal illustrative sketch of the Consistency criterion appears after this list.)
arXiv Detail & Related papers (2025-06-16T18:51:46Z) - SPHERE: An Evaluation Card for Human-AI Systems [75.0887588648484]
We present an evaluation card, SPHERE, which encompasses five key dimensions. We conduct a review of 39 human-AI systems using SPHERE, outlining current evaluation practices and areas for improvement.
arXiv Detail & Related papers (2025-03-24T20:17:20Z) - Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases annotated with reasoning references. We propose a framework encompassing three critical stages: examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey. Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
arXiv Detail & Related papers (2025-03-06T18:35:39Z) - Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework [61.38174427966444]
Large Language Models (LLMs) are increasingly used for automated evaluation in various scenarios. Previous studies have attempted to fine-tune open-source LLMs to replicate the evaluation explanations and judgments of powerful proprietary models. We propose a novel evaluation framework, ARJudge, that adaptively formulates evaluation criteria and synthesizes both text-based and code-driven analyses.
arXiv Detail & Related papers (2025-02-26T06:31:45Z) - A Unified Framework for Evaluating the Effectiveness and Enhancing the Transparency of Explainable AI Methods in Real-World Applications [2.0681376988193843]
"Black box" characteristic of AI models constrains interpretability, transparency, and reliability.<n>This study presents a unified XAI evaluation framework to evaluate correctness, interpretability, robustness, fairness, and completeness of explanations generated by AI models.
arXiv Detail & Related papers (2024-12-05T05:30:10Z) - Towards a Comprehensive Human-Centred Evaluation Framework for Explainable AI [1.7222662622390634]
We propose to adapt the User-Centric Evaluation Framework used in recommender systems.
We integrate explanation aspects, summarise explanation properties, indicate relations between them, and categorise metrics that measure these properties.
arXiv Detail & Related papers (2023-07-31T09:20:16Z) - The Meta-Evaluation Problem in Explainable AI: Identifying Reliable Estimators with MetaQuantus [10.135749005469686]
One of the unsolved challenges in the field of Explainable AI (XAI) is determining how to most reliably estimate the quality of an explanation method.
We address this issue through a meta-evaluation of different quality estimators in XAI.
Our novel framework, MetaQuantus, analyses two complementary performance characteristics of a quality estimator.
arXiv Detail & Related papers (2023-02-14T18:59:02Z) - Connecting Algorithmic Research and Usage Contexts: A Perspective of Contextualized Evaluation for Explainable AI [65.44737844681256]
A lack of consensus on how to evaluate explainable AI (XAI) hinders the advancement of the field.
We argue that one way to close the gap is to develop evaluation methods that account for different user requirements.
arXiv Detail & Related papers (2022-06-22T05:17:33Z) - From Anecdotal Evidence to Quantitative Evaluation Methods: A Systematic Review on Evaluating Explainable AI [3.7592122147132776]
We identify 12 conceptual properties, such as Compactness and Correctness, that should be evaluated for comprehensively assessing the quality of an explanation.
We find that 1 in 3 papers evaluate exclusively with anecdotal evidence, and 1 in 5 papers evaluate with users.
This systematic collection of evaluation methods provides researchers and practitioners with concrete tools to thoroughly validate, benchmark and compare new and existing XAI methods.
arXiv Detail & Related papers (2022-01-20T13:23:20Z) - Opportunities of a Machine Learning-based Decision Support System for Stroke Rehabilitation Assessment [64.52563354823711]
Rehabilitation assessment is critical to determine an adequate intervention for a patient.
Current assessment practices rely mainly on the therapist's experience, and assessments are performed infrequently due to the limited availability of therapists.
We developed an intelligent decision support system that can identify salient features of assessment using reinforcement learning.
arXiv Detail & Related papers (2020-02-27T17:04:07Z)
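As referenced in the "Evaluating Explainability" entry above, the sketch below shows one way the Consistency criterion (variability of explanations across similar inputs) might be operationalised. The `explain_fn` placeholder, the Gaussian perturbation scheme, and the cosine-similarity aggregation are assumptions for illustration, not the metric implementation from that paper.

```python
# Hypothetical sketch of a Consistency score: how similar are explanations
# produced for slightly perturbed ("similar") inputs? Higher is more consistent.
import numpy as np


def consistency_score(explain_fn, x: np.ndarray,
                      n_perturbations: int = 10,
                      noise_scale: float = 0.01,
                      seed: int = 0) -> float:
    """Mean pairwise cosine similarity between explanations of perturbed inputs.

    `explain_fn` is any attribution method mapping an input to an explanation
    array (e.g. a saliency map). Returns a value in [-1, 1].
    """
    rng = np.random.default_rng(seed)
    explanations = []
    for _ in range(n_perturbations):
        x_perturbed = x + rng.normal(0.0, noise_scale, size=x.shape)
        e = np.asarray(explain_fn(x_perturbed), dtype=float).ravel()
        explanations.append(e / (np.linalg.norm(e) + 1e-12))
    explanations = np.stack(explanations)
    # Average cosine similarity over all distinct pairs of explanations.
    sims = explanations @ explanations.T
    upper = sims[np.triu_indices(n_perturbations, k=1)]
    return float(upper.mean())


if __name__ == "__main__":
    # Toy "explainer": weights of a fixed linear model, so explanations are
    # identical for every input and the score is maximal (close to 1.0).
    weights = np.array([0.5, -0.2, 0.8])
    print(consistency_score(lambda x: weights, np.array([1.0, 2.0, 3.0])))
```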
This list is automatically generated from the titles and abstracts of the papers on this site.