Related papers: Recommendations on test datasets for evaluating AI solutions in pathology

Recommendations on test datasets for evaluating AI solutions in pathology

URL: http://arxiv.org/abs/2204.14226v1
Date: Thu, 21 Apr 2022 14:52:47 GMT
Title: Recommendations on test datasets for evaluating AI solutions in pathology
Authors: Andr\'e Homeyer, Christian Gei{\ss}ler, Lars Ole Schwen, Falk Zakrzewski, Theodore Evans, Klaus Strohmenger, Max Westphal, Roman David B\"ulow, Michaela Kargl, Aray Karjauv, Isidre Munn\'e-Bertran, Carl Orge Retzlaff, Adri\`a Romero-L\'opez, Tomasz So{\l}tysi\'nski, Markus Plass, Rita Carvalho, Peter Steinbach, Yu-Chia Lan, Nassim Bouteldja, David Haber, Mateo Rojas-Carulla, Alireza Vafaei Sadr, Matthias Kraft, Daniel Kr\"uger, Rutger Fick, Tobias Lang, Peter Boor, Heimo M\"uller, Peter Hufnagl, Norman Zerbe
Abstract summary: AI solutions that automatically extract information from digital histology images have shown great promise for improving pathological diagnosis. Prior to routine use, it is important to evaluate their predictive performance and obtain regulatory approval. A committee of various stakeholders, including commercial AI developers, pathologists, and researchers, discussed key aspects and conducted extensive literature reviews on test datasets in pathology.
Score: 2.001521933638504
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Artificial intelligence (AI) solutions that automatically extract information from digital histology images have shown great promise for improving pathological diagnosis. Prior to routine use, it is important to evaluate their predictive performance and obtain regulatory approval. This assessment requires appropriate test datasets. However, compiling such datasets is challenging and specific recommendations are missing. A committee of various stakeholders, including commercial AI developers, pathologists, and researchers, discussed key aspects and conducted extensive literature reviews on test datasets in pathology. Here, we summarize the results and derive general recommendations for the collection of test datasets. We address several questions: Which and how many images are needed? How to deal with low-prevalence subsets? How can potential bias be detected? How should datasets be reported? What are the regulatory requirements in different countries? The recommendations are intended to help AI developers demonstrate the utility of their products and to help regulatory agencies and end users verify reported performance measures. Further research is needed to formulate criteria for sufficiently representative test datasets so that AI solutions can operate with less user intervention and better support diagnostic workflows in the future.

Related papers

Medical Data Pecking: A Context-Aware Approach for Automated Quality Evaluation of Structured Medical Data [5.681039620785591]
EHR data often contain significant quality issues, including misrepresentations of subpopulations, biases, and systematic errors.<n> Existing quality assessment methods remain insufficient, lacking systematic procedures to assess data fitness for research.<n>We present the Medical Data Pecking approach, which adapts unit testing and coverage concepts from software engineering to identify data quality concerns.
arXiv Detail & Related papers (2025-07-03T13:54:50Z)
RAID: A Dataset for Testing the Adversarial Robustness of AI-Generated Image Detectors [57.81012948133832]
We present RAID (Robust evaluation of AI-generated image Detectors), a dataset of 72k diverse and highly transferable adversarial examples.<n>Our methodology generates adversarial images that transfer with a high success rate to unseen detectors.<n>Our findings indicate that current state-of-the-art AI-generated image detectors can be easily deceived by adversarial examples.
arXiv Detail & Related papers (2025-06-04T14:16:00Z)
TestAgent: An Adaptive and Intelligent Expert for Human Assessment [62.060118490577366]
We propose TestAgent, a large language model (LLM)-powered agent designed to enhance adaptive testing through interactive engagement.<n>TestAgent supports personalized question selection, captures test-takers' responses and anomalies, and provides precise outcomes through dynamic, conversational interactions.
arXiv Detail & Related papers (2025-06-03T16:07:54Z)
Detecting Dataset Bias in Medical AI: A Generalized and Modality-Agnostic Auditing Framework [8.017827642932746]
Generalized Attribute Utility and Detectability-Induced bias Testing (G-AUDIT) for datasets is a modality-agnostic dataset auditing framework.<n>Our method examines the relationship between task-level annotations and data properties including patient attributes.<n>G-AUDIT successfully identifies subtle biases commonly overlooked by traditional qualitative methods.
arXiv Detail & Related papers (2025-03-13T02:16:48Z)
Datasheets for Healthcare AI: A Framework for Transparency and Bias Mitigation [0.0]
Bias, data incompleteness, and inaccuracies in training datasets can lead to unfair outcomes and amplify existing disparities. We propose a dataset documentation framework that promotes transparency and ensures alignment with regulatory requirements. The findings emphasise the importance of dataset documentation in fostering responsible AI development.
arXiv Detail & Related papers (2025-01-09T23:36:34Z)
A Classification Benchmark for Artificial Intelligence Detection of Laryngeal Cancer from Patient Speech [0.30723404270319693]
Current diagnostic pathways cause many patients to be incorrectly referred to urgent suspected cancer pathways. Artificial intelligence offers a promising solution by enabling non-invasive detection of laryngeal cancer from patient speech. This work introduces a benchmark suite comprising 36 models trained and evaluated on open-source datasets.
arXiv Detail & Related papers (2024-12-20T10:34:03Z)
AI in radiological imaging of soft-tissue and bone tumours: a systematic review evaluating against CLAIM and FUTURE-AI guidelines [1.5332408886895255]
Soft-tissue and bone tumours (STBT) are rare, diagnostically challenging lesions with variable clinical behaviours and treatment approaches. This systematic review provides an overview of Artificial Intelligence (AI) methods using radiological imaging for diagnosis and prognosis of these tumours.
arXiv Detail & Related papers (2024-08-22T15:31:48Z)
TrialBench: Multi-Modal Artificial Intelligence-Ready Clinical Trial Datasets [57.067409211231244]
This paper presents meticulously curated AIready datasets covering multi-modal data (e.g., drug molecule, disease code, text, categorical/numerical features) and 8 crucial prediction challenges in clinical trial design. We provide basic validation methods for each task to ensure the datasets' usability and reliability. We anticipate that the availability of such open-access datasets will catalyze the development of advanced AI approaches for clinical trial design.
arXiv Detail & Related papers (2024-06-30T09:13:10Z)
A Comprehensive Library for Benchmarking Multi-class Visual Anomaly Detection [52.228708947607636]
This paper introduces a comprehensive visual anomaly detection benchmark, ADer, which is a modular framework for new methods. The benchmark includes multiple datasets from industrial and medical domains, implementing fifteen state-of-the-art methods and nine comprehensive metrics. We objectively reveal the strengths and weaknesses of different methods and provide insights into the challenges and future directions of multi-class visual anomaly detection.
arXiv Detail & Related papers (2024-06-05T13:40:07Z)
A Survey of Artificial Intelligence in Gait-Based Neurodegenerative Disease Diagnosis [51.07114445705692]
neurodegenerative diseases (NDs) traditionally require extensive healthcare resources and human effort for medical diagnosis and monitoring. As a crucial disease-related motor symptom, human gait can be exploited to characterize different NDs. The current advances in artificial intelligence (AI) models enable automatic gait analysis for NDs identification and classification.
arXiv Detail & Related papers (2024-05-21T06:44:40Z)
Challenges for Responsible AI Design and Workflow Integration in Healthcare: A Case Study of Automatic Feeding Tube Qualification in Radiology [35.284458448940796]
Nasogastric tubes (NGTs) are feeding tubes that are inserted through the nose into the stomach to deliver nutrition or medication. Recent AI developments demonstrate the feasibility of robustly detecting NGT placement from Chest X-ray images. We present a human-centered approach to the problem and describe insights derived following contextual inquiry and in-depth interviews with 15 clinical stakeholders.
arXiv Detail & Related papers (2024-05-08T14:16:22Z)
BESTMVQA: A Benchmark Evaluation System for Medical Visual Question Answering [8.547600133510551]
This paper develops a Benchmark Evaluation SysTem for Medical Visual Question Answering, denoted by BESTMVQA. Our system provides a useful tool for users to automatically build Med-VQA datasets, which helps overcoming the data insufficient problem. With simple configurations, our system automatically trains and evaluates the selected models over a benchmark dataset.
arXiv Detail & Related papers (2023-12-13T03:08:48Z)
From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing. This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time. We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
Informing clinical assessment by contextualizing post-hoc explanations of risk prediction models in type-2 diabetes [50.8044927215346]
We consider a comorbidity risk prediction scenario and focus on contexts regarding the patients clinical state. We employ several state-of-the-art LLMs to present contexts around risk prediction model inferences and evaluate their acceptability. Our paper is one of the first end-to-end analyses identifying the feasibility and benefits of contextual explanations in a real-world clinical use case.
arXiv Detail & Related papers (2023-02-11T18:07:11Z)
Weakly Supervised Anomaly Detection: A Survey [75.26180038443462]
Anomaly detection (AD) is a crucial task in machine learning with various applications. We present the first comprehensive survey of weakly supervised anomaly detection (WSAD) methods. For each setting, we provide formal definitions, key algorithms, and potential future directions.
arXiv Detail & Related papers (2023-02-09T10:27:21Z)
Benchmark datasets driving artificial intelligence development fail to capture the needs of medical professionals [4.799783526620609]
We released a catalogue of datasets and benchmarks pertaining to the broad domain of clinical and biomedical natural language processing (NLP) A total of 450 NLP datasets were manually systematized and annotated with rich metadata. Our analysis indicates that AI benchmarks of direct clinical relevance are scarce and fail to cover most work activities that clinicians want to see addressed.
arXiv Detail & Related papers (2022-01-18T15:05:28Z)
Hemogram Data as a Tool for Decision-making in COVID-19 Management: Applications to Resource Scarcity Scenarios [62.997667081978825]
COVID-19 pandemics has challenged emergency response systems worldwide, with widespread reports of essential services breakdown and collapse of health care structure. This work describes a machine learning model derived from hemogram exam data performed in symptomatic patients. Proposed models can predict COVID-19 qRT-PCR results in symptomatic individuals with high accuracy, sensitivity and specificity.
arXiv Detail & Related papers (2020-05-10T01:45:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.