SUDO: a framework for evaluating clinical artificial intelligence systems without ground-truth annotations
- URL: http://arxiv.org/abs/2403.17011v1
- Date: Tue, 2 Jan 2024 18:12:03 GMT
- Title: SUDO: a framework for evaluating clinical artificial intelligence systems without ground-truth annotations
- Authors: Dani Kiyasseh, Aaron Cohen, Chengsheng Jiang, Nicholas Altieri, et al.
- Abstract summary: We introduce SUDO, a framework for evaluating AI systems without ground-truth annotations.
We show that SUDO can be a reliable proxy for model performance and thus identify unreliable predictions.
- Score: 3.7525007896336944
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: A clinical artificial intelligence (AI) system is often validated on a held-out set of data which it has not been exposed to before (e.g., data from a different hospital with a distinct electronic health record system). This evaluation process is meant to mimic the deployment of an AI system on data in the wild: data currently unseen by the system yet expected to be encountered in a clinical setting. However, when data in the wild differ from the held-out set of data (a phenomenon referred to as distribution shift) and lack ground-truth annotations, it becomes unclear to what extent AI-based findings can be trusted. Here, we introduce SUDO, a framework for evaluating AI systems without ground-truth annotations. SUDO assigns temporary labels to data points in the wild and directly uses them to train distinct models, with the highest-performing model indicative of the most likely label. Through experiments with AI systems developed for dermatology images, histopathology patches, and clinical reports, we show that SUDO can be a reliable proxy for model performance and can thus identify unreliable predictions. We also demonstrate that SUDO informs the selection of models and allows for the previously out-of-reach assessment of algorithmic bias on data in the wild without ground-truth annotations. The ability to triage unreliable predictions for further inspection and to assess the algorithmic bias of AI systems can improve the integrity of research findings and contribute to the deployment of ethical AI systems in medicine.
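The labeling loop at the heart of SUDO is compact enough to sketch. Below is a minimal illustration of the pseudo-label idea from the abstract, assuming tabular features and a logistic-regression probe; the exact evaluation protocol, function names, and data splits are assumptions, and the authors' experiments use deep models rather than this toy setup.

```python
# Minimal sketch of the SUDO pseudo-label idea (not the authors' code).
# Each candidate label is temporarily assigned to the unlabeled "wild"
# points, a probe model is trained on labeled + pseudo-labeled data, and
# the candidate whose probe scores best on a labeled validation set is
# taken as the most likely label.
import numpy as np
from sklearn.linear_model import LogisticRegression

def sudo_scores(X_train, y_train, X_val, y_val, X_wild, candidate_labels):
    """Return {candidate label: validation accuracy of its probe}."""
    scores = {}
    for label in candidate_labels:
        pseudo_y = np.full(len(X_wild), label)          # temporary labels
        X = np.vstack([X_train, X_wild])
        y = np.concatenate([y_train, pseudo_y])
        probe = LogisticRegression(max_iter=1000).fit(X, y)
        scores[label] = probe.score(X_val, y_val)
    return scores
```

Wild points whose best and second-best candidate scores are close can then be triaged as unreliable predictions for manual review, which is how SUDO is used to flag untrustworthy outputs.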
Related papers
- Refining Tuberculosis Detection in CXR Imaging: Addressing Bias in Deep Neural Networks via Interpretability [1.9936075659851882]
We argue that the reliability of deep learning models is limited, even if they can be shown to obtain perfect classification accuracy on the test data.
We show that pre-training a deep neural network on a large-scale proxy task, as well as using a mixed objective optimization network (MOON), can improve the alignment of decision foundations between models and experts.
arXiv Detail & Related papers (2024-07-19T06:41:31Z)
- Detecting algorithmic bias in medical-AI models using trees [7.939586935057782]
This paper presents an innovative framework for detecting areas of algorithmic bias in medical-AI decision support systems.
Our approach efficiently identifies potential biases in medical-AI models, specifically in the context of sepsis prediction.
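The one-line summary suggests a tree-based search for poorly served subgroups. Here is a hedged sketch of that general idea, not the paper's exact method; the feature names, error definition, and tree depth are all assumptions:

```python
# Hedged sketch: localize subgroups with elevated error rates by fitting a
# shallow decision tree to per-patient model mistakes. Leaves that
# concentrate errors describe candidate bias regions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

def find_bias_regions(X, y_true, y_pred, feature_names, max_depth=3):
    errors = (y_true != y_pred).astype(int)   # 1 where the model erred
    tree = DecisionTreeClassifier(max_depth=max_depth).fit(X, errors)
    # Human-readable rules such as "age <= 40 and lactate > 2.1", which
    # could flag a subgroup a sepsis model systematically misclassifies.
    return export_text(tree, feature_names=list(feature_names))
```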
arXiv Detail & Related papers (2023-12-05T18:47:34Z)
- New Epochs in AI Supervision: Design and Implementation of an Autonomous Radiology AI Monitoring System [5.50085484902146]
We introduce novel methods for monitoring the performance of radiology AI classification models in practice.
We propose two metrics - predictive divergence and temporal stability - to be used for preemptive alerts of AI performance changes.
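Under one plausible reading of these two metrics, predictive divergence measures disagreement between the deployed model and a surrogate, and temporal stability measures drift of the prediction distribution across time windows; the paper's exact definitions may differ. A sketch under those assumptions:

```python
# Hedged sketch of two monitoring signals; the definitions are assumptions.
import numpy as np
from scipy.stats import entropy, wasserstein_distance

def predictive_divergence(p_primary, p_surrogate, eps=1e-8):
    """Mean KL divergence between two models' predicted class distributions."""
    p = np.clip(p_primary, eps, 1.0)
    q = np.clip(p_surrogate, eps, 1.0)
    return float(np.mean([entropy(pi, qi) for pi, qi in zip(p, q)]))

def temporal_stability(scores_reference, scores_current):
    """Distance between prediction-score distributions of two time windows."""
    return float(wasserstein_distance(scores_reference, scores_current))
```

A preemptive alert could fire when either signal exceeds a threshold calibrated on historical windows.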
arXiv Detail & Related papers (2023-11-24T06:29:04Z)
- TREEMENT: Interpretable Patient-Trial Matching via Personalized Dynamic Tree-Based Memory Network [54.332862955411656]
Clinical trials are critical for drug development but often suffer from expensive and inefficient patient recruitment.
In recent years, machine learning models have been proposed for speeding up patient recruitment via automatically matching patients with clinical trials.
We introduce a dynamic tree-based memory network model named TREEMENT to provide accurate and interpretable patient trial matching.
arXiv Detail & Related papers (2023-07-19T12:35:09Z)
- Explainable AI for Malnutrition Risk Prediction from m-Health and Clinical Data [3.093890460224435]
This paper presents a novel AI framework for early and explainable malnutrition risk detection based on heterogeneous m-health data.
We performed an extensive model evaluation including both subject-independent and personalised predictions.
We also investigated several benchmark XAI methods to extract global model explanations.
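As an illustration of what a global explanation can look like, here is a minimal permutation-importance sketch; it is one generic XAI method of the kind the summary mentions, not necessarily one the paper benchmarks, and the estimator and data names are hypothetical:

```python
# Hedged sketch: rank features by how much shuffling each one degrades
# validation performance (a global, model-agnostic explanation).
from sklearn.inspection import permutation_importance

def global_explanation(model, X_val, y_val, feature_names):
    result = permutation_importance(model, X_val, y_val,
                                    n_repeats=10, random_state=0)
    return sorted(zip(feature_names, result.importances_mean),
                  key=lambda pair: -pair[1])
```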
arXiv Detail & Related papers (2023-05-31T08:07:35Z)
- Interpretable Self-Aware Neural Networks for Robust Trajectory Prediction [50.79827516897913]
We introduce an interpretable paradigm for trajectory prediction that distributes the uncertainty among semantic concepts.
We validate our approach on real-world autonomous driving data, demonstrating superior performance over state-of-the-art baselines.
arXiv Detail & Related papers (2022-11-16T06:28:20Z)
- DrugOOD: Out-of-Distribution (OOD) Dataset Curator and Benchmark for AI-aided Drug Discovery -- A Focus on Affinity Prediction Problems with Noise Annotations [90.27736364704108]
We present DrugOOD, a systematic OOD dataset curator and benchmark for AI-aided drug discovery.
DrugOOD comes with an open-source Python package that fully automates benchmarking processes.
We focus on one of the most crucial problems in AIDD: drug target binding affinity prediction.
arXiv Detail & Related papers (2022-01-24T12:32:48Z)
- Variational Knowledge Distillation for Disease Classification in Chest X-Rays [102.04931207504173]
We propose variational knowledge distillation (VKD), a new probabilistic inference framework for disease classification based on X-rays.
We demonstrate the effectiveness of our method on three public benchmark datasets with paired X-ray images and EHRs.
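A hedged sketch of a variational distillation loss in the spirit of this summary: a latent code whose posterior sees both the X-ray and the EHR, regularized toward an image-only prior so the image branch can run alone at test time. The Gaussian parameterization, network names, and loss weighting are illustrative assumptions, not the paper's architecture.

```python
# Hedged sketch of a VKD-style loss (assumptions, not the paper's code).
import torch
import torch.nn.functional as F

def vkd_loss(img_feat, ehr_feat, labels, prior_net, posterior_net, classifier):
    # Image-only prior p(z|x) and image+EHR posterior q(z|x,e), both
    # diagonal Gaussians parameterized as (mu, logvar).
    mu_p, logvar_p = prior_net(img_feat).chunk(2, dim=-1)
    mu_q, logvar_q = posterior_net(
        torch.cat([img_feat, ehr_feat], dim=-1)).chunk(2, dim=-1)
    z = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()  # reparameterize
    ce = F.cross_entropy(classifier(z), labels)                 # task loss
    kl = 0.5 * (logvar_p - logvar_q
                + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                - 1).sum(-1).mean()                             # KL(q || p)
    return ce + kl
```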
arXiv Detail & Related papers (2021-03-19T14:13:56Z)
- Detecting Spurious Correlations with Sanity Tests for Artificial Intelligence Guided Radiology Systems [22.249702822013045]
A critical component to deploying AI in radiology is to gain confidence in a developed system's efficacy and safety.
The current gold standard approach is to conduct an analytical validation of performance on a generalization dataset.
We describe a series of sanity tests to identify when a system performs well on development data for the wrong reasons.
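One common sanity test consistent with this description: if masking the clinically relevant region barely changes performance, the model may be relying on spurious cues (laterality markers, scanner artifacts, and so on). A minimal sketch, with `model`, `images`, and `masks` as hypothetical names:

```python
# Hedged sketch of an occlusion sanity test (names are hypothetical).
import numpy as np

def occlusion_sanity_test(model, images, masks, labels, tol=0.05):
    """Flag a model whose accuracy survives removal of the relevant anatomy."""
    base_acc = np.mean(model.predict(images) == labels)
    occluded = images * (1 - masks)           # zero out the relevant region
    occ_acc = np.mean(model.predict(occluded) == labels)
    suspicious = (base_acc - occ_acc) < tol   # small drop => wrong reasons
    return base_acc, occ_acc, suspicious
```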
arXiv Detail & Related papers (2021-03-04T14:14:05Z)
- UNITE: Uncertainty-based Health Risk Prediction Leveraging Multi-sourced Data [81.00385374948125]
We present the UNcertaInTy-based hEalth risk prediction (UNITE) model.
UNITE provides accurate disease risk prediction and uncertainty estimation leveraging multi-sourced health data.
We evaluate UNITE on real-world disease risk prediction tasks: nonalcoholic steatohepatitis (NASH) and Alzheimer's disease (AD).
UNITE achieves up to 0.841 in F1 score for AD detection and up to 0.609 in PR-AUC for NASH detection, outperforming the best state-of-the-art baseline by up to 19%.
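As a generic stand-in for how a risk score and an uncertainty estimate can be produced together (UNITE's actual model is more elaborate), a deep-ensemble sketch:

```python
# Hedged sketch: ensemble mean as risk, ensemble spread as uncertainty.
import numpy as np

def ensemble_risk(models, X):
    preds = np.stack([m.predict_proba(X)[:, 1] for m in models])  # (M, N)
    risk = preds.mean(axis=0)        # predicted disease risk per patient
    uncertainty = preds.std(axis=0)  # disagreement across ensemble members
    return risk, uncertainty
```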
arXiv Detail & Related papers (2020-10-22T02:28:11Z)
- Semi-supervised Medical Image Classification with Relation-driven Self-ensembling Model [71.80319052891817]
We present a relation-driven semi-supervised framework for medical image classification.
It exploits unlabeled data by encouraging the prediction consistency of a given input under perturbations.
Our method outperforms many state-of-the-art semi-supervised learning methods on both single-label and multi-label image classification scenarios.
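The consistency idea reads as standard consistency regularization; a minimal sketch under that assumption (the relation-driven term of the paper is omitted, and Gaussian noise stands in for its perturbations):

```python
# Hedged sketch of a prediction-consistency loss on unlabeled images.
import torch
import torch.nn.functional as F

def consistency_loss(model, x_unlabeled, noise_std=0.1):
    with torch.no_grad():
        p_clean = F.softmax(model(x_unlabeled), dim=-1)    # target predictions
    x_perturbed = x_unlabeled + noise_std * torch.randn_like(x_unlabeled)
    p_perturbed = F.softmax(model(x_perturbed), dim=-1)
    return F.mse_loss(p_perturbed, p_clean)  # penalize unstable predictions
```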
arXiv Detail & Related papers (2020-05-15T06:57:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.