Related papers: Validity problems in clinical machine learning by indirect data labeling using consensus definitions

URL: http://arxiv.org/abs/2311.03037v1
Date: Mon, 6 Nov 2023 11:14:48 GMT
Title: Validity problems in clinical machine learning by indirect data labeling using consensus definitions
Authors: Michael Hagmann and Shigehiko Schamoni and Stefan Riezler
Abstract summary: We demonstrate a validity problem of machine learning in the vital application area of disease diagnosis in medicine. It arises when target labels in training data are determined by an indirect measurement, and the fundamental measurements needed to determine this indirect measurement are included in the input data representation.
Score: 18.18186817228833
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We demonstrate a validity problem of machine learning in the vital application area of disease diagnosis in medicine. It arises when target labels in training data are determined by an indirect measurement, and the fundamental measurements needed to determine this indirect measurement are included in the input data representation. Machine learning models trained on this data will learn nothing else but to exactly reconstruct the known target definition. Such models show perfect performance on similarly constructed test data but will fail catastrophically on real-world examples where the defining fundamental measurements are not or only incompletely available. We present a general procedure allowing identification of problematic datasets and black-box machine learning models trained on them, and exemplify our detection procedure on the task of early prediction of sepsis.
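The circularity described in the abstract can be illustrated with a small synthetic sketch. This is not the authors' detection procedure: the feature names, the cutoffs (90 and 38.0), the toy labeling rule, and the simple bound-learning "model" are all assumptions made for illustration. The label is computed deterministically from two measurements that are also given to the model as inputs, so the model only recovers the labeling rule. It then fails once one defining measurement is unavailable and replaced by a default value.

```python
import random


def sample(rng):
    """Draw one synthetic patient; features and cutoffs are invented."""
    hr = rng.uniform(50.0, 140.0)    # hypothetical heart rate
    temp = rng.uniform(35.0, 40.5)   # hypothetical body temperature
    # Toy "consensus definition": the label is a deterministic function
    # of the same measurements that the model receives as input.
    label = int(hr > 90.0 and temp > 38.0)
    return (hr, temp), label


def fit(data):
    """Learn per-feature lower bounds from the positive examples."""
    positives = [x for x, y in data if y == 1]
    return min(h for h, _ in positives), min(t for _, t in positives)


def predict(model, x):
    hr_min, temp_min = model
    hr, temp = x
    return int(hr >= hr_min and temp >= temp_min)


def recall(model, data):
    """Fraction of truly positive cases that the model flags."""
    positives = [x for x, y in data if y == 1]
    return sum(predict(model, x) for x in positives) / len(positives)


rng = random.Random(0)
train = [sample(rng) for _ in range(2000)]
test = [sample(rng) for _ in range(2000)]
model = fit(train)

# Simulated deployment: temperature is not measured and is imputed with
# a normal value, so the defining measurement is effectively missing.
deployed = [((hr, 37.0), y) for (hr, _), y in test]

recall_full = recall(model, test)
recall_missing = recall(model, deployed)
print("learned bounds:", model)
print("recall with defining measurements:", recall_full)
print("recall with temperature imputed:", recall_missing)
```

In this sketch the learned bounds sit just above the cutoffs used to create the labels, which is the sense in which the model reconstructs the known definition. Recall is used instead of accuracy because most synthetic patients are negative, so accuracy alone would hide the failure.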

Related papers

MIEO: encoding clinical data to enhance cardiovascular event prediction [31.458406135473805]
Machine learning methods have been employed to extract knowledge from clinical data and predict clinical events. While promising, approaches suffer from at least two main issues: low availability of labelled data and data leading to missing values. This work proposes the use of self-supervised auto-encoders to efficiently address these challenges.
arXiv Detail & Related papers (2025-10-13T10:47:49Z)
Diagnosing Medical Datasets with Training Dynamics [0.0]
This study explores the potential of using training dynamics as an automated alternative to human annotation. The framework used is Data Maps, which classifies data points into categories such as easy-to-learn, hard-to-learn, and ambiguous. A comprehensive evaluation was conducted to assess the feasibility and transferability of the Data Maps framework to the medical domain.
arXiv Detail & Related papers (2024-11-03T18:37:35Z)
Transfer Learning for Real-time Deployment of a Screening Tool for Depression Detection Using Actigraphy [8.430502131775722]
We present an approach based on transfer learning, from a model trained on a secondary dataset, for the real-time deployment of a depression screening tool based on users' actigraphy data. A modified leave-one-out cross-validation approach on the primary set, in which one subject's data was set aside for testing in each fold, resulted in a mean accuracy of 0.96.
arXiv Detail & Related papers (2023-03-14T12:37:22Z)
Self-Supervised Learning as a Means To Reduce the Need for Labeled Data in Medical Image Analysis [64.4093648042484]
We use a dataset of chest X-ray images with bounding box labels for 13 different classes of anomalies. We show that it is possible to achieve similar performance to a fully supervised model in terms of mean average precision and accuracy with only 60% of the labeled data.
arXiv Detail & Related papers (2022-06-01T09:20:30Z)
Data-SUITE: Data-centric identification of in-distribution incongruous examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data. We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z)
False perfection in machine prediction: Detecting and assessing circularity problems in machine learning [11.878820609988695]
We demonstrate a problem of machine learning in vital application areas such as medical informatics or patent law. The inclusion of measurements on which target outputs are deterministically defined in the representations of input data leads to perfect, but circular predictions. We argue that transferring research results to real-world applications requires avoiding circularity by separating measurements that define target outcomes from data representations.
arXiv Detail & Related papers (2021-06-23T14:11:06Z)
Attack-agnostic Adversarial Detection on Medical Data Using Explainable Machine Learning [0.0]
We propose a model-agnostic, explainability-based method for the accurate detection of adversarial samples on two datasets. On the MIMIC-III and Henan-Renmin EHR datasets, we report a detection accuracy of 77% against the Longitudinal Adversarial Attack. On the MIMIC-CXR dataset, we achieve an accuracy of 88%, significantly improving on the state of the art of adversarial detection in both datasets by over 10% in all settings.
arXiv Detail & Related papers (2021-05-05T10:01:53Z)
Graph Embedding with Data Uncertainty [113.39838145450007]
Spectral-based subspace learning is a common data preprocessing step in many machine learning pipelines. Most subspace learning methods do not take into consideration possible measurement inaccuracies or artifacts that can lead to data with high uncertainty.
arXiv Detail & Related papers (2020-09-01T15:08:23Z)
Trajectories, bifurcations and pseudotime in large clinical datasets: applications to myocardial infarction and diabetes data [94.37521840642141]
We suggest a semi-supervised methodology for the analysis of large clinical datasets, characterized by mixed data types and missing values. The methodology is based on application of elastic principal graphs which can address simultaneously the tasks of dimensionality reduction, data visualization, clustering, feature selection and quantifying the geodesic distances (pseudotime) in partially ordered sequences of observations.
arXiv Detail & Related papers (2020-07-07T21:04:55Z)
Semi-supervised Medical Image Classification with Relation-driven Self-ensembling Model [71.80319052891817]
We present a relation-driven semi-supervised framework for medical image classification. It exploits the unlabeled data by encouraging the prediction consistency of given input under perturbations. Our method outperforms many state-of-the-art semi-supervised learning methods on both single-label and multi-label image classification scenarios.
arXiv Detail & Related papers (2020-05-15T06:57:54Z)
An Extensive Study on Cross-Dataset Bias and Evaluation Metrics Interpretation for Machine Learning applied to Gastrointestinal Tract Abnormality Classification [2.985964157078619]
Automatic analysis of diseases in the GI tract is a hot topic in computer science and medical-related journals. A clear understanding of evaluation metrics and machine learning models with cross datasets is crucial to bring research in the field to a new quality level. We present comprehensive evaluations of five distinct machine learning models that can classify 16 different GI tract conditions.
arXiv Detail & Related papers (2020-05-08T08:59:31Z)
Self-Training with Improved Regularization for Sample-Efficient Chest X-Ray Classification [80.00316465793702]
We present a deep learning framework that enables robust modeling in challenging scenarios. Our results show that, using 85% less labeled data, we can build predictive models that match the performance of classifiers trained in a large-scale data setting.
arXiv Detail & Related papers (2020-05-03T02:36:00Z)

This list is automatically generated from the titles and abstracts of the papers on this site.