Related papers: Problems and shortcuts in deep learning for screening mammography

URL: http://arxiv.org/abs/2303.16417v1
Date: Wed, 29 Mar 2023 02:50:59 GMT
Title: Problems and shortcuts in deep learning for screening mammography
Authors: Trevor Tsue, Brent Mombourquette, Ahmed Taha, Thomas Paul Matthews, Yen Nhi Truong Vu, Jason Su
Abstract summary: This work reveals undiscovered challenges in the performance and generalizability of deep learning models. We trained an AI model to classify cancer on a retrospective dataset of 120,112 US exams (3,467 cancers) acquired from 2008 to 2017. We evaluated on a screening mammography test set of 11,593 US exams (102 cancers; 7,594 women; age 57.1 \pm 11.0) and 1,880 UK exams (590 cancers; 1,745 women; age 63.3 \pm 7.2).
Score: 2.9033848132822726
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This work reveals undiscovered challenges in the performance and generalizability of deep learning models. We (1) identify spurious shortcuts and evaluation issues that can inflate performance and (2) propose training and analysis methods to address them. We trained an AI model to classify cancer on a retrospective dataset of 120,112 US exams (3,467 cancers) acquired from 2008 to 2017 and 16,693 UK exams (5,655 cancers) acquired from 2011 to 2015. We evaluated on a screening mammography test set of 11,593 US exams (102 cancers; 7,594 women; age 57.1 \pm 11.0) and 1,880 UK exams (590 cancers; 1,745 women; age 63.3 \pm 7.2). A model trained on images of only view markers (no breast) achieved a 0.691 AUC. The original model trained on both datasets achieved a 0.945 AUC on the combined US+UK dataset but paradoxically only 0.838 and 0.892 on the US and UK datasets, respectively. Sampling cancers equally from both datasets during training mitigated this shortcut. A similar AUC paradox (0.903) occurred when evaluating diagnostic exams vs screening exams (0.862 vs 0.861, respectively). Removing diagnostic exams during training alleviated this bias. Finally, the model did not exhibit the AUC paradox over scanner models but still exhibited a bias toward Selenia Dimension (SD) over Hologic Selenia (HS) exams. Analysis showed that this AUC paradox occurred when a dataset attribute had values with a higher cancer prevalence (dataset bias) and the model consequently assigned a higher probability to these attribute values (model bias). Stratification and balancing cancer prevalence can mitigate shortcuts during evaluation. Dataset and model bias can introduce shortcuts and the AUC paradox, potentially pervasive issues within the healthcare AI space. Our methods can verify and mitigate shortcuts while providing a clear understanding of performance.
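
The AUC paradox described in the abstract can be reproduced on synthetic data. The sketch below is an illustration only, not the authors' code or data: the two "sites", their sizes, prevalences, score distributions, and the helper names `auc` and `simulate_site` are all assumptions chosen for the example. Site B has a higher cancer prevalence (dataset bias) and every site B score is shifted upward by a constant (model bias), while the ability to separate cancers from non-cancers within each site is identical.

```python
import random


def auc(scores, labels):
    """AUC via the Mann-Whitney rank-sum formula, using average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the group of scores tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def simulate_site(n, prevalence, offset, rng):
    """Hypothetical site: same class separation everywhere, plus a site-wide score offset."""
    n_pos = round(n * prevalence)
    labels = [1] * n_pos + [0] * (n - n_pos)
    scores = [rng.gauss(1.0 if y else 0.0, 1.0) + offset for y in labels]
    return scores, labels


rng = random.Random(0)
# Assumed values: site A has low prevalence and no offset,
# site B has high prevalence and a positive offset.
scores_a, labels_a = simulate_site(3000, 0.05, 0.0, rng)
scores_b, labels_b = simulate_site(3000, 0.30, 2.0, rng)

auc_a = auc(scores_a, labels_a)
auc_b = auc(scores_b, labels_b)
auc_combined = auc(scores_a + scores_b, labels_a + labels_b)

print(f"site A AUC:   {auc_a:.3f}")
print(f"site B AUC:   {auc_b:.3f}")
print(f"combined AUC: {auc_combined:.3f}")
```

In this setup the combined AUC comes out higher than either per-site AUC, because many positive-negative pairs are separated by the site offset rather than by any cancer-related signal. Reporting the per-site (stratified) AUCs alongside the combined figure makes the inflation visible.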

Related papers

Predicting Length of Stay in Neurological ICU Patients Using Classical Machine Learning and Neural Network Models: A Benchmark Study on MIMIC-IV [49.1574468325115]
This study explores multiple ML approaches for predicting length of stay (LOS) in the ICU for patients with neurological diseases, based on the MIMIC-IV dataset. The evaluated models include classic ML algorithms (K-Nearest Neighbors, Random Forest, XGBoost and CatBoost) and Neural Networks (LSTM, BERT and Temporal Fusion Transformer).
arXiv Detail & Related papers (2025-05-23T14:06:42Z)
Subgroup Performance of a Commercial Digital Breast Tomosynthesis Model for Breast Cancer Detection [5.089670339445636]
This study presents a granular evaluation of the Lunit INSIGHT model on a large retrospective cohort of 163,449 screening mammography exams. Performance was found to be robust across demographics, but cases with non-invasive cancers were associated with significantly lower performance.
arXiv Detail & Related papers (2025-03-17T17:17:36Z)
Artificial Intelligence-Based Triaging of Cutaneous Melanocytic Lesions [0.8864540224289991]
Pathologists are facing an increasing workload due to a growing volume of cases and the need for more comprehensive diagnoses. We developed an artificial intelligence (AI) model for triaging cutaneous melanocytic lesions based on whole slide images.
arXiv Detail & Related papers (2024-10-14T13:49:04Z)
Incorporating Anatomical Awareness for Enhanced Generalizability and Progression Prediction in Deep Learning-Based Radiographic Sacroiliitis Detection [0.8248058061511542]
The aim of this study was to examine whether incorporating anatomical awareness into a deep learning model can improve generalizability and enable prediction of disease progression. The performance of the models was compared using the area under the receiver operating characteristic curve (AUC), accuracy, sensitivity, and specificity.
arXiv Detail & Related papers (2024-05-12T20:02:25Z)
AI in Lung Health: Benchmarking Detection and Diagnostic Models Across Multiple CT Scan Datasets [0.33923727961771083]
Lung cancer's high mortality rate can be mitigated by early detection, increasingly reliant on AI for diagnostic imaging. This study develops and validates AI models for both nodule detection and cancer classification tasks.
arXiv Detail & Related papers (2024-05-07T18:36:40Z)
Using Pre-training and Interaction Modeling for ancestry-specific disease prediction in UK Biobank [69.90493129893112]
Recent genome-wide association studies (GWAS) have uncovered the genetic basis of complex traits, but show an under-representation of non-European descent individuals. Here, we assess whether we can improve disease prediction across diverse ancestries using multiomic data.
arXiv Detail & Related papers (2024-04-26T16:39:50Z)
Detection of subclinical atherosclerosis by image-based deep learning on chest x-ray [86.38767955626179]
A deep-learning algorithm to predict the coronary artery calcium (CAC) score was developed on 460 chest x-rays. The diagnostic accuracy of the AICAC model, assessed by the area under the curve (AUC), was the primary outcome.
arXiv Detail & Related papers (2024-03-27T16:56:14Z)
Performance of externally validated machine learning models based on histopathology images for the diagnosis, classification, prognosis, or treatment outcome prediction in female breast cancer: A systematic review [0.5792122879054292]
This systematic review covers externally validated machine learning models for diagnosis, classification, prognosis, or treatment outcome prediction in female breast cancer. Three studies externally validated ML models for diagnosis, 4 for classification, 2 for prognosis, and 1 for both classification and prognosis. Most studies used Convolutional Neural Networks and one used logistic regression algorithms.
arXiv Detail & Related papers (2023-12-09T18:27:56Z)
Revisiting Computer-Aided Tuberculosis Diagnosis [56.80999479735375]
Tuberculosis (TB) is a major global health threat, causing millions of deaths annually. Computer-aided tuberculosis diagnosis (CTD) using deep learning has shown promise, but progress is hindered by limited training data. We establish a large-scale dataset, namely the Tuberculosis X-ray (TBX11K) dataset, which contains 11,200 chest X-ray (CXR) images with corresponding bounding box annotations for TB areas. This dataset enables the training of sophisticated detectors for high-quality CTD.
arXiv Detail & Related papers (2023-07-06T08:27:48Z)
Building Brains: Subvolume Recombination for Data Augmentation in Large Vessel Occlusion Detection [56.67577446132946]
A large training data set is required for a standard deep learning-based model to learn this strategy from data. We propose an augmentation method that generates artificial training samples by recombining vessel tree segmentations of the hemispheres from different patients. In line with the augmentation scheme, we use a 3D-DenseNet fed with task-specific input, fostering a side-by-side comparison between the hemispheres.
arXiv Detail & Related papers (2022-05-05T10:31:57Z)
Deep learning-based COVID-19 pneumonia classification using chest CT images: model generalizability [54.86482395312936]
Deep learning (DL) classification models were trained to identify COVID-19-positive patients on 3D computed tomography (CT) datasets from different countries. We trained nine identical DL-based classification models by using combinations of the datasets with a 72% train, 8% validation, and 20% test data split. The models trained on multiple datasets and evaluated on a test set from one of the datasets used for training performed better.
arXiv Detail & Related papers (2021-02-18T21:14:52Z)
Deep Learning Applied to Chest X-Rays: Exploiting and Preventing Shortcuts [11.511323714777298]
This paper studies the case of spurious class skew in which patients with a particular attribute are spuriously more likely to have the outcome of interest. We show that deep nets can accurately identify many patient attributes including sex (AUROC = 0.96) and age (AUROC >= 0.90) when learning to predict a diagnosis. A simple transfer learning approach is surprisingly effective at preventing the shortcut and promoting good performance.
arXiv Detail & Related papers (2020-09-21T18:52:43Z)
Automated Quantification of CT Patterns Associated with COVID-19 from Chest CT [48.785596536318884]
The proposed method takes as input a non-contrasted chest CT and segments the lesions, lungs, and lobes in three dimensions. The method outputs two combined measures of the severity of lung and lobe involvement, quantifying both the extent of COVID-19 abnormalities and presence of high opacities. Evaluation of the algorithm is reported on CTs of 200 participants (100 COVID-19 confirmed patients and 100 healthy controls) from institutions from Canada, Europe and the United States.
arXiv Detail & Related papers (2020-04-02T21:49:14Z)

This list is automatically generated from the titles and abstracts of the papers on this site.