Robustness Stress Testing in Medical Image Classification
- URL: http://arxiv.org/abs/2308.06889v2
- Date: Fri, 15 Sep 2023 08:51:28 GMT
- Title: Robustness Stress Testing in Medical Image Classification
- Authors: Mobarakol Islam and Zeju Li and Ben Glocker
- Abstract summary: We employ stress testing to assess model robustness and subgroup performance disparities in disease detection models.
We apply stress tests to measure the robustness of disease detection models for chest X-ray and skin lesion images.
Our experiments indicate that some models may yield more robust and equitable performance than others.
- Score: 26.094688963784254
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep neural networks have shown impressive performance for image-based
disease detection. Performance is commonly evaluated through clinical
validation on independent test sets to demonstrate clinically acceptable
accuracy. Reporting good performance metrics on test sets, however, is not
always a sufficient indication of the generalizability and robustness of an
algorithm. In particular, when the test data is drawn from the same
distribution as the training data, the iid test set performance can be an
unreliable estimate of the accuracy on new data. In this paper, we employ
stress testing to assess model robustness and subgroup performance disparities
in disease detection models. We design progressive stress testing using five
different bidirectional and unidirectional image perturbations with six
different severity levels. As a use case, we apply stress tests to measure the
robustness of disease detection models for chest X-ray and skin lesion images,
and demonstrate the importance of studying class and domain-specific model
behaviour. Our experiments indicate that some models may yield more robust and
equitable performance than others. We also find that pretraining
characteristics play an important role in downstream robustness. We conclude
that progressive stress testing is a viable and important tool and should
become standard practice in the clinical validation of image-based disease
detection models.
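The protocol described in the abstract is straightforward to prototype: apply each perturbation at increasing severity and record accuracy per perturbation, severity level, and subgroup. Below is a minimal, hypothetical sketch of such a loop; the perturbation functions, severity scale, `predict_fn`, and subgroup handling are illustrative assumptions, not the authors' implementation.
```python
# Minimal, hypothetical sketch of progressive stress testing: perturb each
# image at increasing severity and record accuracy per perturbation and
# severity level (and per subgroup, if subgroup labels are supplied).
# The perturbations and severity scales below are illustrative stand-ins,
# not the paper's exact protocol.
import numpy as np
from scipy.ndimage import gaussian_filter

def add_noise(img, s):        return img + np.random.normal(0.0, 0.02 * s, img.shape)
def blur(img, s):             return gaussian_filter(img, sigma=0.5 * s)
def shift_brightness(img, s): return img + 0.05 * s                      # unidirectional
def scale_contrast(img, s):   return (img - img.mean()) * (1 + 0.1 * s) + img.mean()

PERTURBATIONS = {"noise": add_noise, "blur": blur,
                 "brightness": shift_brightness, "contrast": scale_contrast}
SEVERITIES = range(6)   # 0 = unperturbed, 1..5 = increasing severity

def stress_test(predict_fn, images, labels, groups=None):
    """Return accuracy per (perturbation, severity), optionally per subgroup."""
    labels = np.asarray(labels)
    results = {}
    for name, fn in PERTURBATIONS.items():
        for s in SEVERITIES:
            perturbed = images if s == 0 else [np.clip(fn(x, s), 0.0, 1.0) for x in images]
            correct = np.array([predict_fn(x) for x in perturbed]) == labels
            results[(name, s)] = correct.mean()
            if groups is not None:
                for g in np.unique(groups):
                    results[(name, s, g)] = correct[np.asarray(groups) == g].mean()
    return results
```
Plotting accuracy against severity for each subgroup then surfaces the class- and domain-specific degradation patterns the abstract refers to.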
Related papers
- Robustness Testing of Black-Box Models Against CT Degradation Through Test-Time Augmentation [1.7788343872869767]
Deep learning models for medical image segmentation and object detection are becoming increasingly available as clinical products.
As details are rarely provided about the training data, models may unexpectedly fail when cases differ from those in the training distribution.
A method to test the robustness of these models against CT image quality variation is presented.
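The method summarized above probes a black-box model with test-time augmentations that emulate CT quality variation and checks how stable its outputs remain. A hedged sketch of that idea, using generic blur and noise as stand-ins for the paper's degradations and an assumed `model_fn` returning a probability map:
```python
# Hypothetical sketch: degrade a CT volume with simple surrogates for image
# quality loss and measure agreement between the degraded and baseline
# segmentations. `model_fn` and the degradations are assumptions.
import numpy as np
from scipy.ndimage import gaussian_filter

def dice(a, b):
    inter = np.logical_and(a, b).sum()
    return 2.0 * inter / max(a.sum() + b.sum(), 1)

def degradation_sensitivity(model_fn, volume, sigmas=(0.5, 1.0, 2.0), noise_sds=(5, 10, 20)):
    baseline = model_fn(volume) > 0.5
    scores = {}
    for sig in sigmas:
        scores[("blur", sig)] = dice(baseline, model_fn(gaussian_filter(volume, sig)) > 0.5)
    for sd in noise_sds:
        noisy = volume + np.random.normal(0.0, sd, volume.shape)
        scores[("noise", sd)] = dice(baseline, model_fn(noisy) > 0.5)
    return scores   # values near 1.0 suggest robustness to that degradation
```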
arXiv Detail & Related papers (2024-06-27T22:17:49Z) - Improving Robustness and Reliability in Medical Image Classification with Latent-Guided Diffusion and Nested-Ensembles [4.249986624493547]
Ensemble deep learning has been shown to achieve high predictive accuracy and useful uncertainty estimates.
However, perturbations in the input images at test time can still lead to significant performance degradation.
LaDiNE is a novel and robust probabilistic method that is capable of inferring informative and invariant latent variables from the input images.
arXiv Detail & Related papers (2023-10-24T15:53:07Z) - AI in the Loop -- Functionalizing Fold Performance Disagreement to
Monitor Automated Medical Image Segmentation Pipelines [0.0]
Methods for automatically flagging poor-performing predictions are essential for safely implementing machine learning in clinical practice.
We present a readily adoptable method using sub-models trained on different dataset folds, where their disagreement serves as a surrogate for model confidence.
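A hedged sketch of the fold-disagreement idea described above, with the sub-models passed in as plain prediction functions and the flagging threshold chosen arbitrarily:
```python
# Hedged sketch: K sub-models trained on different folds (elsewhere) vote on a
# case; the spread of their predictions acts as a confidence surrogate, and
# high-spread cases are flagged for review. The threshold is an assumption.
import numpy as np

def flag_by_disagreement(submodel_fns, image, threshold=0.5):
    """Return (mean prediction, disagreement score, flag) for one case."""
    preds = np.stack([fn(image) for fn in submodel_fns])   # (K, ...) predictions
    disagreement = float(preds.std(axis=0).mean())         # simple spread measure
    return preds.mean(axis=0), disagreement, disagreement > threshold
```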
arXiv Detail & Related papers (2023-05-15T21:35:23Z) - Zero-shot Model Diagnosis [80.36063332820568]
A common approach to evaluating deep learning models is to build a labeled test set with attributes of interest and assess how well the model performs on it.
This paper argues that Zero-shot Model Diagnosis (ZOOM) is possible without the need for a test set or labeling.
arXiv Detail & Related papers (2023-03-27T17:59:33Z) - Effective Robustness against Natural Distribution Shifts for Models with
Different Training Data [113.21868839569]
"Effective robustness" measures the extra out-of-distribution robustness beyond what can be predicted from the in-distribution (ID) performance.
We propose a new evaluation metric to evaluate and compare the effective robustness of models trained on different data.
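The paper proposes a new metric for comparing models trained on different data; the sketch below shows the standard effective-robustness calculation it builds on, under the common logit-scale linear-fit assumption, which may differ from the metric proposed here.
```python
# Hypothetical sketch: fit the ID-vs-OOD accuracy trend over baseline models,
# then report a candidate model's deviation from that trend as its effective
# robustness. The logit-scale linear fit is a common choice, assumed here.
import numpy as np

def logit(p):
    p = np.clip(p, 1e-6, 1 - 1e-6)
    return np.log(p / (1 - p))

def effective_robustness(baseline_id, baseline_ood, model_id, model_ood):
    slope, intercept = np.polyfit(logit(np.asarray(baseline_id)),
                                  logit(np.asarray(baseline_ood)), 1)
    predicted_ood = 1.0 / (1.0 + np.exp(-(slope * logit(model_id) + intercept)))
    return model_ood - predicted_ood   # > 0: more robust than the trend predicts
```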
arXiv Detail & Related papers (2023-02-02T19:28:41Z) - Failure Detection in Medical Image Classification: A Reality Check and
Benchmarking Testbed [23.25084022554028]
Failure detection in automated image classification is a critical safeguard for clinical deployment.
Despite its paramount importance, there is insufficient evidence about the ability of state-of-the-art confidence scoring methods to detect test-time failures.
This paper provides a reality check, establishing the performance of in-domain misclassification detection methods.
arXiv Detail & Related papers (2022-05-27T16:50:48Z) - MEMO: Test Time Robustness via Adaptation and Augmentation [131.28104376280197]
We study the problem of test time robustification, i.e., using the test input to improve model robustness.
Recent prior works have proposed methods for test-time adaptation; however, they each introduce additional assumptions.
We propose a simple approach that can be used in any test setting where the model is probabilistic and adaptable.
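A hedged sketch of the adaptation-plus-augmentation idea described above: augment a single test input, average the predicted class distributions, and take a gradient step that reduces the entropy of that marginal before predicting. The augmentations, step size, and step count are assumptions, not the paper's exact configuration.
```python
# Hedged sketch: adapt a probabilistic model on one test input by minimizing
# the entropy of its marginal prediction over cheap augmentations.
import torch
import torch.nn.functional as F

def augment(x, n=8):
    """Return n cheap augmentations of a single image tensor (C, H, W)."""
    views = []
    for _ in range(n):
        v = x.clone()
        if torch.rand(1).item() < 0.5:
            v = torch.flip(v, dims=[-1])        # horizontal flip
        v = v + 0.01 * torch.randn_like(v)      # light noise
        views.append(v)
    return torch.stack(views)

def adapt_and_predict(model, x, lr=1e-3, steps=1):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        probs = F.softmax(model(augment(x)), dim=-1)          # (n, num_classes)
        marginal = probs.mean(dim=0)
        entropy = -(marginal * marginal.clamp_min(1e-12).log()).sum()
        opt.zero_grad()
        entropy.backward()
        opt.step()
    with torch.no_grad():
        return model(x.unsqueeze(0)).softmax(dim=-1).argmax(dim=-1)
```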
arXiv Detail & Related papers (2021-10-18T17:55:11Z) - Confidence-based Out-of-Distribution Detection: A Comparative Study and
Analysis [17.398553230843717]
We assess the capability of various state-of-the-art approaches for confidence-based OOD detection.
First, we leverage a computer vision benchmark to reproduce and compare multiple OOD detection methods.
We then evaluate their capabilities on the challenging task of disease classification using chest X-rays.
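One widely used baseline in such comparisons is the maximum softmax probability. A minimal sketch of this confidence score (the threshold is an assumption and would be tuned on validation data in practice):
```python
# Hedged sketch of a standard confidence-based baseline: score each input by
# its maximum softmax probability and flag low-confidence inputs as OOD.
import torch
import torch.nn.functional as F

@torch.no_grad()
def max_softmax_scores(model, batch):
    return F.softmax(model(batch), dim=-1).max(dim=-1).values

def flag_ood(model, batch, threshold=0.7):
    scores = max_softmax_scores(model, batch)
    return scores < threshold, scores
```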
arXiv Detail & Related papers (2021-07-06T12:10:09Z) - Hemogram Data as a Tool for Decision-making in COVID-19 Management:
Applications to Resource Scarcity Scenarios [62.997667081978825]
The COVID-19 pandemic has challenged emergency response systems worldwide, with widespread reports of essential-service breakdowns and the collapse of health care structures.
This work describes a machine learning model derived from hemogram exams performed on symptomatic patients.
Proposed models can predict COVID-19 qRT-PCR results in symptomatic individuals with high accuracy, sensitivity and specificity.
arXiv Detail & Related papers (2020-05-10T01:45:03Z) - Self-Training with Improved Regularization for Sample-Efficient Chest
X-Ray Classification [80.00316465793702]
We present a deep learning framework that enables robust modeling in challenging scenarios.
Our results show that, using 85% less labeled data, we can build predictive models that match the performance of classifiers trained in a large-scale data setting.
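A generic, hedged sketch of the pseudo-labeling step behind self-training; the confidence threshold is an assumption, and the paper's improved regularization is not reproduced here.
```python
# Hedged sketch of generic self-training: keep only unlabeled cases the current
# model predicts with high confidence and add them to the training set with
# their predicted labels, then retrain.
import torch
import torch.nn.functional as F

@torch.no_grad()
def pseudo_label(model, unlabeled_batch, threshold=0.9):
    probs = F.softmax(model(unlabeled_batch), dim=-1)
    confidence, labels = probs.max(dim=-1)
    keep = confidence >= threshold
    return unlabeled_batch[keep], labels[keep]
```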
arXiv Detail & Related papers (2020-05-03T02:36:00Z) - Noisy Adaptive Group Testing using Bayesian Sequential Experimental
Design [63.48989885374238]
When the infection prevalence of a disease is low, Dorfman showed 80 years ago that testing groups of people can prove more efficient than testing people individually.
Our goal in this paper is to propose new group testing algorithms that can operate in a noisy setting.
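Dorfman's classic observation can be checked with a short calculation: with prevalence p and groups of size g, a group tests negative with probability (1 - p)^g, so two-stage testing needs 1/g + 1 - (1 - p)^g tests per person in expectation. The worked example below covers only this noiseless baseline; the paper itself addresses the noisy setting.
```python
# Worked example of classic (noiseless) two-stage Dorfman group testing:
# a group of size g tests negative with probability (1 - p)**g, so the
# expected number of tests per person is 1/g + 1 - (1 - p)**g.
def dorfman_tests_per_person(p: float, g: int) -> float:
    return 1.0 / g + 1.0 - (1.0 - p) ** g

# At 1% prevalence with groups of 10, about 0.196 tests per person suffice on
# average, versus 1 test per person for individual testing.
print(dorfman_tests_per_person(0.01, 10))   # ~0.196
```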
arXiv Detail & Related papers (2020-04-26T23:41:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.