Avoiding Biased Clinical Machine Learning Model Performance Estimates in
the Presence of Label Selection
- URL: http://arxiv.org/abs/2209.09188v1
- Date: Thu, 15 Sep 2022 22:30:14 GMT
- Authors: Conor K. Corbin, Michael Baiocchi, Jonathan H. Chen
- Abstract summary: We describe three classes of label selection and simulate five causally distinct scenarios to assess how particular selection mechanisms bias a suite of commonly reported binary machine learning model performance metrics.
We find that naive estimates of AUROC on the observed population undershoot actual performance by up to 20%.
Such a disparity could be large enough to lead to the wrongful termination of a successful clinical decision support tool.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When evaluating the performance of clinical machine learning models, one must
consider the deployment population. When the population of patients with
observed labels is only a subset of the deployment population (label
selection), standard model performance estimates on the observed population may
be misleading. In this study we describe three classes of label selection and
simulate five causally distinct scenarios to assess how particular selection
mechanisms bias a suite of commonly reported binary machine learning model
performance metrics. Simulations reveal that when selection is affected by
observed features, naive estimates of model discrimination may be misleading.
When selection is affected by labels, naive estimates of calibration fail to
reflect reality. We borrow traditional weighting estimators from causal
inference literature and find that when selection probabilities are properly
specified, they recover full population estimates. We then tackle the
real-world task of monitoring the performance of deployed machine learning
models whose interactions with clinicians feed back into and affect the selection
mechanism of the labels. We train three machine learning models to flag
low-yield laboratory diagnostics, and simulate their intended consequence of
reducing wasteful laboratory utilization. We find that naive estimates of AUROC
on the observed population undershoot actual performance by up to 20%. Such a
disparity could be large enough to lead to the wrongful termination of a
successful clinical decision support tool. We propose an altered deployment
procedure, one that combines injected randomization with traditional weighted
estimates, and find it recovers true model performance.
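The weighting idea in the abstract can be sketched with a small simulation (illustrative only, not the paper's code): when labels are observed only for a feature-dependent subset of patients, a naive AUROC on the observed set can diverge from the full-population AUROC, while reweighting each observed patient by the inverse of an assumed, correctly specified selection probability recovers it. The data-generating and selection models below are assumptions for the sake of the example.

```python
# Illustrative sketch (not the paper's code): inverse-probability-weighted
# AUROC under feature-driven label selection. Selection probabilities are
# assumed known and correctly specified, as in the paper's simulations.
import math
import random

def weighted_auroc(scores, labels, weights):
    """AUROC as the weighted probability that a positive outranks a negative
    (ties count one half)."""
    num = denom = 0.0
    for s1, y1, w1 in zip(scores, labels, weights):
        if y1 != 1:
            continue
        for s0, y0, w0 in zip(scores, labels, weights):
            if y0 != 0:
                continue
            pair = w1 * w0
            denom += pair
            num += pair * (1.0 if s1 > s0 else 0.5 if s1 == s0 else 0.0)
    return num / denom

random.seed(7)
pop = []
for _ in range(1500):
    x = random.gauss(0.0, 1.0)                                 # patient feature
    y = 1 if random.random() < 1 / (1 + math.exp(-x)) else 0   # true label
    s = 1 / (1 + math.exp(-(x + random.gauss(0.0, 0.7))))      # model score
    p_sel = 0.9 if s > 0.5 else 0.15          # label observed more often when flagged
    pop.append((s, y, p_sel, random.random() < p_sel))

full = weighted_auroc([s for s, *_ in pop], [y for _, y, *_ in pop], [1.0] * len(pop))
obs = [(s, y, p) for s, y, p, sel in pop if sel]
naive = weighted_auroc([s for s, _, _ in obs], [y for _, y, _ in obs], [1.0] * len(obs))
ipw = weighted_auroc([s for s, _, _ in obs], [y for _, y, _ in obs],
                     [1.0 / p for _, _, p in obs])  # inverse-probability weights
print(f"full-population AUROC {full:.3f}  naive {naive:.3f}  IPW {ipw:.3f}")
```

The same pairwise estimator serves all three quantities; only the weights change, which is why correctly specified selection probabilities are the crux of the approach.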
Related papers
- Ecosystem-level Analysis of Deployed Machine Learning Reveals Homogeneous Outcomes [72.13373216644021]
We study the societal impact of machine learning by considering the collection of models that are deployed in a given context.
We find deployed machine learning is prone to systemic failure, meaning some users are exclusively misclassified by all models available.
These examples demonstrate ecosystem-level analysis has unique strengths for characterizing the societal impact of machine learning.
arXiv Detail & Related papers (2023-07-12T01:11:52Z)
- In Search of Insights, Not Magic Bullets: Towards Demystification of the Model Selection Dilemma in Heterogeneous Treatment Effect Estimation [92.51773744318119]
This paper empirically investigates the strengths and weaknesses of different model selection criteria.
We highlight that there is a complex interplay between selection strategies, candidate estimators and the data used for comparing them.
arXiv Detail & Related papers (2023-02-06T16:55:37Z)
- Ensemble Method for Estimating Individualized Treatment Effects [15.775032675243995]
We propose an algorithm for aggregating the estimates from a diverse library of models.
We compare ensembling to model selection on 43 benchmark datasets, and find that ensembling wins almost every time.
arXiv Detail & Related papers (2022-02-25T00:44:37Z)
- Assessment of contextualised representations in detecting outcome phrases in clinical trials [14.584741378279316]
We introduce "EBM-COMET", a dataset in which 300 PubMed abstracts are expertly annotated for clinical outcomes.
To extract outcomes, we fine-tune a variety of pre-trained contextualized representations.
We observe that our best model (BioBERT) achieves 81.5% F1, 81.3% sensitivity, and 98.0% specificity.
arXiv Detail & Related papers (2022-02-13T15:08:00Z)
- Assessing Social Determinants-Related Performance Bias of Machine Learning Models: A case of Hyperchloremia Prediction in ICU Population [6.8473641147443995]
We evaluated four classifiers built to predict hyperchloremia, a condition that often results from aggressive fluid administration in the ICU population.
We observed that adding social determinants features in addition to the lab-based ones improved model performance on all patients.
We urge future researchers to design models that proactively adjust for potential biases and include subgroup reporting.
arXiv Detail & Related papers (2021-11-18T03:58:50Z)
- EventScore: An Automated Real-time Early Warning Score for Clinical Events [3.3039612529376625]
We build an interpretable model for the early prediction of various adverse clinical events indicative of clinical deterioration.
The model is evaluated on two datasets and four clinical events.
Our model can be entirely automated without requiring any manually recorded features.
arXiv Detail & Related papers (2021-02-11T11:55:08Z)
- Double machine learning for sample selection models [0.12891210250935145]
This paper considers the evaluation of discretely distributed treatments when outcomes are only observed for a subpopulation due to sample selection or outcome attrition.
We make use of (a) Neyman-orthogonal, doubly robust, and efficient score functions, which imply the robustness of treatment effect estimation to moderate regularization biases in the machine learning-based estimation of the outcome, treatment, or sample selection models and (b) sample splitting (or cross-fitting) to prevent overfitting bias.
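The sample splitting (cross-fitting) mentioned in this summary can be sketched minimally (an illustration of the general idea, not that paper's estimator): nuisance models are fit on all folds except one and evaluated on the held-out fold, so no observation's nuisance prediction uses its own data. The toy "nuisance model" below is just a training-set mean, an assumption for the sake of the example.

```python
# Minimal cross-fitting sketch (illustrative): each observation's nuisance
# prediction comes only from the other folds, preventing overfitting bias
# in the downstream estimate.
import random

random.seed(1)
data = [(random.gauss(0.0, 1.0),) for _ in range(12)]  # toy sample
K = 3
folds = [data[i::K] for i in range(K)]                 # round-robin split

def fit_mean(train):
    """Stand-in nuisance model: just the training-set mean."""
    vals = [x for (x,) in train]
    return sum(vals) / len(vals)

cross_fitted = []
for k in range(K):
    train = [row for j, fold in enumerate(folds) if j != k for row in fold]
    mu_hat = fit_mean(train)                    # nuisance fit without fold k
    cross_fitted += [(x, mu_hat) for (x,) in folds[k]]

# Downstream estimate uses only out-of-fold nuisance predictions.
score = sum(x - mu for x, mu in cross_fitted) / len(cross_fitted)
print(f"cross-fitted mean residual: {score:.4f}")
```

In the double machine learning setting the stand-in mean would be replaced by machine-learning fits of the outcome, treatment, and selection models, combined through a Neyman-orthogonal score.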
arXiv Detail & Related papers (2020-11-30T19:40:21Z)
- LOGAN: Local Group Bias Detection by Clustering [86.38331353310114]
We argue that evaluating bias at the corpus level is not enough for understanding how biases are embedded in a model.
We propose LOGAN, a new bias detection technique based on clustering.
Experiments on toxicity classification and object classification tasks show that LOGAN identifies bias in a local region.
arXiv Detail & Related papers (2020-10-06T16:42:51Z)
- Understanding Classifier Mistakes with Generative Models [88.20470690631372]
Deep neural networks are effective on supervised learning tasks, but have been shown to be brittle.
In this paper, we leverage generative models to identify and characterize instances where classifiers fail to generalize.
Our approach is agnostic to class labels from the training set which makes it applicable to models trained in a semi-supervised way.
arXiv Detail & Related papers (2020-10-05T22:13:21Z) - Performance metrics for intervention-triggering prediction models do not
reflect an expected reduction in outcomes from using the model [71.9860741092209]
Clinical researchers often select among and evaluate risk prediction models.
Standard metrics calculated from retrospective data are only related to model utility under certain assumptions.
When predictions are delivered repeatedly throughout time, the relationship between standard metrics and utility is further complicated.
arXiv Detail & Related papers (2020-06-02T16:26:49Z) - Hemogram Data as a Tool for Decision-making in COVID-19 Management:
Applications to Resource Scarcity Scenarios [62.997667081978825]
The COVID-19 pandemic has challenged emergency response systems worldwide, with widespread reports of essential-service breakdowns and the collapse of health care structures.
This work describes a machine learning model derived from hemogram exam data performed in symptomatic patients.
Proposed models can predict COVID-19 qRT-PCR results in symptomatic individuals with high accuracy, sensitivity and specificity.
arXiv Detail & Related papers (2020-05-10T01:45:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.