Disparate Censorship & Undertesting: A Source of Label Bias in Clinical Machine Learning
- URL: http://arxiv.org/abs/2208.01127v1
- Date: Mon, 1 Aug 2022 20:15:31 GMT
- Title: Disparate Censorship & Undertesting: A Source of Label Bias in Clinical Machine Learning
- Authors: Trenton Chang, Michael W. Sjoding, Jenna Wiens
- Abstract summary: Disparate censorship in patients of equivalent risk leads to undertesting in certain groups, and in turn, more biased labels for such groups.
Our findings call attention to disparate censorship as a source of label bias in clinical ML models.
- Score: 14.133370438685969
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As machine learning (ML) models gain traction in clinical applications,
understanding the impact of clinician and societal biases on ML models is
increasingly important. While biases can arise in the labels used for model
training, the many sources from which these biases arise are not yet
well-studied. In this paper, we highlight disparate censorship (i.e.,
differences in testing rates across patient groups) as a source of label bias
that clinical ML models may amplify, potentially causing harm. Many patient
risk-stratification models are trained using the results of clinician-ordered
diagnostic and laboratory tests as labels. Patients without test results are
often assigned a negative label, which assumes that untested patients do not
experience the outcome. Since orders are affected by clinical and resource
considerations, testing may not be uniform in patient populations, giving rise
to disparate censorship. Disparate censorship in patients of equivalent risk
leads to undertesting in certain groups, and in turn, more biased labels for
such groups. Using such biased labels in standard ML pipelines could contribute
to gaps in model performance across patient groups. Here, we theoretically and
empirically characterize conditions in which disparate censorship or
undertesting affect model performance across subgroups. Our findings call
attention to disparate censorship as a source of label bias in clinical ML
models.
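The labeling mechanism described in the abstract can be illustrated with a small simulation. This is a minimal sketch under simplifying assumptions that are not taken from the paper: two hypothetical groups "A" and "B" with the same outcome prevalence (equivalent risk), testing that is random within each group, a perfectly accurate test, and made-up testing rates of 0.8 and 0.4. The function names `simulate_group` and `missed_fraction` are illustrative only.

```python
import random


def simulate_group(n, prevalence, test_rate, seed):
    """Simulate one patient group; return (true_positives, observed_positives).

    Untested patients receive a negative label, mirroring the labeling
    practice described in the abstract. The test is assumed to be perfect.
    """
    rng = random.Random(seed)
    true_pos = 0
    observed_pos = 0
    for _ in range(n):
        has_outcome = rng.random() < prevalence
        tested = rng.random() < test_rate
        true_pos += has_outcome
        # A positive label is recorded only when the patient was tested
        # and actually has the outcome.
        observed_pos += has_outcome and tested
    return true_pos, observed_pos


def missed_fraction(true_pos, observed_pos):
    """Fraction of truly positive patients whose label is negative."""
    return 1 - observed_pos / true_pos if true_pos else 0.0


# Hypothetical testing rates: group B is undertested relative to group A.
test_rates = {"A": 0.8, "B": 0.4}
results = {}
for seed, (group, rate) in enumerate(test_rates.items()):
    tp, op = simulate_group(n=20000, prevalence=0.2, test_rate=rate, seed=seed)
    results[group] = missed_fraction(tp, op)
    print(f"group {group}: test rate {rate:.1f}, "
          f"positives mislabeled as negative: {results[group]:.1%}")
```

Under these assumptions the share of mislabeled positives in each group is roughly one minus that group's testing rate, so the undertested group ends up with noisier labels even though both groups carry the same underlying risk.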
Related papers
- Debias-CLR: A Contrastive Learning Based Debiasing Method for Algorithmic Fairness in Healthcare Applications [0.17624347338410748]
We proposed an implicit in-processing debiasing method to combat disparate treatment.
We used clinical notes of heart failure patients, together with their diagnostic codes, procedure reports and physiological vitals.
We found that Debias-CLR was able to reduce the Single-Category Word Embedding Association Test (SC-WEAT) effect size score when debiasing for gender and ethnicity.
arXiv Detail & Related papers (2024-11-15T19:32:01Z)
- From Biased Selective Labels to Pseudo-Labels: An Expectation-Maximization Framework for Learning from Biased Decisions [9.440055827786596]
We study a clinically-inspired selective label problem called disparate censorship.
Disparate Censorship Expectation-Maximization (DCEM) is an algorithm for learning in the presence of such censorship.
arXiv Detail & Related papers (2024-06-27T03:33:38Z)
- How Does Pruning Impact Long-Tailed Multi-Label Medical Image Classifiers? [49.35105290167996]
Pruning has emerged as a powerful technique for compressing deep neural networks, reducing memory usage and inference time without significantly affecting overall performance.
This work represents a first step toward understanding the impact of pruning on model behavior in deep long-tailed, multi-label medical image classification.
arXiv Detail & Related papers (2023-08-17T20:40:30Z)
- Avoiding Biased Clinical Machine Learning Model Performance Estimates in the Presence of Label Selection [3.3944964838781093]
We describe three classes of label selection and simulate five causally distinct scenarios to assess how particular selection mechanisms bias a suite of commonly reported binary machine learning model performance metrics.
We find that naive estimates of AUROC on the observed population undershoot actual performance by up to 20%.
Such a disparity could be large enough to lead to the wrongful termination of a successful clinical decision support tool.
arXiv Detail & Related papers (2022-09-15T22:30:14Z)
- Write It Like You See It: Detectable Differences in Clinical Notes By Race Lead To Differential Model Recommendations [15.535251319178379]
We investigate the level of implicit race information available to machine learning models and human experts.
We find that models can identify patient self-reported race from clinical notes even when the notes are stripped of explicit indicators of race.
We show that models trained on these race-redacted clinical notes can still perpetuate existing biases in clinical treatment decisions.
arXiv Detail & Related papers (2022-05-08T18:24:11Z)
- What Do You See in this Patient? Behavioral Testing of Clinical NLP Models [69.09570726777817]
We introduce an extendable testing framework that evaluates the behavior of clinical outcome models regarding changes of the input.
We show that model behavior varies drastically even when fine-tuned on the same data and that allegedly best-performing models have not always learned the most medically plausible patterns.
arXiv Detail & Related papers (2021-11-30T15:52:04Z)
- Algorithmic encoding of protected characteristics and its implications on disparities across subgroups [17.415882865534638]
Machine learning models may pick up undesirable correlations between a patient's racial identity and clinical outcome.
Very little is known about how these biases are encoded and how one may reduce or even remove disparate performance.
arXiv Detail & Related papers (2021-10-27T20:30:57Z)
- LOGAN: Local Group Bias Detection by Clustering [86.38331353310114]
We argue that evaluating bias at the corpus level is not enough for understanding how biases are embedded in a model.
We propose LOGAN, a new bias detection technique based on clustering.
Experiments on toxicity classification and object classification tasks show that LOGAN identifies bias in a local region.
arXiv Detail & Related papers (2020-10-06T16:42:51Z)
- Understanding Classifier Mistakes with Generative Models [88.20470690631372]
Deep neural networks are effective on supervised learning tasks, but have been shown to be brittle.
In this paper, we leverage generative models to identify and characterize instances where classifiers fail to generalize.
Our approach is agnostic to class labels from the training set which makes it applicable to models trained in a semi-supervised way.
arXiv Detail & Related papers (2020-10-05T22:13:21Z)
- Hemogram Data as a Tool for Decision-making in COVID-19 Management: Applications to Resource Scarcity Scenarios [62.997667081978825]
The COVID-19 pandemic has challenged emergency response systems worldwide, with widespread reports of essential service breakdowns and the collapse of health care structures.
This work describes a machine learning model derived from hemogram exam data performed in symptomatic patients.
The proposed models can predict COVID-19 qRT-PCR results in symptomatic individuals with high accuracy, sensitivity and specificity.
arXiv Detail & Related papers (2020-05-10T01:45:03Z)
- Noisy Adaptive Group Testing using Bayesian Sequential Experimental Design [63.48989885374238]
When the infection prevalence of a disease is low, Dorfman showed 80 years ago that testing groups of people can prove more efficient than testing people individually.
Our goal in this paper is to propose new group testing algorithms that can operate in a noisy setting.
arXiv Detail & Related papers (2020-04-26T23:41:33Z)