Sample Selection Bias in Machine Learning for Healthcare
- URL: http://arxiv.org/abs/2405.07841v2
- Date: Tue, 26 Nov 2024 21:13:05 GMT
- Title: Sample Selection Bias in Machine Learning for Healthcare
- Authors: Vinod Kumar Chauhan, Lei Clifton, Achille Salaün, Huiqi Yvonne Lu, Kim Branson, Patrick Schwab, Gaurav Nigam, David A. Clifton
- Abstract summary: We focus on sample selection bias (SSB), a specific type of bias where the study population is less representative of the target population.
Existing machine learning techniques try to correct the bias mostly by balancing distributions between the study and the target populations.
We propose a new research direction for addressing SSB, based on the target population identification rather than the bias correction.
- Score: 17.549969100454803
- Abstract: While machine learning algorithms hold promise for personalised medicine, their clinical adoption remains limited, partly due to biases that can compromise the reliability of predictions. In this paper, we focus on sample selection bias (SSB), a specific type of bias where the study population is less representative of the target population, leading to biased and potentially harmful decisions. Despite being well-known in the literature, SSB remains scarcely studied in machine learning for healthcare. Moreover, the existing machine learning techniques try to correct the bias mostly by balancing distributions between the study and the target populations, which may result in a loss of predictive performance. To address these problems, our study illustrates the potential risks associated with SSB by examining SSB's impact on the performance of machine learning algorithms. Most importantly, we propose a new research direction for addressing SSB, based on the target population identification rather than the bias correction. Specifically, we propose two independent networks (T-Net) and a multitasking network (MT-Net) for addressing SSB, where one network/task identifies the target subpopulation which is representative of the study population and the second makes predictions for the identified subpopulation. Our empirical results with synthetic and semi-synthetic datasets highlight that SSB can lead to a large drop in the performance of an algorithm for the target population as compared with the study population, as well as a substantial difference in the performance for the target subpopulations that are representative of the selected and the non-selected patients from the study population. Furthermore, our proposed techniques demonstrate robustness across various settings, including different dataset sizes, event rates, and selection rates, outperforming the existing bias correction techniques.
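The two-model idea described in the abstract (one model identifies the subpopulation that resembles the study population, a second predicts only for that subpopulation) can be illustrated with a minimal sketch. This is not the authors' T-Net implementation: it swaps the neural networks for one-feature logistic regressions, and the synthetic data, the function names, and the 0.5 threshold are all assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, lr=0.5, epochs=300):
    """Fit a one-feature logistic regression (weight, bias) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def predict_proba(model, x):
    w, b = model
    return sigmoid(w * x + b)


def two_stage_predict(selection_model, outcome_model, x, threshold=0.5):
    """Predict only for inputs that look like the study population; otherwise return None."""
    if predict_proba(selection_model, x) < threshold:
        return None  # outside the identified subpopulation: abstain
    return predict_proba(outcome_model, x)


random.seed(0)

# Hypothetical synthetic cohort: patients with larger x are more likely to be
# selected into the study, so the study population is not representative.
xs = [random.gauss(0.0, 1.0) for _ in range(2000)]
selected = [1 if random.random() < sigmoid(3.0 * x) else 0 for x in xs]
outcome = [1 if random.random() < sigmoid(2.0 * x - 1.0) else 0 for x in xs]

# Model 1: learns who gets selected, using every patient.
selection_model = fit_logistic(xs, selected)

# Model 2: learns the outcome, using only selected patients (the only ones with labels).
study_x = [x for x, s in zip(xs, selected) if s == 1]
study_y = [y for y, s in zip(outcome, selected) if s == 1]
outcome_model = fit_logistic(study_x, study_y)

for x in (-3.0, 3.0):
    print(x, two_stage_predict(selection_model, outcome_model, x))
```

In this sketch the abstention rule is a hard threshold on the selection probability; how the paper's networks couple the two tasks (especially in MT-Net) is not specified in the abstract and is not modelled here.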
Related papers
- Unsupervised Search for Ethnic Minorities' Medical Segmentation Training Set [5.880582406602758]
This article investigates the critical issue of dataset bias in medical imaging, with a particular emphasis on racial disparities.
Our analysis reveals that medical segmentation datasets are significantly biased, primarily influenced by the demographic composition of their collection sites.
We propose a novel training set search strategy aimed at reducing these biases by focusing on underrepresented racial groups.
arXiv Detail & Related papers (2025-01-05T05:04:47Z) - Metric-DST: Mitigating Selection Bias Through Diversity-Guided Semi-Supervised Metric Learning [0.0]
Semi-supervised learning strategies like self-training can mitigate selection bias by incorporating unlabeled data into model training.
We propose Metric-DST, a diversity-guided self-training strategy that leverages metric learning and its implicit embedding space to counter confidence-based bias.
arXiv Detail & Related papers (2024-11-27T15:29:42Z) - Debias-CLR: A Contrastive Learning Based Debiasing Method for Algorithmic Fairness in Healthcare Applications [0.17624347338410748]
We proposed an implicit in-processing debiasing method to combat disparate treatment.
We used clinical notes of heart failure patients and used diagnostic codes, procedure reports and physiological vitals of the patients.
We found that Debias-CLR was able to reduce the Single-Category Word Embedding Association Test (SC-WEAT) effect size score when debiasing for gender and ethnicity.
arXiv Detail & Related papers (2024-11-15T19:32:01Z) - ASPEST: Bridging the Gap Between Active Learning and Selective Prediction [56.001808843574395]
Selective prediction aims to learn a reliable model that abstains from making predictions when uncertain.
Active learning aims to lower the overall labeling effort, and hence human dependence, by querying the most informative examples.
In this work, we introduce a new learning paradigm, active selective prediction, which aims to query more informative samples from the shifted target domain.
arXiv Detail & Related papers (2023-04-07T23:51:07Z) - Delving into Identify-Emphasize Paradigm for Combating Unknown Bias [52.76758938921129]
We propose an effective bias-conflicting scoring method (ECS) to boost the identification accuracy.
We also propose gradient alignment (GA) to balance the contributions of the mined bias-aligned and bias-conflicting samples.
Experiments are conducted on multiple datasets in various settings, demonstrating that the proposed solution can mitigate the impact of unknown biases.
arXiv Detail & Related papers (2023-02-22T14:50:24Z) - D-BIAS: A Causality-Based Human-in-the-Loop System for Tackling Algorithmic Bias [57.87117733071416]
We propose D-BIAS, a visual interactive tool that embodies a human-in-the-loop AI approach for auditing and mitigating social biases.
A user can detect the presence of bias against a group by identifying unfair causal relationships in the causal network.
For each interaction, say weakening/deleting a biased causal edge, the system uses a novel method to simulate a new (debiased) dataset.
arXiv Detail & Related papers (2022-08-10T03:41:48Z) - Targeted Optimal Treatment Regime Learning Using Summary Statistics [12.767669486030352]
We consider an ITR estimation problem where the source and target populations may be heterogeneous.
We develop a weighting framework that tailors an ITR for a given target population by leveraging the available summary statistics.
Specifically, we propose a calibrated augmented inverse probability weighted estimator of the value function for the target population and estimate an optimal ITR.
arXiv Detail & Related papers (2022-01-17T06:11:31Z) - Statistical discrimination in learning agents [64.78141757063142]
Statistical discrimination emerges in agent policies as a function of both the bias in the training population and of agent architecture.
We show that less discrimination emerges with agents that use recurrent neural networks, and when their training environment has less bias.
arXiv Detail & Related papers (2021-10-21T18:28:57Z) - Targeting Underrepresented Populations in Precision Medicine: A Federated Transfer Learning Approach [7.467496975496821]
We propose a two-way data integration strategy that integrates heterogeneous data from diverse populations and from multiple healthcare institutions.
We show that the proposed method improves the estimation and prediction accuracy in underrepresented populations, and reduces the gap of model performance across populations.
arXiv Detail & Related papers (2021-08-27T04:04:34Z) - Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z) - Balancing Biases and Preserving Privacy on Balanced Faces in the Wild [50.915684171879036]
There are demographic biases present in current facial recognition (FR) models.
We introduce our Balanced Faces in the Wild dataset to measure these biases across different ethnic and gender subgroups.
We find that relying on a single score threshold to differentiate between genuine and imposter sample pairs leads to suboptimal results.
We propose a novel domain adaptation learning scheme that uses facial features extracted from state-of-the-art neural networks.
arXiv Detail & Related papers (2021-03-16T15:05:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.