Sample Selection Bias in Machine Learning for Healthcare
- URL: http://arxiv.org/abs/2405.07841v1
- Date: Mon, 13 May 2024 15:30:35 GMT
- Title: Sample Selection Bias in Machine Learning for Healthcare
- Authors: Vinod Kumar Chauhan, Lei Clifton, Achille Salaün, Huiqi Yvonne Lu, Kim Branson, Patrick Schwab, Gaurav Nigam, David A. Clifton,
- Abstract summary: sample selection bias ( SSB) refers to the study population being less representative of the target population, leading to biased and potentially harmful decisions.
Despite being well-known in the literature, SSB remains scarcely studied in machine learning for healthcare.
We propose a new research direction for addressing SSB, based on the target population identification rather than the bias correction.
- Score: 17.549969100454803
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While machine learning algorithms hold promise for personalised medicine, their clinical adoption remains limited. One critical factor contributing to this restraint is sample selection bias (SSB) which refers to the study population being less representative of the target population, leading to biased and potentially harmful decisions. Despite being well-known in the literature, SSB remains scarcely studied in machine learning for healthcare. Moreover, the existing techniques try to correct the bias by balancing distributions between the study and the target populations, which may result in a loss of predictive performance. To address these problems, our study illustrates the potential risks associated with SSB by examining SSB's impact on the performance of machine learning algorithms. Most importantly, we propose a new research direction for addressing SSB, based on the target population identification rather than the bias correction. Specifically, we propose two independent networks (T-Net) and a multitasking network (MT-Net) for addressing SSB, where one network/task identifies the target subpopulation which is representative of the study population and the second makes predictions for the identified subpopulation. Our empirical results with synthetic and semi-synthetic datasets highlight that SSB can lead to a large drop in the performance of an algorithm for the target population as compared with the study population, as well as a substantial difference in the performance for the target subpopulations that are representative of the selected and the non-selected patients from the study population. Furthermore, our proposed techniques demonstrate robustness across various settings, including different dataset sizes, event rates, and selection rates, outperforming the existing bias correction techniques.
Related papers
- Debias-CLR: A Contrastive Learning Based Debiasing Method for Algorithmic Fairness in Healthcare Applications [0.17624347338410748]
We proposed an implicit in-processing debiasing method to combat disparate treatment.
We used clinical notes of heart failure patients and used diagnostic codes, procedure reports and physiological vitals of the patients.
We found that Debias-CLR was able to reduce the Single-Category Word Embedding Association Test (SC-WEAT) effect size score when debiasing for gender and ethnicity.
arXiv Detail & Related papers (2024-11-15T19:32:01Z) - Stable Heterogeneous Treatment Effect Estimation across Out-of-Distribution Populations [27.163528362979594]
Heterogeneous treatment effect (HTE) estimation is vital for understanding the change of treatment effect across individuals or groups.
Most existing HTE estimation methods focus on addressing selection bias induced by imbalanced distributions of confounders between treated and control units.
In real-world applications, where population distributions are subject to continuous changes, there is an urgent need for stable HTE estimation across out-of-distribution populations.
arXiv Detail & Related papers (2024-07-03T13:03:51Z) - Who Are We Missing? A Principled Approach to Characterizing the Underrepresented Population [5.568543786710628]
We introduce an optimization-based approach, Rashomon Set of Optimal Trees (ROOT), to characterize underrepresented groups.
ROOT optimize the target subpopulation distribution by minimizing the variance of the target average treatment effect estimate.
Our framework offers a systematic approach to enhance decision-making accuracy and inform future trials in diverse populations.
arXiv Detail & Related papers (2024-01-25T21:11:35Z) - Delving into Identify-Emphasize Paradigm for Combating Unknown Bias [52.76758938921129]
We propose an effective bias-conflicting scoring method (ECS) to boost the identification accuracy.
We also propose gradient alignment (GA) to balance the contributions of the mined bias-aligned and bias-conflicting samples.
Experiments are conducted on multiple datasets in various settings, demonstrating that the proposed solution can mitigate the impact of unknown biases.
arXiv Detail & Related papers (2023-02-22T14:50:24Z) - Adaptive Identification of Populations with Treatment Benefit in
Clinical Trials: Machine Learning Challenges and Solutions [78.31410227443102]
We study the problem of adaptively identifying patient subpopulations that benefit from a given treatment during a confirmatory clinical trial.
We propose AdaGGI and AdaGCPI, two meta-algorithms for subpopulation construction.
arXiv Detail & Related papers (2022-08-11T14:27:49Z) - Targeted Optimal Treatment Regime Learning Using Summary Statistics [12.767669486030352]
We consider an ITR estimation problem where the source and target populations may be heterogeneous.
We develop a weighting framework that tailors an ITR for a given target population by leveraging the available summary statistics.
Specifically, we propose a calibrated augmented inverse probability weighted estimator of the value function for the target population and estimate an optimal ITR.
arXiv Detail & Related papers (2022-01-17T06:11:31Z) - Statistical discrimination in learning agents [64.78141757063142]
Statistical discrimination emerges in agent policies as a function of both the bias in the training population and of agent architecture.
We show that less discrimination emerges with agents that use recurrent neural networks, and when their training environment has less bias.
arXiv Detail & Related papers (2021-10-21T18:28:57Z) - Targeting Underrepresented Populations in Precision Medicine: A
Federated Transfer Learning Approach [7.467496975496821]
We propose a two-way data integration strategy that integrates heterogeneous data from diverse populations and from multiple healthcare institutions.
We show that the proposed method improves the estimation and prediction accuracy in underrepresented populations, and reduces the gap of model performance across populations.
arXiv Detail & Related papers (2021-08-27T04:04:34Z) - Bootstrapping Your Own Positive Sample: Contrastive Learning With
Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z) - Balancing Biases and Preserving Privacy on Balanced Faces in the Wild [50.915684171879036]
There are demographic biases present in current facial recognition (FR) models.
We introduce our Balanced Faces in the Wild dataset to measure these biases across different ethnic and gender subgroups.
We find that relying on a single score threshold to differentiate between genuine and imposters sample pairs leads to suboptimal results.
We propose a novel domain adaptation learning scheme that uses facial features extracted from state-of-the-art neural networks.
arXiv Detail & Related papers (2021-03-16T15:05:49Z) - Predictive Modeling of ICU Healthcare-Associated Infections from
Imbalanced Data. Using Ensembles and a Clustering-Based Undersampling
Approach [55.41644538483948]
This work is focused on both the identification of risk factors and the prediction of healthcare-associated infections in intensive-care units.
The aim is to support decision making addressed at reducing the incidence rate of infections.
arXiv Detail & Related papers (2020-05-07T16:13:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.