A method for comparing multiple imputation techniques: a case study on
the U.S. National COVID Cohort Collaborative
- URL: http://arxiv.org/abs/2206.06444v2
- Date: Sun, 25 Sep 2022 04:56:22 GMT
- Title: A method for comparing multiple imputation techniques: a case study on
the U.S. National COVID Cohort Collaborative
- Authors: Elena Casiraghi, Rachel Wong, Margaret Hall, Ben Coleman, Marco
Notaro, Michael D. Evans, Jena S. Tronieri, Hannah Blau, Bryan Laraway,
Tiffany J. Callahan, Lauren E. Chan, Carolyn T. Bramante, John B. Buse,
Richard A. Moffitt, Til Sturmer, Steven G. Johnson, Yu Raymond Shao, Justin
Reese, Peter N. Robinson, Alberto Paccanaro, Giorgio Valentini, Jared D.
Huling and Kenneth Wilkins (on behalf of the N3C Consortium): Tell Bennet,
Christopher Chute, Peter DeWitt, Kenneth Gersing, Andrew Girvin, Melissa
Haendel, Jeremy Harper, Janos Hajagos, Stephanie Hong, Emily Pfaff, Jane
Reusch, Corneliu Antoniescu, Kimberly Robaski
- Abstract summary: We numerically evaluate strategies for handling missing data in the context of statistical analysis.
Our approach could effectively highlight the most valid and performant missing-data handling strategy.
- Score: 1.259457977936316
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Healthcare datasets obtained from Electronic Health Records have proven to be
extremely useful to assess associations between patients' predictors and
outcomes of interest. However, these datasets often suffer from missing values
in a high proportion of cases and the simple removal of these cases may
introduce severe bias. For these reasons, several multiple imputation
algorithms have been proposed to attempt to recover the missing information.
Each algorithm presents strengths and weaknesses, and there is currently no
consensus on which multiple imputation algorithm works best in a given
scenario. Furthermore, selecting each algorithm's parameters and making
data-related modelling choices are both crucial and challenging. In this
paper, we propose a novel framework to numerically evaluate strategies for
handling missing data in the context of statistical analysis, with a particular
focus on multiple imputation techniques. We demonstrate the feasibility of our
approach on a large cohort of type-2 diabetes patients provided by the National
COVID Cohort Collaborative (N3C) Enclave, where we explored the influence of
various patient characteristics on outcomes related to COVID-19. Our analysis
included classic multiple imputation techniques as well as simple complete-case
Inverse Probability Weighted models. The experiments presented here show that
our approach could effectively highlight the most valid and performant
missing-data handling strategy for our case study. Moreover, our methodology
allowed us to gain an understanding of the behavior of the different models and
of how it changed as we modified their parameters. Our method is general and
can be applied to different research fields and on datasets containing
heterogeneous types.
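As a rough illustration of the multiple imputation workflow the abstract refers to, the sketch below imputes a partially missing variable several times, runs the same analysis on each completed dataset, and pools the results with Rubin's rules. It is not the authors' evaluation framework: the synthetic data, the helper names `impute_once` and `pool_rubin`, and the choice of bootstrap-based stochastic regression imputation are all assumptions made for this example.

```python
import numpy as np


def impute_once(x, y, rng):
    """Fill missing y values by stochastic regression on x (hypothetical helper).

    The regression is fitted on a bootstrap resample of the observed cases so
    that each imputation also reflects uncertainty in the fitted coefficients.
    """
    obs = ~np.isnan(y)
    idx = rng.choice(np.flatnonzero(obs), size=obs.sum(), replace=True)
    slope, intercept = np.polyfit(x[idx], y[idx], 1)
    resid = y[idx] - (slope * x[idx] + intercept)
    resid_sd = np.std(resid, ddof=2)
    y_imp = y.copy()
    n_mis = int((~obs).sum())
    # Add residual noise so imputed values are not artificially precise.
    y_imp[~obs] = slope * x[~obs] + intercept + rng.normal(0.0, resid_sd, n_mis)
    return y_imp


def pool_rubin(estimates, variances):
    """Pool m estimates and their variances with Rubin's rules."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()            # pooled point estimate
    within = variances.mean()           # within-imputation variance
    between = estimates.var(ddof=1)     # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, total


# Synthetic data (assumption): y depends on x, and y is more likely to be
# missing when x is large, i.e. missing at random given x.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y_full = 2.0 + 1.5 * x + rng.normal(size=n)
p_miss = 1.0 / (1.0 + np.exp(-x))
y_obs = np.where(rng.random(n) < p_miss, np.nan, y_full)

# Analysis of interest (assumption): the mean of y and its sampling variance.
m = 20
ests, variances = [], []
for _ in range(m):
    y_imp = impute_once(x, y_obs, rng)
    ests.append(y_imp.mean())
    variances.append(y_imp.var(ddof=1) / n)

pooled_mean, total_var = pool_rubin(ests, variances)
complete_case_mean = np.nanmean(y_obs)  # naive estimate that drops missing rows
print(f"complete-case mean: {complete_case_mean:.3f}")
print(f"pooled MI mean: {pooled_mean:.3f} (SE {np.sqrt(total_var):.3f})")
```

Comparing the complete-case mean with the pooled estimate on data where the true mean is known gives a small-scale picture of how competing missing-data strategies could be scored against each other.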
Related papers
- Few-shot learning for COVID-19 Chest X-Ray Classification with
Imbalanced Data: An Inter vs. Intra Domain Study [49.5374512525016]
Medical image datasets are essential for training models used in computer-aided diagnosis, treatment planning, and medical research.
Some challenges are associated with these datasets, including variability in data distribution, data scarcity, and transfer learning issues when using models pre-trained from generic images.
We propose a methodology based on Siamese neural networks in which a series of techniques are integrated to mitigate the effects of data scarcity and distribution imbalance.
arXiv Detail & Related papers (2024-01-18T16:59:27Z)
- Counterfactual Data Augmentation with Contrastive Learning [27.28511396131235]
We introduce a model-agnostic data augmentation method that imputes the counterfactual outcomes for a selected subset of individuals.
We use contrastive learning to learn a representation space and a similarity measure such that in the learned representation space close individuals identified by the learned similarity measure have similar potential outcomes.
This property ensures reliable imputation of counterfactual outcomes for the individuals with close neighbors from the alternative treatment group.
arXiv Detail & Related papers (2023-11-07T00:36:51Z)
- Multi-objective optimization determines when, which and how to fuse deep
networks: an application to predict COVID-19 outcomes [1.8351254916713304]
We present a novel approach to optimize the setup of a multimodal end-to-end model.
We test our method on the AIforCOVID dataset, attaining state-of-the-art results.
arXiv Detail & Related papers (2022-04-07T23:07:33Z)
- Selecting the suitable resampling strategy for imbalanced data
classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- Explaining medical AI performance disparities across sites with
confounder Shapley value analysis [8.785345834486057]
Multi-site evaluations are key to diagnosing such disparities.
Our framework provides a method for quantifying the marginal and cumulative effect of each type of bias on the overall performance difference.
We demonstrate its usefulness in a case study of a deep learning model trained to detect the presence of pneumothorax.
arXiv Detail & Related papers (2021-11-12T18:54:10Z)
- Lung Cancer Risk Estimation with Incomplete Data: A Joint Missing
Imputation Perspective [5.64530854079352]
We address imputation of missing data by modeling the joint distribution of multi-modal data.
Motivated by partial bidirectional generative adversarial net (PBiGAN), we propose a new Conditional PBiGAN (C-PBiGAN) method.
C-PBiGAN achieves significant improvements in lung cancer risk estimation compared with representative imputation methods.
arXiv Detail & Related papers (2021-07-25T20:15:16Z)
- Bootstrapping Your Own Positive Sample: Contrastive Learning With
Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
- Adversarial Sample Enhanced Domain Adaptation: A Case Study on
Predictive Modeling with Electronic Health Records [57.75125067744978]
We propose a data augmentation method to facilitate domain adaptation.
Adversarially generated samples are used during domain adaptation.
Results confirm the effectiveness of our method and the generality on different tasks.
arXiv Detail & Related papers (2021-01-13T03:20:20Z)
- Mixture Model Framework for Traumatic Brain Injury Prognosis Using
Heterogeneous Clinical and Outcome Data [3.7363119896212478]
We develop a method for modeling large heterogeneous data types relevant to TBI.
The model is trained on a dataset encompassing a variety of data types, including demographics, blood-based biomarkers, and imaging findings.
It is used to stratify patients into distinct groups in an unsupervised learning setting.
arXiv Detail & Related papers (2020-12-22T19:31:03Z)
- Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype
Prediction [55.94378672172967]
We focus on the few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, called Prototypical Network, which is a simple yet effective meta learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z)
- Predictive Modeling of ICU Healthcare-Associated Infections from
Imbalanced Data. Using Ensembles and a Clustering-Based Undersampling
Approach [55.41644538483948]
This work is focused on both the identification of risk factors and the prediction of healthcare-associated infections in intensive-care units.
The aim is to support decision making addressed at reducing the incidence rate of infections.
arXiv Detail & Related papers (2020-05-07T16:13:12Z)