Evaluation of data imputation strategies in complex, deeply-phenotyped
data sets: the case of the EU-AIMS Longitudinal European Autism Project
- URL: http://arxiv.org/abs/2201.09753v1
- Date: Thu, 20 Jan 2022 21:50:38 GMT
- Title: Evaluation of data imputation strategies in complex, deeply-phenotyped
data sets: the case of the EU-AIMS Longitudinal European Autism Project
- Authors: A. Llera, M. Brammer, B. Oakley, J. Tillmann, M. Zabihi, T. Mei, T.
Charman, C. Ecker, F. Dell Acqua, T. Banaschewski, C. Moessnang, S.
Baron-Cohen, R. Holt, S. Durston, D. Murphy, E. Loth, J. K. Buitelaar, D. L.
Floris, and C. F. Beckmann
- Abstract summary: We evaluate different imputation strategies to fill in missing values in clinical data from a large (total N=764) dataset.
We consider a total of 160 clinical measures divided in 15 overlapping subsets of participants.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An increasing number of large-scale multi-modal research initiatives has been
conducted in the typically developing population, as well as in psychiatric
cohorts. Missing data is a common problem in such datasets due to the
difficulty of assessing multiple measures on a large number of participants.
The consequences of missing data accumulate when researchers aim to explore
relationships between multiple measures. Here we aim to evaluate different
imputation strategies to fill in missing values in clinical data from a large
(total N=764) and deeply characterised (i.e. range of clinical and cognitive
instruments administered) sample of N=453 autistic individuals and N=311
control individuals recruited as part of the EU-AIMS Longitudinal European
Autism Project (LEAP) consortium. In particular we consider a total of 160
clinical measures divided in 15 overlapping subsets of participants. We use two
simple but common univariate strategies, mean and median imputation, as well as
a Round Robin regression approach involving four independent multivariate
regression models including a linear model, Bayesian Ridge regression, as well
as several non-linear models, Decision Trees, Extra Trees and K-Neighbours
regression. We evaluate the models using the traditional mean square error
towards removed available data, and consider in addition the KL divergence
between the observed and the imputed distributions. We show that all of the
multivariate approaches tested provide a substantial improvement compared to
typical univariate approaches. Further, our analyses reveal that across all 15
data-subsets tested, an Extra Trees regression approach provided the best
global results. This allows the selection of a unique model to impute missing
data for the LEAP project and deliver a fixed set of imputed clinical data to
be used by researchers working with the LEAP dataset in the future.
Related papers
- Bayesian Federated Inference for regression models based on non-shared multicenter data sets from heterogeneous populations [0.0]
In a regression model, the sample size must be large enough relative to the number of possible predictors.
Pooling data from different data sets collected in different (medical) centers would alleviate this problem, but is often not feasible due to privacy regulation or logistic problems.
An alternative route would be to analyze the local data in the centers separately and combine the statistical inference results with the Bayesian Federated Inference (BFI) methodology.
The aim of this approach is to compute from the inference results in separate centers what would have been found if the statistical analysis was performed on the combined data.
arXiv Detail & Related papers (2024-02-05T11:10:27Z) - An AI-Guided Data Centric Strategy to Detect and Mitigate Biases in
Healthcare Datasets [32.25265709333831]
We generate a data-centric, model-agnostic, task-agnostic approach to evaluate dataset bias by investigating the relationship between how easily different groups are learned at small sample sizes (AEquity)
We then apply a systematic analysis of AEq values across subpopulations to identify and manifestations of racial bias in two known cases in healthcare.
AEq is a novel and broadly applicable metric that can be applied to advance equity by diagnosing and remediating bias in healthcare datasets.
arXiv Detail & Related papers (2023-11-06T17:08:41Z) - DCID: Deep Canonical Information Decomposition [84.59396326810085]
We consider the problem of identifying the signal shared between two one-dimensional target variables.
We propose ICM, an evaluation metric which can be used in the presence of ground-truth labels.
We also propose Deep Canonical Information Decomposition (DCID) - a simple, yet effective approach for learning the shared variables.
arXiv Detail & Related papers (2023-06-27T16:59:06Z) - A method for comparing multiple imputation techniques: a case study on
the U.S. National COVID Cohort Collaborative [1.259457977936316]
We numerically evaluate strategies for handling missing data in the context of statistical analysis.
Our approach could effectively highlight the most valid and performant missing-data handling strategy.
arXiv Detail & Related papers (2022-06-13T19:49:54Z) - Equivariance Allows Handling Multiple Nuisance Variables When Analyzing
Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show how bringing recent results on equivariant representation learning instantiated on structured spaces together with simple use of classical results on causal inference provides an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z) - Lung Cancer Risk Estimation with Incomplete Data: A Joint Missing
Imputation Perspective [5.64530854079352]
We address imputation of missing data by modeling the joint distribution of multi-modal data.
Motivated by partial bidirectional generative adversarial net (PBiGAN), we propose a new Conditional PBiGAN (C-PBiGAN) method.
C-PBiGAN achieves significant improvements in lung cancer risk estimation compared with representative imputation methods.
arXiv Detail & Related papers (2021-07-25T20:15:16Z) - Bootstrapping Your Own Positive Sample: Contrastive Learning With
Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z) - Adversarial Sample Enhanced Domain Adaptation: A Case Study on
Predictive Modeling with Electronic Health Records [57.75125067744978]
We propose a data augmentation method to facilitate domain adaptation.
adversarially generated samples are used during domain adaptation.
Results confirm the effectiveness of our method and the generality on different tasks.
arXiv Detail & Related papers (2021-01-13T03:20:20Z) - Mixture Model Framework for Traumatic Brain Injury Prognosis Using
Heterogeneous Clinical and Outcome Data [3.7363119896212478]
We develop a method for modeling large heterogeneous data types relevant to TBI.
The model is trained on a dataset encompassing a variety of data types, including demographics, blood-based biomarkers, and imaging findings.
It is used to stratify patients into distinct groups in an unsupervised learning setting.
arXiv Detail & Related papers (2020-12-22T19:31:03Z) - Two-step penalised logistic regression for multi-omic data with an
application to cardiometabolic syndrome [62.997667081978825]
We implement a two-step approach to multi-omic logistic regression in which variable selection is performed on each layer separately.
Our approach should be preferred if the goal is to select as many relevant predictors as possible.
Our proposed approach allows us to identify features that characterise cardiometabolic syndrome at the molecular level.
arXiv Detail & Related papers (2020-08-01T10:36:27Z) - Predictive Modeling of ICU Healthcare-Associated Infections from
Imbalanced Data. Using Ensembles and a Clustering-Based Undersampling
Approach [55.41644538483948]
This work is focused on both the identification of risk factors and the prediction of healthcare-associated infections in intensive-care units.
The aim is to support decision making addressed at reducing the incidence rate of infections.
arXiv Detail & Related papers (2020-05-07T16:13:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.