Handling missing values in healthcare data: A systematic review of deep
learning-based imputation techniques
- URL: http://arxiv.org/abs/2210.08258v1
- Date: Sat, 15 Oct 2022 11:11:20 GMT
- Title: Handling missing values in healthcare data: A systematic review of deep
learning-based imputation techniques
- Authors: Mingxuan Liu, Siqi Li, Han Yuan, Marcus Eng Hock Ong, Yilin Ning, Feng
Xie, Seyed Ehsan Saffari, Victor Volovici, Bibhas Chakraborty, Nan Liu
- Abstract summary: The proper handling of missing values is critical to delivering reliable estimates and decisions.
The increasing diversity and complexity of data have led many researchers to develop deep learning (DL)-based imputation techniques.
- Score: 9.400097064676991
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Objective: The proper handling of missing values is critical to delivering
reliable estimates and decisions, especially in high-stakes fields such as
clinical research. The increasing diversity and complexity of data have led
many researchers to develop deep learning (DL)-based imputation techniques. We
conducted a systematic review to evaluate the use of these techniques, with a
particular focus on data types, aiming to assist healthcare researchers from
various disciplines in dealing with missing values.
Methods: We searched five databases (MEDLINE, Web of Science, Embase, CINAHL,
and Scopus) for articles published prior to August 2021 that applied DL-based
models to imputation. We assessed selected publications from four perspectives:
health data types, model backbone (i.e., main architecture), imputation
strategies, and comparison with non-DL-based methods. Based on data types, we
created an evidence map to illustrate the adoption of DL models.
Results: We included 64 articles, of which tabular static (26.6%, 17/64) and
temporal data (37.5%, 24/64) were the most frequently investigated. We found
that model backbone(s) differed among data types as well as the imputation
strategy. The "integrated" strategy, that is, the imputation task being solved
concurrently with downstream tasks, was popular for tabular temporal (50%,
12/24) and multi-modal data (71.4%, 5/7), but limited for other data types.
Moreover, DL-based imputation methods yielded better imputation accuracy in
most studies, compared with non-DL-based methods.
Conclusion: DL-based imputation models can be customized based on data type,
addressing the corresponding missing patterns, and their associated "integrated"
strategy can enhance the efficacy of imputation, especially in scenarios where
data is complex. Future research may focus on the portability and fairness of
DL-based models for healthcare data imputation.
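The "integrated" strategy described in the abstract (imputation solved concurrently with the downstream task) can be illustrated with a deliberately tiny sketch. This is not a method from the review: the data, the `train` function, and the single learnable fill value are all hypothetical, and a linear model stands in for a deep network. It only shows the idea that the value used for missing inputs can be updated by the downstream loss rather than fixed beforehand.

```python
# Toy illustration only: the data and names here are hypothetical, not from the review.

def train(xs, ys, fill, learn_fill, lr=0.005, steps=20000):
    """Fit y ~ w*x + b by gradient descent on mean squared error.

    Missing inputs (None) are replaced by `fill`. If `learn_fill` is True,
    `fill` is updated by the same downstream loss (toy "integrated" case);
    otherwise it stays fixed (impute first, then predict).
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        gw = gb = gf = 0.0
        for x, y in zip(xs, ys):
            missing = x is None
            x_in = fill if missing else x
            err = (w * x_in + b) - y
            gw += 2 * err * x_in / n
            gb += 2 * err / n
            if missing:
                # The downstream error also flows back into the imputed value.
                gf += 2 * err * w / n
        w -= lr * gw
        b -= lr * gb
        if learn_fill:
            fill -= lr * gf
    loss = sum(
        ((w * (fill if x is None else x) + b) - y) ** 2 for x, y in zip(xs, ys)
    ) / n
    return w, b, fill, loss


# The two largest inputs are missing, so the observed mean is a poor stand-in.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, None, None]
ys = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0]

observed = [x for x in xs if x is not None]
mean_fill = sum(observed) / len(observed)

# Separated: mean-impute, then fit the downstream model with the fill frozen.
_, _, fill_sep, loss_sep = train(xs, ys, mean_fill, learn_fill=False)
# Integrated (toy): start from the same fill but learn it with the model.
_, _, fill_int, loss_int = train(xs, ys, mean_fill, learn_fill=True)

print("separated :", fill_sep, loss_sep)
print("integrated:", fill_int, loss_int)
```

In this contrived setup the jointly learned fill value drifts away from the observed mean and the downstream error drops. Real DL-based imputers covered by the review are far richer, and this sketch says nothing about their accuracy.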
Related papers
- Lessons Learned on Information Retrieval in Electronic Health Records: A Comparison of Embedding Models and Pooling Strategies [8.822087602255504]
Applying large language models to the clinical domain is challenging due to the context-heavy nature of processing medical records.
This paper explores how different embedding models and pooling methods affect information retrieval for the clinical domain.
arXiv Detail & Related papers (2024-09-23T16:16:08Z)
- A Comprehensive Survey on Data Augmentation [55.355273602421384]
Data augmentation is a technique that generates high-quality artificial data by manipulating existing data samples.
Existing literature surveys each focus on only one specific data modality.
We propose a more enlightening taxonomy that encompasses data augmentation techniques for different common data modalities.
arXiv Detail & Related papers (2024-05-15T11:58:08Z)
- In-Database Data Imputation [0.6157028677798809]
Missing data is a widespread problem in many domains, creating challenges in data analysis and decision making.
Traditional techniques for dealing with missing data, such as excluding incomplete records or imputing simple estimates, are computationally efficient but may introduce bias and disrupt variable relationships.
Model-based imputation techniques offer a more robust solution that preserves the variability and relationships in the data, but they demand significantly more computation time.
This work enables efficient, high-quality, and scalable data imputation within a database system using the widely used MICE method.
arXiv Detail & Related papers (2024-01-07T01:57:41Z)
- Deep Imputation of Missing Values in Time Series Health Data: A Review with Benchmarking [0.0]
This survey performs six data-centric experiments to benchmark state-of-the-art deep imputation methods on five time series health data sets.
Deep learning methods that jointly perform cross-sectional (across variables) and longitudinal (across time) imputations of missing values in time series data yield statistically better data quality than traditional imputation methods.
arXiv Detail & Related papers (2023-02-10T16:03:36Z)
- HyperImpute: Generalized Iterative Imputation with Automatic Model Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models.
We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
arXiv Detail & Related papers (2022-06-15T19:10:35Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- Categorical EHR Imputation with Generative Adversarial Nets [11.171712535005357]
We propose a simple and yet effective approach that is based on previous work on GANs for data imputation.
We show that our imputation approach largely improves the prediction accuracy, compared to more traditional data imputation approaches.
arXiv Detail & Related papers (2021-08-03T18:50:26Z)
- CSDI: Conditional Score-based Diffusion Models for Probabilistic Time Series Imputation [107.63407690972139]
Conditional Score-based Diffusion models for Imputation (CSDI) is a novel time series imputation method that utilizes score-based diffusion models conditioned on observed data.
CSDI improves by 40-70% over existing probabilistic imputation methods on popular performance metrics.
In addition, CSDI reduces the error by 5-20% compared to the state-of-the-art deterministic imputation methods.
arXiv Detail & Related papers (2021-07-07T22:20:24Z)
- Handling Non-ignorably Missing Features in Electronic Health Records Data Using Importance-Weighted Autoencoders [8.518166245293703]
We propose a novel extension of VAEs called Importance-Weighted Autoencoders (IWAEs) to flexibly handle Missing Not At Random patterns in the Physionet data.
Our proposed method models the missingness mechanism using an embedded neural network, eliminating the need to specify the exact form of the missingness mechanism a priori.
arXiv Detail & Related papers (2021-01-18T22:53:29Z)
- Adversarial Sample Enhanced Domain Adaptation: A Case Study on Predictive Modeling with Electronic Health Records [57.75125067744978]
We propose a data augmentation method to facilitate domain adaptation.
Adversarially generated samples are used during domain adaptation.
Results confirm the effectiveness of our method and the generality on different tasks.
arXiv Detail & Related papers (2021-01-13T03:20:20Z)
- CHEER: Rich Model Helps Poor Model via Knowledge Infusion [69.23072792708263]
We develop a knowledge infusion framework named CHEER that can succinctly summarize such a rich model into transferable representations.
Our empirical results showed that CHEER outperformed baselines by 5.60% to 46.80% in terms of the macro-F1 score on multiple physiological datasets.
arXiv Detail & Related papers (2020-05-21T21:44:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.