Comparison of Missing Data Imputation Methods using the Framingham Heart
study dataset
- URL: http://arxiv.org/abs/2210.03154v2
- Date: Mon, 10 Oct 2022 07:22:00 GMT
- Title: Comparison of Missing Data Imputation Methods using the Framingham Heart
study dataset
- Authors: Konstantinos Psychogyios, Loukas Ilias, Dimitris Askounis
- Abstract summary: We test and modify state-of-the-art missing value imputation methods based on Generative Adversarial Networks (GANs) and Autoencoders.
The evaluation is accomplished for both the tasks of data imputation and post-imputation prediction.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Cardiovascular disease (CVD) is a class of diseases that involve the heart or
blood vessels and, according to the World Health Organization, is the leading cause
of death worldwide. Electronic health record (EHR) data regarding this case, as well as medical cases in
general, contain missing values very frequently. The percentage of missingness
may vary and is linked with instrument errors, manual data entry procedures,
etc. Even though the missing rate is usually significant, in many cases the
missing value imputation part is handled poorly either with case-deletion or
with simple statistical approaches such as mode and median imputation. These
methods are known to introduce significant bias, since they do not account for
the relationships between the dataset's variables. Within the medical
framework, many datasets consist of lab tests or patient medical tests, where
these relationships are present and strong. To address these limitations, in
this paper we test and modify state-of-the-art missing value imputation methods
based on Generative Adversarial Networks (GANs) and Autoencoders. The
evaluation is accomplished for both the tasks of data imputation and
post-imputation prediction. Regarding the imputation task, we achieve
improvements of 0.20 in normalised Root Mean Squared Error (RMSE) and 7.00% in
Area Under the Receiver Operating Characteristic Curve (AUROC). In
terms of the post-imputation prediction task, our models outperform the
standard approaches by 2.50% in F1-score.
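As a rough illustration of the simple baseline the abstract criticises, the sketch below applies per-column median imputation to synthetic data and scores it with a normalised RMSE on the entries that were hidden. Everything here is an assumption made for illustration: the function names `median_impute` and `normalised_rmse` are hypothetical, the data are randomly generated rather than taken from the Framingham dataset, and min-max scaling is only one possible normalisation (the abstract does not say which one the authors use).

```python
import numpy as np


def median_impute(X):
    """Replace NaNs in each column with that column's observed median."""
    X = np.asarray(X, dtype=float)
    medians = np.nanmedian(X, axis=0)  # per-column median, ignoring NaNs
    return np.where(np.isnan(X), medians, X)


def normalised_rmse(X_true, X_imputed, mask):
    """RMSE over the masked entries after min-max scaling each column.

    Assumption: columns are scaled to [0, 1] using the range of X_true.
    """
    col_min = X_true.min(axis=0)
    col_range = X_true.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid division by zero for constant columns
    diff = ((X_true - col_min) / col_range - (X_imputed - col_min) / col_range)[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


# Synthetic data: the last column depends strongly on the first one.
rng = np.random.default_rng(0)
X_true = rng.normal(size=(200, 4))
X_true[:, 3] = 0.8 * X_true[:, 0] + 0.2 * rng.normal(size=200)

# Hide roughly 20% of the entries completely at random.
mask = rng.random(X_true.shape) < 0.2
X_missing = X_true.copy()
X_missing[mask] = np.nan

X_imputed = median_impute(X_missing)
score = normalised_rmse(X_true, X_imputed, mask)
print(f"normalised RMSE of median imputation: {score:.3f}")
```

Median imputation fills every gap in a column with the same value, so it cannot use the dependence between the first and last columns. That is the kind of inter-variable relationship the abstract says these methods ignore.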
Related papers
- FedCVD: The First Real-World Federated Learning Benchmark on Cardiovascular Disease Data [52.55123685248105]
Cardiovascular diseases (CVDs) are currently the leading cause of death worldwide, highlighting the critical need for early diagnosis and treatment.
Machine learning (ML) methods can help diagnose CVDs early, but their performance relies on access to substantial data with high quality.
This paper presents the first real-world FL benchmark for cardiovascular disease detection, named FedCVD.
arXiv Detail & Related papers (2024-10-28T02:24:01Z)
- On the Performance of Imputation Techniques for Missing Values on Healthcare Datasets [0.0]
Missing values are a common characteristic of real-world datasets, especially healthcare data.
This study compares the performance of seven imputation techniques, namely Mean imputation, Median imputation, Last Observation Carried Forward (LOCF) imputation, K-Nearest Neighbor (KNN) imputation, Interpolation imputation, Missforest imputation, and Multiple Imputation by Chained Equations (MICE).
The results show that Missforest imputation performs the best followed by MICE imputation.
arXiv Detail & Related papers (2024-03-13T18:07:17Z)
- What is Hiding in Medicine's Dark Matter? Learning with Missing Data in Medical Practices [38.64139739520114]
Missing data may be linked to health care professional practice patterns.
We have examined 79 TARN fields with missing values for 5,791 trauma cases.
We conclude that the 1NN imputer is the best imputation method, which indicates a usual pattern of clinical decision making.
arXiv Detail & Related papers (2024-02-09T17:27:35Z) - An Improved Heart Disease Prediction Using Stacked Ensemble Method [0.9187159782788579]
We constructed an ML-based diagnostic system for heart illness forecasting, using a heart disorder dataset.
Our method can easily differentiate between people who have cardiac disease and those who are normal.
arXiv Detail & Related papers (2023-04-12T17:53:59Z) - Density-Aware Personalized Training for Risk Prediction in Imbalanced
Medical Data [89.79617468457393]
Training models on imbalanced data (class density discrepancy) may lead to suboptimal prediction.
We propose a framework for training models that addresses this imbalance issue.
We demonstrate our model's improved performance in real-world medical datasets.
arXiv Detail & Related papers (2022-07-23T00:39:53Z) - Practical Challenges in Differentially-Private Federated Survival
Analysis of Medical Data [57.19441629270029]
In this paper, we take advantage of the inherent properties of neural networks to federate the process of training of survival analysis models.
In the realistic setting of small medical datasets and only a few data centers, the noise added for differential privacy makes it harder for the models to converge.
We propose DPFed-post which adds a post-processing stage to the private federated learning scheme.
arXiv Detail & Related papers (2022-02-08T10:03:24Z) - To Impute or not to Impute? -- Missing Data in Treatment Effect
Estimation [84.76186111434818]
We identify a new missingness mechanism, which we term mixed confounded missingness (MCM), where some missingness determines treatment selection and other missingness is determined by treatment selection.
We show that naively imputing all data leads to poorly performing treatment effect models, as the act of imputation effectively removes information necessary to provide unbiased estimates.
Our solution is selective imputation, where we use insights from MCM to inform precisely which variables should be imputed and which should not.
arXiv Detail & Related papers (2022-02-04T12:08:31Z) - A Graph-based Imputation Method for Sparse Medical Records [3.136861161060886]
We propose a graph-based imputation method that is robust to sparsity and to unreliable unmeasured events.
Results indicate that the model learns to embed different event types in a clinically meaningful way.
arXiv Detail & Related papers (2021-11-17T13:06:08Z) - Imputation-Free Learning from Incomplete Observations [73.15386629370111]
We introduce the importance guided stochastic gradient descent (IGSGD) method to train models that perform inference directly from inputs containing missing values, without imputation.
We employ reinforcement learning (RL) to adjust the gradients used to train the models via back-propagation.
Our imputation-free predictions outperform the traditional two-step imputation-based predictions using state-of-the-art imputation methods.
arXiv Detail & Related papers (2021-07-05T12:44:39Z) - Medical data wrangling with sequential variational autoencoders [5.9207487081080705]
This paper proposes to model medical data records with heterogeneous data types and bursty missing data using sequential variational autoencoders (VAEs).
We show that Shi-VAE achieves the best performance in terms of both metrics, with lower computational complexity than the GP-VAE model.
arXiv Detail & Related papers (2021-03-12T10:59:26Z) - A random shuffle method to expand a narrow dataset and overcome the
associated challenges in a clinical study: a heart failure cohort example [50.591267188664666]
The aim of this study was to design a random shuffle method that enhances the cardinality of an HF dataset while remaining statistically legitimate.
The proposed random shuffle method was able to enhance the HF dataset cardinality circa 10 times and circa 21 times when followed by a random repeated-measures approach.
arXiv Detail & Related papers (2020-12-12T10:59:38Z)