On the Performance of Imputation Techniques for Missing Values on Healthcare Datasets
- URL: http://arxiv.org/abs/2403.14687v1
- Date: Wed, 13 Mar 2024 18:07:17 GMT
- Title: On the Performance of Imputation Techniques for Missing Values on Healthcare Datasets
- Authors: Luke Oluwaseye Joel, Wesley Doorsamy, Babu Sena Paul
- Abstract summary: Missing values are a common characteristic of real-world datasets, especially healthcare data.
This study compares the performance of seven imputation techniques, namely Mean imputation, Median imputation, Last Observation Carried Forward (LOCF) imputation, K-Nearest Neighbor (KNN) imputation, Interpolation imputation, MissForest imputation, and Multiple Imputation by Chained Equations (MICE).
The results show that MissForest imputation performs best, followed by MICE imputation.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Missing values are a common characteristic of real-world datasets, especially healthcare data. This is problematic when applying machine learning algorithms to such datasets, because most machine learning models perform poorly in the presence of missing values. The aim of this study is to compare the performance of seven imputation techniques, namely Mean imputation, Median imputation, Last Observation Carried Forward (LOCF) imputation, K-Nearest Neighbor (KNN) imputation, Interpolation imputation, MissForest imputation, and Multiple Imputation by Chained Equations (MICE), on three healthcare datasets. Missing values were introduced into the datasets at rates of 10%, 15%, 20% and 25%, and the imputation techniques were employed to fill them in. Performance was compared using root mean squared error (RMSE) and mean absolute error (MAE). The results show that MissForest imputation performs best, followed by MICE imputation. Additionally, we investigate whether it is better to perform feature selection before imputation or vice versa, using recall, precision, F1-score and accuracy as metrics. Because the literature on this question is sparse and researchers disagree on the subject, we hope that the results from this experiment will encourage data scientists and researchers to perform imputation before feature selection when dealing with data containing missing values.
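A minimal sketch of the comparison protocol described above, using scikit-learn: the toy data, hyperparameters, and the use of IterativeImputer as a stand-in for both MICE (default estimator) and MissForest (random-forest estimator) are assumptions, not the paper's exact setup.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Stand-in data; the paper uses three healthcare datasets not reproduced here.
X_true = pd.DataFrame(rng.normal(size=(500, 6)), columns=[f"f{i}" for i in range(6)])

imputers = {
    "mean":       lambda X: SimpleImputer(strategy="mean").fit_transform(X),
    "median":     lambda X: SimpleImputer(strategy="median").fit_transform(X),
    "locf":       lambda X: X.ffill().bfill().to_numpy(),  # back-fill covers leading NaNs
    "knn":        lambda X: KNNImputer(n_neighbors=5).fit_transform(X),
    "interp":     lambda X: X.interpolate(limit_direction="both").to_numpy(),
    # IterativeImputer approximates MICE; with a random-forest estimator it
    # approximates MissForest (neither is the reference implementation).
    "mice":       lambda X: IterativeImputer(random_state=0).fit_transform(X),
    "missforest": lambda X: IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=50, random_state=0),
        random_state=0).fit_transform(X),
}

for frac in (0.10, 0.15, 0.20, 0.25):       # missingness levels from the paper
    mask = rng.random(X_true.shape) < frac   # MCAR mask
    X_miss = X_true.mask(mask)
    for name, impute in imputers.items():
        X_hat = np.asarray(impute(X_miss.copy()))
        err = X_hat[mask] - X_true.to_numpy()[mask]
        print(f"{frac:.0%} {name:10s} "
              f"RMSE={np.sqrt(np.mean(err**2)):.3f} MAE={np.mean(np.abs(err)):.3f}")
```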
Related papers
- Attribute-to-Delete: Machine Unlearning via Datamodel Matching [65.13151619119782]
Machine unlearning -- efficiently removing the influence of a small "forget set" of training data from a pre-trained machine learning model -- has recently attracted interest.
Recent research shows that existing machine unlearning techniques do not hold up in more challenging evaluation settings.
arXiv Detail & Related papers (2024-10-30T17:20:10Z)
- Machine Learning Based Missing Values Imputation in Categorical Datasets [2.5611256859404983]
This research looked into the use of machine learning algorithms to fill in the gaps in categorical datasets.
The emphasis was on ensemble models constructed using the Error Correction Output Codes framework.
Despite these encouraging results, deep learning for missing-data imputation still faces obstacles, including the requirement for large amounts of labeled data.
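A hedged sketch of the general idea: treat an incomplete categorical column as a prediction target for an Error-Correcting Output Codes ensemble. scikit-learn's OutputCodeClassifier, the logistic-regression base learner, and the assumption that the remaining feature columns are numeric and complete are illustrative choices, not the paper's setup.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OutputCodeClassifier

def impute_categorical_ecoc(df: pd.DataFrame, target: str, seed: int = 0) -> pd.DataFrame:
    """Fill missing categories in `target`, predicting them from the other
    columns (assumed numeric and complete) with an ECOC ensemble."""
    observed = df[target].notna()
    features = df.columns.drop(target)
    ecoc = OutputCodeClassifier(LogisticRegression(max_iter=1000),
                                code_size=2, random_state=seed)
    ecoc.fit(df.loc[observed, features], df.loc[observed, target])
    out = df.copy()
    out.loc[~observed, target] = ecoc.predict(df.loc[~observed, features])
    return out
```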
arXiv Detail & Related papers (2023-06-10T03:29:48Z)
- IRTCI: Item Response Theory for Categorical Imputation [5.9952530228468754]
Several imputation techniques have been designed to replace missing data with stand-in values.
The work showcased here offers a novel means for categorical imputation based on item response theory (IRT).
Analyses comparing these techniques were performed on three different datasets.
arXiv Detail & Related papers (2023-02-08T16:17:20Z)
- Missing Value Estimation using Clustering and Deep Learning within Multiple Imputation Framework [0.0]
The most popular imputation algorithm is arguably Multiple Imputation by Chained Equations (MICE).
This paper proposes methods to improve both the imputation accuracy of MICE and the classification accuracy of imputed data.
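One way to read "clustering within a multiple imputation framework" as code, sketched under heavy assumptions: k-means on a mean-filled copy, scikit-learn's IterativeImputer as a MICE stand-in, and every feature observed at least once per cluster. This is not the paper's actual method.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, IterativeImputer

def cluster_then_impute(X: np.ndarray, n_clusters: int = 3, seed: int = 0) -> np.ndarray:
    # Rough mean fill just to give k-means complete inputs.
    rough = SimpleImputer(strategy="mean").fit_transform(X)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(rough)
    X_out = np.array(X, dtype=float)
    for k in range(n_clusters):
        rows = labels == k
        # Assumes each cluster observes every feature at least once.
        X_out[rows] = IterativeImputer(random_state=seed).fit_transform(X_out[rows])
    return X_out
```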
arXiv Detail & Related papers (2022-02-28T13:02:44Z)
- Benchmarking missing-values approaches for predictive models on health databases [47.187609203210705]
We conduct a benchmark of missing-values strategies in predictive models with a focus on large health databases.
We find that native support for missing values in supervised machine learning predicts better than state-of-the-art imputation, at a much lower computational cost.
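The "native support" route is easy to try: scikit-learn's histogram-based gradient boosting accepts NaN directly, so no imputation step is needed. The toy data below is illustrative, not the health databases from the benchmark.

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.2] = np.nan   # 20% MCAR missingness

# Trains and predicts straight on the NaNs: at each split, missing values
# are routed to whichever branch was learned for them.
clf = HistGradientBoostingClassifier(random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())
```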
arXiv Detail & Related papers (2022-02-17T09:40:04Z)
- To Impute or not to Impute? -- Missing Data in Treatment Effect Estimation [84.76186111434818]
We identify a new missingness mechanism, which we term mixed confounded missingness (MCM), where some missingness determines treatment selection and other missingness is determined by treatment selection.
We show that naively imputing all data leads to poorly performing treatment-effect models, as the act of imputation effectively removes information necessary to provide unbiased estimates.
Our solution is selective imputation, where we use insights from MCM to inform precisely which variables should be imputed and which should not.
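A minimal sketch of the selective-imputation idea: fill only the columns deemed safe to impute and leave the rest, NaNs included, untouched. Which columns qualify would come from the MCM analysis in the paper; here the split is simply an argument.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

def selective_impute(X: np.ndarray, impute_cols: list[int]) -> np.ndarray:
    """Mean-impute `impute_cols`; pass every other column through unchanged."""
    passthrough = [c for c in range(X.shape[1]) if c not in impute_cols]
    ct = ColumnTransformer(
        [("imp", SimpleImputer(strategy="mean"), impute_cols)],
        remainder="passthrough")
    out = ct.fit_transform(X)
    # ColumnTransformer emits the imputed columns first, then the passthrough
    # ones; restore the original column order before returning.
    return out[:, np.argsort(impute_cols + passthrough)]
```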
arXiv Detail & Related papers (2022-02-04T12:08:31Z)
- MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms [82.90843777097606]
We propose a causally-aware imputation algorithm (MIRACLE) for missing data.
MIRACLE iteratively refines the imputation of a baseline by simultaneously modeling the missingness generating mechanism.
We conduct extensive experiments on synthetic data and a variety of publicly available datasets to show that MIRACLE is able to consistently improve imputation.
arXiv Detail & Related papers (2021-11-04T22:38:18Z)
- Doing Great at Estimating CATE? On the Neglected Assumptions in Benchmark Comparisons of Treatment Effect Estimators [91.3755431537592]
We show that even in arguably the simplest setting, estimation under ignorability assumptions can be misleading.
We consider two popular machine learning benchmark datasets for evaluation of heterogeneous treatment effect estimators.
We highlight that the inherent characteristics of the benchmark datasets favor some algorithms over others.
arXiv Detail & Related papers (2021-07-28T13:21:27Z)
- Imputation-Free Learning from Incomplete Observations [73.15386629370111]
We introduce the importance-guided stochastic gradient descent (IGSGD) method to train inference models on inputs containing missing values, without imputation.
We employ reinforcement learning (RL) to adjust the gradients used to train the models via back-propagation.
Our imputation-free predictions outperform the traditional two-step imputation-based predictions using state-of-the-art imputation methods.
arXiv Detail & Related papers (2021-07-05T12:44:39Z)
- Missing Data Imputation for Classification Problems [1.52292571922932]
Imputation of missing data is a common preprocessing step in classification problems where the feature matrix has missing values.
In this paper, we propose a novel iterative kNN imputation technique based on a class-weighted grey distance.
This ensures that the imputation of the training data is directed towards improving classification performance.
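A hedged sketch of class-aware kNN imputation: the paper's grey distance is replaced here by scikit-learn's NaN-aware Euclidean distance, and neighbors are restricted to rows with the same class label, which captures the class-weighted intent only loosely.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

def class_knn_impute(X: np.ndarray, y: np.ndarray, k: int = 5) -> np.ndarray:
    X_out = X.copy()
    for cls in np.unique(y):
        rows = np.flatnonzero(y == cls)
        D = nan_euclidean_distances(X[rows], X[rows])
        np.fill_diagonal(D, np.inf)              # a row is not its own neighbor
        for i, r in enumerate(rows):
            miss = np.isnan(X[r])
            if not miss.any():
                continue
            nn = rows[np.argsort(D[i])[:k]]
            # Column-wise mean over the k nearest same-class rows; assumes at
            # least one of them observes each missing column.
            X_out[r, miss] = np.nanmean(X[nn][:, miss], axis=0)
    return X_out
```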
arXiv Detail & Related papers (2020-02-25T07:48:45Z)
- Missing Data Imputation using Optimal Transport [43.14084843713895]
We leverage optimal transport distances to quantify the criterion that two random batches from the same dataset should share the same distribution, and turn it into a loss function for imputing missing values.
We propose practical methods to minimize these losses using end-to-end learning.
Experiments show that OT-based methods match or outperform state-of-the-art imputation methods, even at high percentages of missing values.
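A sketch of the batch-Sinkhorn idea under stated assumptions: missing entries become free parameters trained so that random mini-batches of the completed matrix look alike. The third-party geomloss package supplies the Sinkhorn divergence, a recent PyTorch (with torch.nanmean) is assumed, and all hyperparameters below are illustrative.

```python
import torch
from geomloss import SamplesLoss   # third-party Sinkhorn divergence

def ot_impute(X, n_iters=300, batch=128, lr=1e-2, seed=0):
    torch.manual_seed(seed)
    X = torch.as_tensor(X, dtype=torch.float32)   # assumes >= 2*batch rows
    mask = torch.isnan(X)
    # Initialize the free parameters at the column means plus a little noise.
    init = torch.nanmean(X, dim=0).repeat(X.shape[0], 1)[mask]
    imps = (init + 0.1 * torch.randn_like(init)).requires_grad_()
    opt = torch.optim.Adam([imps], lr=lr)
    sinkhorn = SamplesLoss(loss="sinkhorn", p=2, blur=0.05)
    for _ in range(n_iters):
        X_hat = X.clone()
        X_hat[mask] = imps                        # differentiable fill
        idx = torch.randperm(X.shape[0])[: 2 * batch]
        # Two random batches of the completed data should look alike.
        loss = sinkhorn(X_hat[idx[:batch]], X_hat[idx[batch:]])
        opt.zero_grad()
        loss.backward()
        opt.step()
    X_out = X.clone()
    X_out[mask] = imps.detach()
    return X_out
```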
arXiv Detail & Related papers (2020-02-10T15:23:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.