Missing Value Estimation using Clustering and Deep Learning within
Multiple Imputation Framework
- URL: http://arxiv.org/abs/2202.13734v1
- Date: Mon, 28 Feb 2022 13:02:44 GMT
- Title: Missing Value Estimation using Clustering and Deep Learning within
Multiple Imputation Framework
- Authors: Manar D Samad, Sakib Abrar, Norou Diawara
- Abstract summary: The most popular imputation algorithm is arguably multiple imputation by chained equations (MICE).
This paper proposes methods to improve both the imputation accuracy of MICE and the classification accuracy of imputed data.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Missing values in tabular data restrict the use and performance of machine
learning, requiring the imputation of missing values. The most popular
imputation algorithm is arguably multiple imputation by chained equations
(MICE), which estimates missing values from linear conditioning on observed
values. This paper proposes methods to improve both the imputation accuracy of
MICE and the classification accuracy of imputed data by replacing MICE's linear
conditioning with ensemble learning and deep neural networks (DNN). The
imputation accuracy is further improved by characterizing individual samples
with cluster labels (CISCL) obtained from the training data. Our extensive
analyses involving six tabular data sets, up to 80% missingness, and three
missingness types (missing completely at random, missing at random, missing not
at random) reveal that ensemble or deep learning within MICE is superior to the
baseline MICE (b-MICE), both of which are consistently outperformed by CISCL.
Results show that CISCL plus b-MICE outperforms b-MICE for all percentages and
types of missingness. Our proposed DNN based MICE and gradient boosting MICE
plus CISCL (GB-MICE-CISCL) outperform seven other baseline imputation
algorithms in most experimental cases. The classification accuracy obtained on
GB-MICE-imputed data is further improved by the proposed GB-MICE-CISCL
imputation across all missingness percentages. Results also reveal a
shortcoming of the MICE framework at high missingness (>50%) and when values
are missing not at random. This paper provides a generalized approach to
identifying the best imputation model for a given data set, missingness
percentage, and missingness type.
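As a rough illustration of the GB-MICE-CISCL idea, the sketch below pairs scikit-learn's IterativeImputer (a MICE-style imputer) with a gradient-boosting estimator and appends k-means cluster labels as an extra conditioning feature. The provisional mean imputation used to obtain cluster labels and the choice of four clusters are illustrative assumptions, not the authors' exact procedure.

```python
# Minimal sketch of MICE with gradient boosting plus cluster-label features.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
X[rng.random(X.shape) < 0.3] = np.nan  # 30% missing completely at random

# Assumed CISCL-like step: cluster a provisionally mean-imputed copy of the
# data and append each sample's cluster label as an extra feature.
X_provisional = SimpleImputer(strategy="mean").fit_transform(X)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_provisional)
X_aug = np.column_stack([X, labels.astype(float)])

# GB-MICE: replace MICE's linear conditional models with gradient boosting.
imputer = IterativeImputer(
    estimator=HistGradientBoostingRegressor(random_state=0),
    max_iter=10,
    random_state=0,
)
X_imputed = imputer.fit_transform(X_aug)[:, :-1]  # drop the cluster column
```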
Related papers
- On the Performance of Imputation Techniques for Missing Values on Healthcare Datasets [0.0]
Missing values are a common characteristic of real-world datasets, especially healthcare data.
This study compares the performance of seven imputation techniques: mean imputation, median imputation, last observation carried forward (LOCF) imputation, k-nearest neighbor (KNN) imputation, interpolation imputation, MissForest imputation, and multiple imputation by chained equations (MICE).
The results show that MissForest imputation performs best, followed by MICE imputation.
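A minimal sketch of how such a comparison could be set up with scikit-learn, assuming synthetic Gaussian data; MissForest is approximated by IterativeImputer with a random-forest estimator (a common stand-in, not the original implementation), and LOCF/interpolation are omitted because they presume an ordering of the rows.

```python
# Compare several imputers by RMSE on artificially masked entries.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X_true = rng.normal(size=(300, 5))
X = X_true.copy()
mask = rng.random(X.shape) < 0.2  # hide 20% of the entries
X[mask] = np.nan

def rmse(imputed):
    return float(np.sqrt(np.mean((imputed[mask] - X_true[mask]) ** 2)))

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=5),
    "mice": IterativeImputer(max_iter=10, random_state=0),
    "missforest-like": IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=50, random_state=0),
        max_iter=5, random_state=0),
}
for name, imp in imputers.items():
    print(f"{name}: RMSE = {rmse(imp.fit_transform(X)):.3f}")
```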
arXiv Detail & Related papers (2024-03-13T18:07:17Z) - Machine Learning Based Missing Values Imputation in Categorical Datasets [2.5611256859404983]
This research looked into the use of machine learning algorithms to fill in the gaps in categorical datasets.
The emphasis was on ensemble models constructed using the Error-Correcting Output Codes (ECOC) framework.
Despite these encouraging results, deep learning for missing data imputation still faces obstacles, including the requirement for large amounts of labeled data.
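For concreteness, here is a hypothetical sketch of the core mechanic: treating a missing categorical column as a classification target and filling it with scikit-learn's Error-Correcting Output Codes ensemble. The synthetic data and decision-tree base learner are illustrative choices, not the paper's setup.

```python
# Impute a categorical column by classification with an ECOC ensemble.
import numpy as np
from sklearn.multiclass import OutputCodeClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 4))                  # observed numeric features
y = (X[:, 0] + X[:, 1] > 0).astype(int) + (X[:, 2] > 0).astype(int)  # 3 classes
missing = rng.random(400) < 0.25               # rows with the category missing

ecoc = OutputCodeClassifier(
    estimator=DecisionTreeClassifier(max_depth=4, random_state=0),
    code_size=2.0, random_state=0)
ecoc.fit(X[~missing], y[~missing])             # train on complete rows only
y_imputed = y.copy()                           # in practice y[missing] is unknown
y_imputed[missing] = ecoc.predict(X[missing])  # fill the gaps by classification
```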
arXiv Detail & Related papers (2023-06-10T03:29:48Z) - MISNN: Multiple Imputation via Semi-parametric Neural Networks [9.594714330925703]
Multiple imputation (MI) has been widely applied to missing value problems in biomedical, social and econometric research.
We propose MISNN, a novel and efficient algorithm that incorporates feature selection for MI.
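MISNN's architecture is not described in this summary, so the loose sketch below only illustrates the general pairing of feature selection with a neural imputation model: LassoCV picks predictors and an MLP imputes one column. Both components are stand-ins for illustration.

```python
# Feature selection followed by neural-network imputation of one column.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 10))
X[:, 0] = X[:, 1] + 0.5 * X[:, 2] + 0.1 * rng.normal(size=500)  # correlated target
missing = rng.random(500) < 0.2
obs = ~missing

# Select predictors of column 0 with nonzero Lasso coefficients.
predictors = np.delete(X, 0, axis=1)
lasso = LassoCV(cv=5).fit(predictors[obs], X[obs, 0])
selected = np.flatnonzero(np.abs(lasso.coef_) > 1e-8)

# Impute column 0 with an MLP fit on the selected features only.
Z = predictors[:, selected]
mlp = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
mlp.fit(Z[obs], X[obs, 0])
X_filled = X.copy()
X_filled[missing, 0] = mlp.predict(Z[missing])
```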
arXiv Detail & Related papers (2023-05-02T21:45:36Z) - To Impute or not to Impute? -- Missing Data in Treatment Effect
Estimation [84.76186111434818]
We identify a new missingness mechanism, which we term mixed confounded missingness (MCM), where some missingness determines treatment selection and other missingness is determined by treatment selection.
We show that naively imputing all data leads to poorly performing treatment effect models, as the act of imputation effectively removes information necessary to provide unbiased estimates.
Our solution is selective imputation, where we use insights from MCM to inform precisely which variables should be imputed and which should not.
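As a toy sketch of what selective imputation can look like mechanically (the paper's MCM-based rule for deciding which columns to impute is the substantive contribution and is not reproduced here), one might impute only a designated subset of columns and expose the remaining missingness as indicator features:

```python
# Impute some columns; preserve informative missingness in others.
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
X[rng.random(X.shape) < 0.2] = np.nan

impute_cols = [0, 1, 2]   # hypothetical: variables judged safe to impute
preserve_cols = [3, 4]    # hypothetical: missingness tied to treatment selection

X_out = X.copy()
X_out[:, impute_cols] = SimpleImputer(strategy="mean").fit_transform(X[:, impute_cols])
# Preserved columns keep their NaNs; expose them as indicator features
# for a downstream model that handles missingness explicitly.
indicators = np.isnan(X[:, preserve_cols]).astype(float)
X_final = np.column_stack([X_out, indicators])
```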
arXiv Detail & Related papers (2022-02-04T12:08:31Z) - MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms [82.90843777097606]
We propose a causally-aware imputation algorithm (MIRACLE) for missing data.
MIRACLE iteratively refines the imputation of a baseline by simultaneously modeling the missingness generating mechanism.
We conduct extensive experiments on synthetic data and a variety of publicly available datasets to show that MIRACLE consistently improves imputation.
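MIRACLE's causally-aware regularizer cannot be reconstructed from this summary; the very loose sketch below only shows the shape of such a loop, refining a baseline imputation while also fitting a model of the missingness mask:

```python
# Refine a baseline imputation iteratively; also model the missingness mask.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 4))
X[:, 0] += X[:, 1]                       # introduce some dependence
mask = rng.random(X.shape) < 0.2
X_obs = np.where(mask, np.nan, X)

X_hat = np.where(mask, np.nanmean(X_obs, axis=0), X_obs)  # baseline imputation
for _ in range(5):                       # iterative refinement, column by column
    for j in range(X_hat.shape[1]):
        others = np.delete(X_hat, j, axis=1)
        fit_rows = ~mask[:, j]
        reg = LinearRegression().fit(others[fit_rows], X_hat[fit_rows, j])
        X_hat[mask[:, j], j] = reg.predict(others[mask[:, j]])
    # Fit a model of the missingness mechanism (used here only as a
    # diagnostic; MIRACLE turns this into a regularizer on the imputer).
    mech = LogisticRegression().fit(X_hat, mask[:, 0].astype(int))
```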
arXiv Detail & Related papers (2021-11-04T22:38:18Z) - Imputation-Free Learning from Incomplete Observations [73.15386629370111]
We introduce an importance-guided stochastic gradient descent (IGSGD) method to train models for inference on inputs containing missing values, without imputation.
We employ reinforcement learning (RL) to adjust the gradients used to train the models via back-propagation.
Our imputation-free predictions outperform the traditional two-step imputation-based predictions using state-of-the-art imputation methods.
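The RL-based gradient adjustment itself is not reconstructable from this summary. For contrast, here is a common imputation-free baseline, zero-filling plus missingness-mask features, which is explicitly not the paper's IGSGD method:

```python
# Imputation-free baseline: zero-fill NaNs and append the mask as features.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(6)
X = rng.normal(size=(400, 5))
y = (X[:, 0] > 0).astype(int)            # label depends on a feature that
X[rng.random(X.shape) < 0.3] = np.nan    # later becomes partially missing

mask = np.isnan(X)
X_in = np.column_stack([np.where(mask, 0.0, X), mask.astype(float)])
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
clf.fit(X_in, y)                         # trains directly on incomplete inputs
```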
arXiv Detail & Related papers (2021-07-05T12:44:39Z) - Model-based clustering of partial records [11.193504036335503]
We develop clustering methodology through a model-based approach using the marginal density for the observed values.
We compare our algorithm to the corresponding full expectation-maximization (EM) approach that considers the missing values in the incomplete data set.
Simulation studies demonstrate that our approach has favorable recovery of the true cluster partition compared to case deletion and imputation.
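The key ingredient is easy to sketch: because the marginal of a multivariate normal over any subset of coordinates is again normal, a Gaussian cluster density can be evaluated on exactly the coordinates each row has observed.

```python
# Evaluate a Gaussian log-density on a row's observed coordinates only.
import numpy as np
from scipy.stats import multivariate_normal

def marginal_logpdf(x_row, mean, cov):
    obs = ~np.isnan(x_row)  # restrict mean and covariance to observed dims
    return multivariate_normal.logpdf(
        x_row[obs], mean=mean[obs], cov=cov[np.ix_(obs, obs)])

mean = np.zeros(3)
cov = np.array([[1.0, 0.5, 0.0],
                [0.5, 1.0, 0.2],
                [0.0, 0.2, 1.0]])
row = np.array([0.3, np.nan, -1.2])
print(marginal_logpdf(row, mean, cov))
```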
arXiv Detail & Related papers (2021-03-30T13:30:59Z) - Learning by Minimizing the Sum of Ranked Range [58.24935359348289]
We introduce the sum of ranked range (SoRR) as a general approach to form learning objectives.
A ranked range is a consecutive sequence of sorted values of a set of real numbers.
We explore two applications in machine learning of the minimization of the SoRR framework, namely the AoRR aggregate loss for binary classification and the TKML individual loss for multi-label/multi-class classification.
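The SoRR objective is simple to state: for 1 ≤ k ≤ m, it is the sum of the k-th through m-th largest values of a set. A direct sketch:

```python
# Sum of ranked range: the k-th through m-th largest values (1-indexed).
import numpy as np

def sorr(values, k, m):
    s = np.sort(values)[::-1]       # sort in descending order
    return float(s[k - 1:m].sum())  # equivalently, top-m sum minus top-(k-1) sum

x = np.array([3.0, -1.0, 7.0, 2.0, 5.0])
print(sorr(x, 2, 4))  # 5 + 3 + 2 = 10.0
```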
arXiv Detail & Related papers (2020-10-05T01:58:32Z) - Deep F-measure Maximization for End-to-End Speech Understanding [52.36496114728355]
We propose a differentiable approximation to the F-measure and train the network with this objective using standard backpropagation.
We perform experiments on two standard fairness datasets (Adult and Communities and Crime), as well as on speech-to-intent detection on the ATIS dataset and speech-to-image concept classification on the Speech-COCO dataset.
In all four of these tasks, the F-measure objective yields improved micro-F1 scores, with absolute improvements of up to 8% compared to models trained with the cross-entropy loss function.
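The summary does not give the paper's exact surrogate; the sketch below shows one standard way to make the F-measure differentiable, replacing hard counts with probabilities so that true and false positives become smooth functions of the model outputs:

```python
# A soft (differentiable) F1 surrogate built from probabilistic counts.
import numpy as np

def soft_f1(p, y, eps=1e-8):
    """p: predicted probabilities in [0, 1]; y: binary labels in {0, 1}."""
    tp = np.sum(p * y)              # expected true positives
    fp = np.sum(p * (1 - y))        # expected false positives
    fn = np.sum((1 - p) * y)        # expected false negatives
    return 2 * tp / (2 * tp + fp + fn + eps)

p = np.array([0.9, 0.2, 0.7, 0.4])
y = np.array([1, 0, 1, 0])
print(1.0 - soft_f1(p, y))  # loss to minimize by standard backpropagation
```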
arXiv Detail & Related papers (2020-08-08T03:02:27Z) - ELMV: an Ensemble-Learning Approach for Analyzing Electronic Health
Records with Significant Missing Values [4.9810955364960385]
We propose a novel Ensemble-Learning for Missing Value (ELMV) framework, which introduces an effective approach to construct multiple subsets of the original EHR data with a much lower missing rate.
ELMV has been evaluated on real-world healthcare data for critical feature identification, as well as on a batch of simulated data with different missing rates for outcome prediction.
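A rough sketch of the subset-construction idea (the selection heuristic here, random column subsets followed by keeping the densest rows, is an assumption rather than ELMV's actual procedure):

```python
# Build several sub-tables whose missing rate is much lower than the original.
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 20))
X[rng.random(X.shape) < 0.4] = np.nan  # 40% missing overall

def low_missing_subset(X, n_cols=10, n_rows=80):
    cols = rng.choice(X.shape[1], size=n_cols, replace=False)  # random columns
    rows = np.argsort(np.isnan(X[:, cols]).mean(axis=1))[:n_rows]  # densest rows
    return rows, cols

for rows, cols in (low_missing_subset(X) for _ in range(5)):
    print(f"subset missing rate: {np.isnan(X[np.ix_(rows, cols)]).mean():.2f}")
```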
arXiv Detail & Related papers (2020-06-25T06:29:55Z) - Diversity inducing Information Bottleneck in Model Ensembles [73.80615604822435]
In this paper, we target the problem of generating effective ensembles of neural networks by encouraging diversity in prediction.
We explicitly optimize a diversity inducing adversarial loss for learning latent variables and thereby obtain diversity in the output predictions necessary for modeling multi-modal data.
Compared to the most competitive baselines, we show significant improvements in classification accuracy under a shift in the data distribution.
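The adversarial information-bottleneck loss is not specified in this summary; as a toy illustration of rewarding disagreement between ensemble members, one could add a generic penalty on the agreement of their predicted distributions (not the paper's loss):

```python
# Generic diversity penalty: mean pairwise agreement of member predictions.
import numpy as np

def pairwise_agreement(probs):
    """probs: (n_members, n_samples, n_classes) predicted distributions."""
    m = probs.shape[0]
    total = 0.0
    for i in range(m):
        for j in range(i + 1, m):
            total += np.mean(np.sum(probs[i] * probs[j], axis=-1))
    return total / (m * (m - 1) / 2)

probs = np.random.default_rng(8).dirichlet(np.ones(3), size=(4, 100))
# During training one would minimize task_loss + lam * pairwise_agreement(...)
print(pairwise_agreement(probs))
```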
arXiv Detail & Related papers (2020-03-10T03:10:41Z)