Are deep learning models superior for missing data imputation in large
surveys? Evidence from an empirical comparison
- URL: http://arxiv.org/abs/2103.09316v1
- Date: Sun, 14 Mar 2021 16:24:04 GMT
- Title: Are deep learning models superior for missing data imputation in large
surveys? Evidence from an empirical comparison
- Authors: Zhenhua Wang, Olanrewaju Akande, Jason Poulos and Fan Li
- Abstract summary: Multiple imputation (MI) is the state-of-the-art approach for dealing with missing data arising from non-response in sample surveys.
Recent MI methods based on deep learning models have been developed with encouraging results in small studies.
This paper provides a framework for using simulations based on real survey data and several performance metrics to compare MI methods.
- Score: 5.994312110645453
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multiple imputation (MI) is the state-of-the-art approach for dealing with
missing data arising from non-response in sample surveys. Multiple imputation
by chained equations (MICE) is the most widely used MI method, but it lacks
theoretical foundation and is computationally intensive. Recently, MI methods
based on deep learning models have been developed with encouraging results in
small studies. However, there has been limited research systematically
evaluating their performance against MICE in realistic settings,
particularly in large-scale surveys. This paper provides a general framework
for using simulations based on real survey data and several performance metrics
to compare MI methods. We conduct extensive simulation studies based on the
American Community Survey data to compare repeated sampling properties of four
machine learning based MI methods: MICE with classification trees, MICE with
random forests, generative adversarial imputation network, and multiple
imputation using denoising autoencoders. We find the deep learning based MI
methods dominate MICE in terms of computational time; however, MICE with
classification trees consistently outperforms the deep learning MI methods in
terms of bias, mean squared error, and coverage under a range of realistic
settings.
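The winning method above, MICE with classification trees, can be sketched with scikit-learn's experimental `IterativeImputer` (a MICE-style chained-equations imputer) using a tree estimator, with the m completed datasets pooled by Rubin's rules. The toy data and target quantity (a column mean) are illustrative assumptions, not the paper's ACS setup, and a regression tree stands in for the classification trees used there:

```python
# Minimal sketch: tree-based MICE-style multiple imputation with pooling
# by Rubin's rules. Variation across the m imputations comes from
# different random seeds, a common device for tree estimators.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[:, 2] += 0.8 * X[:, 0]              # correlated column to impute
mask = rng.random(200) < 0.3          # ~30% missing at random in column 2
X_miss = X.copy()
X_miss[mask, 2] = np.nan

m = 5                                  # number of completed datasets
estimates, variances = [], []
for seed in range(m):
    imp = IterativeImputer(
        estimator=DecisionTreeRegressor(max_depth=5, random_state=seed),
        random_state=seed, max_iter=10,
    )
    completed = imp.fit_transform(X_miss)
    col = completed[:, 2]
    estimates.append(col.mean())                   # point estimate
    variances.append(col.var(ddof=1) / len(col))   # its sampling variance

# Rubin's rules: pooled point estimate and total variance
q_bar = np.mean(estimates)                 # pooled estimate
w_bar = np.mean(variances)                 # within-imputation variance
b = np.var(estimates, ddof=1)              # between-imputation variance
total_var = w_bar + (1 + 1 / m) * b
```

The between-imputation term `(1 + 1/m) * b` is what distinguishes multiple imputation from single imputation: it propagates the uncertainty about the missing values into the final variance estimate.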
Related papers
- Detecting Training Data of Large Language Models via Expectation Maximization [62.28028046993391]
Membership inference attacks (MIAs) aim to determine whether a specific instance was part of a target model's training data.
Applying MIAs to large language models (LLMs) presents unique challenges due to the massive scale of pre-training data and the ambiguous nature of membership.
We introduce EM-MIA, a novel MIA method for LLMs that iteratively refines membership scores and prefix scores via an expectation-maximization algorithm.
arXiv Detail & Related papers (2024-10-10T03:31:16Z)
- Masked Image Modeling: A Survey [73.21154550957898]
Masked image modeling emerged as a powerful self-supervised learning technique in computer vision.
We construct a taxonomy and review the most prominent papers in recent years.
We aggregate the performance results of various masked image modeling methods on the most popular datasets.
arXiv Detail & Related papers (2024-08-13T07:27:02Z)
- Evaluation of Missing Data Analytical Techniques in Longitudinal Research: Traditional and Machine Learning Approaches [11.048092826888412]
This study utilizes Monte Carlo simulations to assess and compare the effectiveness of six analytical techniques for missing data within the growth curve modeling framework.
We investigate the influence of sample size, missing data rate, missing data mechanism, and data distribution on the accuracy and efficiency of model estimation.
arXiv Detail & Related papers (2024-06-19T20:20:30Z)
- Querying Easily Flip-flopped Samples for Deep Active Learning [63.62397322172216]
Active learning is a machine learning paradigm that aims to improve the performance of a model by strategically selecting and querying unlabeled data.
One effective selection strategy is to base it on the model's predictive uncertainty, which can be interpreted as a measure of how informative a sample is.
This paper proposes the least disagree metric (LDM), defined as the smallest probability of disagreement of the predicted label.
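The uncertainty-based selection strategy described above can be sketched generically: rank unlabeled samples by the margin between their top two predicted class probabilities and query the smallest margins, i.e. the samples whose labels are most easily flipped. This illustrates the general strategy only, not the paper's LDM itself; the data and pool size are illustrative assumptions:

```python
# Minimal sketch of margin-based uncertainty sampling for active learning.
import numpy as np

rng = np.random.default_rng(3)
probs = rng.dirichlet(np.ones(3), size=10)   # softmax outputs for 10 samples

top2 = np.sort(probs, axis=1)[:, -2:]        # two largest class probabilities
margin = top2[:, 1] - top2[:, 0]             # small margin = high uncertainty
query = np.argsort(margin)[:3]               # indices of 3 samples to label
```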
arXiv Detail & Related papers (2024-01-18T08:12:23Z)
- Evaluating tree-based imputation methods as an alternative to MICE PMM for drawing inference in empirical studies [0.5892638927736115]
Dealing with missing data is an important problem in statistical analysis that is often addressed with imputation procedures.
The prevailing method of Multiple Imputation by Chained Equations with Predictive Mean Matching (PMM) is considered standard in the social science literature.
Meanwhile, tree-based imputation methods have emerged as very competitive alternatives.
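The predictive mean matching (PMM) step mentioned above can be sketched as: fit a regression on the observed cases, then impute each missing value by borrowing the observed outcome of a donor whose prediction is close. The donor pool size k=5 and the toy data are illustrative assumptions; full MICE PMM additionally draws the regression coefficients from their posterior at each iteration:

```python
# Minimal sketch of one predictive mean matching (PMM) imputation pass.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=0.5, size=100)
missing = rng.random(100) < 0.25
y_obs, x_obs, x_mis = y[~missing], x[~missing], x[missing]

# Least-squares fit on the observed cases
A = np.column_stack([np.ones_like(x_obs), x_obs])
beta, *_ = np.linalg.lstsq(A, y_obs, rcond=None)
pred_obs = A @ beta
pred_mis = np.column_stack([np.ones_like(x_mis), x_mis]) @ beta

# For each missing case, draw a random donor among the k closest predictions
k = 5
imputed = np.empty(len(x_mis))
for i, p in enumerate(pred_mis):
    donors = np.argsort(np.abs(pred_obs - p))[:k]
    imputed[i] = y_obs[rng.choice(donors)]

y_completed = y.copy()
y_completed[missing] = imputed
```

Because every imputed value is an actually observed one, PMM never produces implausible values (e.g. negative incomes), which is a key reason for its popularity in social science applications.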
arXiv Detail & Related papers (2024-01-17T21:28:00Z)
- Minimally Supervised Learning using Topological Projections in Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs).
Our proposed method first trains SOMs on unlabeled data and then assigns a minimal number of available labeled data points to key best matching units (BMUs).
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
arXiv Detail & Related papers (2024-01-12T22:51:48Z)
- Learning-Based Difficulty Calibration for Enhanced Membership Inference Attacks [3.470379197911889]
Membership inference attacks (MIAs) allow adversaries to determine whether a specific data point was part of a model's training dataset.
We present a novel approach to MIA aimed at significantly improving the true positive rate (TPR) at a low false positive rate (FPR).
Experiment results show that LDC-MIA can improve TPR at low FPR by up to 4x compared to the other difficulty calibration based MIAs.
arXiv Detail & Related papers (2024-01-10T04:58:17Z)
- Multiple Imputation with Neural Network Gaussian Process for High-dimensional Incomplete Data [9.50726756006467]
Imputation is arguably the most popular method for handling missing data, though existing methods have a number of limitations.
We propose two NNGP-based MI methods, collectively called MI-NNGP, that draw multiple imputations of missing values from a joint (posterior predictive) distribution.
The MI-NNGP methods are shown to significantly outperform existing state-of-the-art methods on synthetic and real datasets.
arXiv Detail & Related papers (2022-11-23T20:54:26Z)
- Multiple Imputation via Generative Adversarial Network for High-dimensional Blockwise Missing Value Problems [6.123324869194195]
We propose Multiple Imputation via Generative Adversarial Network (MI-GAN), a deep learning-based (specifically, GAN-based) multiple imputation method.
MI-GAN shows strong performance matching existing state-of-the-art imputation methods on high-dimensional datasets.
In particular, MI-GAN significantly outperforms other imputation methods in the sense of statistical inference and computational speed.
arXiv Detail & Related papers (2021-12-21T20:19:37Z)
- CLUB: A Contrastive Log-ratio Upper Bound of Mutual Information [105.73798100327667]
We propose a novel Contrastive Log-ratio Upper Bound (CLUB) of mutual information.
We provide a theoretical analysis of the properties of CLUB and its variational approximation.
Based on this upper bound, we introduce a MI minimization training scheme and further accelerate it with a negative sampling strategy.
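The CLUB bound has a simple sample form: with a conditional density q(y|x), it is the mean log-likelihood of the paired samples minus the mean over all unpaired combinations. In the sketch below, q(y|x) = N(y; x, 1) is an assumed, known conditional for a toy joint where y = x + noise; in the paper q is a learned variational network:

```python
# Minimal sketch of the CLUB mutual-information upper bound on samples.
import numpy as np

def log_q(y, x):
    # log N(y; mean=x, var=1)
    return -0.5 * (y - x) ** 2 - 0.5 * np.log(2 * np.pi)

rng = np.random.default_rng(2)
n = 2000
x = rng.normal(size=n)
y = x + rng.normal(size=n)

positive = log_q(y, x).mean()                    # E_{p(x,y)} log q(y|x)
negative = log_q(y[None, :], x[:, None]).mean()  # E_{p(x)p(y)} log q(y|x)
club = positive - negative                       # upper bound on I(x; y)
```

For this toy joint the true mutual information is 0.5 * log(2) ≈ 0.35 nats, and the computed bound sits above it, consistent with CLUB being an upper bound.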
arXiv Detail & Related papers (2020-06-22T05:36:16Z)
- Mutual Information Gradient Estimation for Representation Learning [56.08429809658762]
Mutual Information (MI) plays an important role in representation learning.
Recent advances establish tractable and scalable MI estimators to discover useful representation.
We propose the Mutual Information Gradient Estimator (MIGE) for representation learning based on the score estimation of implicit distributions.
arXiv Detail & Related papers (2020-05-03T16:05:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.