Conditional expectation with regularization for missing data imputation
- URL: http://arxiv.org/abs/2302.00911v3
- Date: Mon, 11 Sep 2023 07:41:52 GMT
- Title: Conditional expectation with regularization for missing data imputation
- Authors: Mai Anh Vu, Thu Nguyen, Tu T. Do, Nhan Phan, Nitesh V. Chawla, Pål
Halvorsen, Michael A. Riegler, Binh T. Nguyen
- Abstract summary: Missing data frequently occurs in datasets across various domains, such as medicine, sports, and finance.
We propose a new algorithm named "conditional Distribution-based Imputation of Missing Values with Regularization" (DIMV).
DIMV operates by determining the conditional distribution of a feature that has missing entries, using the information from the fully observed features as a basis.
- Score: 19.254291863337347
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Missing data frequently occurs in datasets across various domains, such as
medicine, sports, and finance. In many cases, to enable proper and reliable
analyses of such data, the missing values are imputed, and it is
necessary that the method used has a low root mean square error (RMSE) between
the imputed and the true values. In addition, for some critical applications,
it is also often a requirement that the imputation method is scalable and the
logic behind the imputation is explainable, which is especially difficult for
complex methods that are, for example, based on deep learning. Based on these
considerations, we propose a new algorithm named "conditional
Distribution-based Imputation of Missing Values with Regularization" (DIMV).
DIMV operates by determining the conditional distribution of a feature that has
missing entries, using the information from the fully observed features as a
basis. As will be illustrated via experiments in the paper, DIMV (i) gives a
low RMSE for the imputed values compared to state-of-the-art methods; (ii) is
fast and scalable; (iii) is explainable, as the imputation logic can be read
off as the coefficients of a regression model, allowing reliable and
trustworthy analysis and making it a suitable choice for critical domains
where understanding is important, such as medicine and finance; (iv) can
provide an approximated confidence region for the missing values in a given
sample; (v) is suitable for both small- and large-scale data; (vi) in many
scenarios, does not require a huge number of parameters as deep learning
approaches do; (vii) handles multicollinearity in imputation effectively; and
(viii) is robust to deviations from the normality assumption that its
theoretical grounds rely on.
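To make the mechanism concrete: under a multivariate normal assumption, the conditional expectation of the missing block x_m given the observed block x_o is mu_m + S_mo (S_oo)^{-1} (x_o - mu_o), and the regularization stabilizes the matrix inverse. The sketch below illustrates this idea; the function name, the ridge-style regularizer alpha on the diagonal of S_oo, and the pairwise-complete moment estimates are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def conditional_gaussian_impute(X, alpha=0.1):
    """Impute NaN entries of X by regularized conditional expectation.

    Sketch only: assumes rows of X are drawn from N(mu, Sigma); `alpha`
    is an assumed ridge-style regularizer, not DIMV's exact scheme.
    """
    X = np.asarray(X, dtype=float)
    mu = np.nanmean(X, axis=0)  # per-feature means, ignoring NaNs
    # Covariance from pairwise-complete observations via masked arrays.
    cov = np.ma.cov(np.ma.masked_invalid(X), rowvar=False).filled(0.0)

    X_imp = X.copy()
    for i, row in enumerate(X):
        miss = np.isnan(row)
        if not miss.any() or miss.all():
            continue  # nothing to impute, or nothing to condition on
        obs = ~miss
        S_oo = cov[np.ix_(obs, obs)] + alpha * np.eye(obs.sum())
        S_mo = cov[np.ix_(miss, obs)]
        # E[x_m | x_o] = mu_m + S_mo (S_oo + alpha I)^{-1} (x_o - mu_o)
        X_imp[i, miss] = mu[miss] + S_mo @ np.linalg.solve(S_oo, row[obs] - mu[obs])
    return X_imp
```

Because the imputed value is a linear function of the observed features, the row vector S_mo (S_oo + alpha I)^{-1} plays the role of regression coefficients, which is what makes the logic explainable; the conditional covariance S_mm - S_mo (S_oo + alpha I)^{-1} S_om is what an approximated confidence region, as in claim (iv), would be built from.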
Related papers
- A Targeted Accuracy Diagnostic for Variational Approximations [8.969208467611896] (2023-02-24)
Variational Inference (VI) is an attractive alternative to Markov Chain Monte Carlo (MCMC).
Existing methods characterize the quality of the whole variational distribution.
We propose the TArgeted Diagnostic for Distribution Approximation Accuracy (TADDAA).
- Validation Diagnostics for SBI algorithms based on Normalizing Flows [55.41644538483948] (2022-11-17)
This work proposes easy-to-interpret validation diagnostics for multi-dimensional conditional (posterior) density estimators based on NF.
It also offers theoretical guarantees based on results of local consistency.
This work should help the design of better specified models or drive the development of novel SBI-algorithms.
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306] (2022-01-11)
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts accuracy as the fraction of unlabeled target examples whose confidence exceeds that threshold.
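The thresholding rule summarized above is simple enough to sketch. Below is a hedged illustration that assumes a generic per-example confidence score (e.g., maximum softmax probability); the names are illustrative, not the authors' reference implementation, and the paper also studies other score functions.

```python
import numpy as np

def atc_predict_accuracy(source_conf, source_correct, target_conf):
    """Average Thresholded Confidence (ATC), sketched from the summary above.

    Choose a threshold t so that the fraction of labeled source examples
    with confidence above t matches source accuracy, then predict target
    accuracy as the fraction of unlabeled target confidences above t.
    """
    src_acc = np.mean(source_correct)            # accuracy on labeled source data
    t = np.quantile(source_conf, 1.0 - src_acc)  # P(conf > t) ~= src_acc on source
    return np.mean(target_conf > t)              # predicted target accuracy
```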
- Multiple Imputation via Generative Adversarial Network for High-dimensional Blockwise Missing Value Problems [6.123324869194195] (2021-12-21)
We propose Multiple Imputation via Generative Adversarial Network (MI-GAN), a deep learning-based (specifically, a GAN-based) multiple imputation method.
MI-GAN shows strong performance matching existing state-of-the-art imputation methods on high-dimensional datasets.
In particular, MI-GAN significantly outperforms other imputation methods in the sense of statistical inference and computational speed.
- RIFLE: Imputation and Robust Inference from Low Order Marginals [10.082738539201804] (2021-09-01)
We develop a statistical inference framework for regression and classification in the presence of missing data without imputation.
Our framework, RIFLE, estimates low-order moments of the underlying data distribution with corresponding confidence intervals to learn a distributionally robust model.
Our experiments demonstrate that RIFLE outperforms other benchmark algorithms when the percentage of missing values is high and/or when the number of data points is relatively small.
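The low-order moments that RIFLE builds on can still be estimated when values are missing, by using, for each mean and each covariance entry, only the rows where the relevant features are observed. A minimal sketch of that estimation step follows; RIFLE's distributionally robust training on top of these moments is not reproduced here.

```python
import numpy as np

def low_order_moments(X):
    """Estimate mean vector and covariance matrix from data with NaNs,
    using only jointly observed entries for each pairwise covariance.
    Sketches the moment-estimation ingredient only, not RIFLE itself."""
    Xm = np.ma.masked_invalid(np.asarray(X, dtype=float))
    mean = Xm.mean(axis=0).filled(np.nan)             # per-feature observed mean
    cov = np.ma.cov(Xm, rowvar=False).filled(np.nan)  # pairwise-complete covariance
    return mean, cov
```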
- Meta-Learning for Relative Density-Ratio Estimation [59.75321498170363] (2021-07-02)
Existing methods for (relative) density-ratio estimation (DRE) require many instances from both densities.
We propose a meta-learning method for relative DRE, which estimates the relative density-ratio from a few instances by using knowledge in related datasets.
We empirically demonstrate the effectiveness of the proposed method by using three problems: relative DRE, dataset comparison, and outlier detection.
- Scalable Marginal Likelihood Estimation for Model Selection in Deep Learning [78.83598532168256] (2021-04-11)
Marginal-likelihood based model-selection is rarely used in deep learning due to estimation difficulties.
Our work shows that marginal likelihoods can improve generalization and be useful when validation data is unavailable.
- Accurate and Robust Feature Importance Estimation under Distribution Shifts [49.58991359544005] (2020-09-30)
PRoFILE is a novel feature importance estimation method.
We show significant improvements over state-of-the-art approaches, both in terms of fidelity and robustness.
- ELMV: an Ensemble-Learning Approach for Analyzing Electrical Health Records with Significant Missing Values [4.9810955364960385] (2020-06-25)
We propose a novel Ensemble-Learning for Missing Value (ELMV) framework, which introduces an effective approach to construct multiple subsets of the original EHR data with a much lower missing rate.
ELMV has been evaluated on real-world healthcare data for critical feature identification, as well as on a batch of simulated data with different missing rates for outcome prediction.
- MissDeepCausal: Causal Inference from Incomplete Data Using Deep Latent Variable Models [14.173184309520453] (2020-02-25)
State-of-the-art methods for causal inference do not consider missing values.
Missing data require an adapted unconfoundedness hypothesis.
The method considers latent confounders whose distribution is learned through variational autoencoders adapted to missing values.
- Localized Debiased Machine Learning: Efficient Inference on Quantile Treatment Effects and Beyond [69.83813153444115] (2019-12-30)
We consider an efficient estimating equation for the (local) quantile treatment effect ((L)QTE) in causal inference.
Debiased machine learning (DML) is a data-splitting approach to estimating high-dimensional nuisances.
We propose localized debiased machine learning (LDML), which avoids this burdensome step.
This list is automatically generated from the titles and abstracts of the papers on this site.