VAEs in the Presence of Missing Data
- URL: http://arxiv.org/abs/2006.05301v3
- Date: Sun, 21 Mar 2021 11:42:08 GMT
- Title: VAEs in the Presence of Missing Data
- Authors: Mark Collier, Alfredo Nazabal and Christopher K.I. Williams
- Abstract summary: We develop a novel latent variable model of a corruption process which generates missing data, and derive a corresponding tractable evidence lower bound (ELBO).
Our model is straightforward to implement, handles both missing completely at random (MCAR) and missing not at random (MNAR) data, scales to high-dimensional inputs, and gives both the VAE encoder and decoder access to indicator variables marking whether each data element is missing.
On the MNIST and SVHN datasets we demonstrate improved marginal log-likelihood of observed data and better missing data imputation, compared to existing approaches.
- Score: 6.397263087026567
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Real world datasets often contain entries with missing elements, e.g., in a medical dataset a patient is unlikely to have taken all possible diagnostic tests. Variational Autoencoders (VAEs) are popular generative models often used for unsupervised learning. Despite their widespread use, it is unclear how best to apply VAEs to datasets with missing data. We develop a novel latent variable model of a corruption process which generates missing data, and derive a corresponding tractable evidence lower bound (ELBO). Our model is straightforward to implement, can handle both missing completely at random (MCAR) and missing not at random (MNAR) data, scales to high dimensional inputs and gives both the VAE encoder and decoder principled access to indicator variables for whether a data element is missing or not. On the MNIST and SVHN datasets we demonstrate improved marginal log-likelihood of observed data and better missing data imputation, compared to existing approaches.
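To make the setup concrete, here is a minimal sketch (PyTorch; the layer sizes, names, and Bernoulli likelihood are illustrative assumptions, not the authors' implementation) of a VAE whose encoder and decoder both receive the missingness mask, with the ELBO's reconstruction term evaluated only on observed elements:
```python
import torch
import torch.nn as nn

class MaskConditionedVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=32, h_dim=256):
        super().__init__()
        # Encoder sees [x * m, m]: zero-filled data plus the mask itself.
        self.enc = nn.Sequential(nn.Linear(2 * x_dim, h_dim), nn.ReLU())
        self.mu, self.logvar = nn.Linear(h_dim, z_dim), nn.Linear(h_dim, z_dim)
        # Decoder sees [z, m], so it too can exploit the missingness pattern.
        self.dec = nn.Sequential(nn.Linear(z_dim + x_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def elbo(self, x, m):
        """x: data in [0, 1] with zeros at missing spots; m: 1 = observed."""
        h = self.enc(torch.cat([x * m, m], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        logits = self.dec(torch.cat([z, m], dim=-1))
        # Bernoulli log-likelihood summed over *observed* elements only.
        rec = -(nn.functional.binary_cross_entropy_with_logits(
            logits, x, reduction="none") * m).sum(-1)
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1 - logvar).sum(-1)
        return (rec - kl).mean()
```
Zero-filling missing entries and concatenating the mask is one simple way to expose missingness indicators to both networks; the paper's corruption-process model and derived ELBO are richer than this sketch.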
Related papers
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z) - The Missing Indicator Method: From Low to High Dimensions [16.899237833310064]
- The Missing Indicator Method: From Low to High Dimensions [16.899237833310064]
Missing data is common in applied data science, particularly in healthcare, social sciences, and natural sciences.
For data sets with informative missing patterns, the Missing Indicator Method (MIM) can be used in conjunction with imputation to improve model performance.
We show experimentally that MIM improves performance for informative missing values, and we prove that MIM does not hurt linear models for uninformative missing values.
We introduce Selective MIM, a method that adds missing indicators only for features that have informative missing patterns.
arXiv Detail & Related papers (2022-11-16T23:10:45Z) - Learning from aggregated data with a maximum entropy model [73.63512438583375]
- Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that a model learned this way achieves performance comparable to that of a logistic model trained on the full, unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z) - Leveraging variational autoencoders for multiple data imputation [0.5156484100374059]
- Leveraging variational autoencoders for multiple data imputation [0.5156484100374059]
We investigate the ability of deep models, namely variational autoencoders (VAEs), to account for uncertainty in missing data through multiple imputation strategies.
We find that VAEs provide poor empirical coverage of missing data, with underestimation and overconfident imputations.
To overcome this, we employ $\beta$-VAEs, which, viewed from a generalized Bayes framework, provide robustness to model misspecification.
arXiv Detail & Related papers (2022-09-30T08:58:43Z) - MissDAG: Causal Discovery in the Presence of Missing Data with
- MissDAG: Causal Discovery in the Presence of Missing Data with Continuous Additive Noise Models [78.72682320019737]
We develop a general method, which we call MissDAG, to perform causal discovery from data with incomplete observations.
MissDAG maximizes the expected likelihood of the visible part of observations under the expectation-maximization framework.
We demonstrate the flexibility of MissDAG for incorporating various causal discovery algorithms and its efficacy through extensive simulations and real data experiments.
arXiv Detail & Related papers (2022-05-27T09:59:46Z) - MURAL: An Unsupervised Random Forest-Based Embedding for Electronic
- MURAL: An Unsupervised Random Forest-Based Embedding for Electronic Health Record Data [59.26381272149325]
We present an unsupervised random forest for representing data with disparate variable types.
MURAL forests consist of a set of decision trees where node-splitting variables are chosen at random.
We show that using our approach, we can visualize and classify data more accurately than competing approaches.
arXiv Detail & Related papers (2021-11-19T22:02:21Z) - Identifiable Generative Models for Missing Not at Random Data Imputation [13.790820495804567]
- Identifiable Generative Models for Missing Not at Random Data Imputation [13.790820495804567]
Many imputation methods do not take into account the missingness mechanism, resulting in biased imputation values when MNAR data is present.
In this work, we analyze the identifiability of generative models under MNAR.
We propose a practical deep generative model which can provide identifiability guarantees under mild assumptions.
arXiv Detail & Related papers (2021-10-27T18:51:38Z) - Imputation-Free Learning from Incomplete Observations [73.15386629370111]
We introduce the importance guided stochastic gradient descent (IGSGD) method to train models to perform inference directly from inputs containing missing values, without imputation.
We employ reinforcement learning (RL) to adjust the gradients used to train the models via back-propagation.
Our imputation-free predictions outperform the traditional two-step imputation-based predictions using state-of-the-art imputation methods.
arXiv Detail & Related papers (2021-07-05T12:44:39Z) - Discriminative-Generative Dual Memory Video Anomaly Detection [81.09977516403411]
Recent work has used a small number of labeled anomalies, rather than only normal data, when training video anomaly detection (VAD) models.
We propose a DiscRiminative-gEnerative duAl Memory (DREAM) anomaly detection model to take advantage of a few anomalies and solve data imbalance.
arXiv Detail & Related papers (2021-04-29T15:49:01Z) - Medical data wrangling with sequential variational autoencoders [5.9207487081080705]
This paper proposes to model medical data records, with heterogeneous data types and bursty missing data, using sequential variational autoencoders (VAEs).
We show that the proposed Shi-VAE achieves the best performance on both metrics, with lower computational complexity than the GP-VAE model.
arXiv Detail & Related papers (2021-03-12T10:59:26Z) - Multiple Imputation with Denoising Autoencoder using Metamorphic Truth
and Imputation Feedback [0.0]
We propose a Multiple Imputation model using Denoising Autoencoders to learn the internal representation of data.
We use the novel mechanisms of Metamorphic Truth and Imputation Feedback to maintain statistical integrity of attributes.
Our approach is evaluated across various missingness mechanisms and patterns of missing data, and outperforms other methods in many standard test cases.
arXiv Detail & Related papers (2020-02-19T18:26:59Z)