Related papers: Identifiable Generative Models for Missing Not at Random Data Imputation

URL: http://arxiv.org/abs/2110.14708v1
Date: Wed, 27 Oct 2021 18:51:38 GMT
Title: Identifiable Generative Models for Missing Not at Random Data Imputation
Authors: Chao Ma and Cheng Zhang
Abstract summary: Many imputation methods do not take into account the missingness mechanism, resulting in biased imputation values when MNAR data is present. In this work, we analyze the identifiability of generative models under MNAR. We propose a practical deep generative model which can provide identifiability guarantees under mild assumptions.
Score: 13.790820495804567
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Real-world datasets often have missing values associated with complex generative processes, where the cause of the missingness may not be fully observed. This is known as missing not at random (MNAR) data. However, many imputation methods do not take into account the missingness mechanism, resulting in biased imputation values when MNAR data is present. Although there are a few methods that have considered the MNAR scenario, their model's identifiability under MNAR is generally not guaranteed. That is, model parameters can not be uniquely determined even with infinite data samples, hence the imputation results given by such models can still be biased. This issue is especially overlooked by many modern deep generative models. In this work, we fill in this gap by systematically analyzing the identifiability of generative models under MNAR. Furthermore, we propose a practical deep generative model which can provide identifiability guarantees under mild assumptions, for a wide range of MNAR mechanisms. Our method demonstrates a clear advantage for tasks on both synthetic data and multiple real-world scenarios with MNAR data.
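As a rough illustration of the bias the abstract describes (this is not the authors' model), the sketch below simulates a hypothetical self-masking MNAR mechanism in which larger values are more likely to be missing. The function name `simulate_mnar_bias`, the standard normal data, and the sigmoid missingness rule are all assumptions chosen for the example. Any imputation that ignores the mechanism, such as filling in the observed mean, inherits the shift shown here.

```python
import math
import random
import statistics


def simulate_mnar_bias(n=20000, seed=0):
    """Compare the full-data mean with the observed-only mean under self-masking MNAR."""
    rng = random.Random(seed)

    # Complete data: standard normal draws, so the true mean is close to 0.
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]

    # Hypothetical mechanism (an assumption for this sketch):
    # P(missing | x) = sigmoid(2 * x), so large values go missing more often.
    observed = [
        v for v in x
        if rng.random() > 1.0 / (1.0 + math.exp(-2.0 * v))
    ]

    true_mean = statistics.fmean(x)
    # Mean imputation would fill every gap with this value, so the imputed
    # dataset's mean equals the observed-only mean.
    observed_mean = statistics.fmean(observed)
    return true_mean, observed_mean, len(observed)


if __name__ == "__main__":
    true_mean, observed_mean, n_obs = simulate_mnar_bias()
    print(f"true mean:     {true_mean:.3f}")
    print(f"observed mean: {observed_mean:.3f} (from {n_obs} observed values)")
```

Because the missingness depends on the unobserved values themselves, collecting more data does not shrink the gap between the two means; only modelling the mechanism can.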

Related papers

Recursive Equations For Imputation Of Missing Not At Random Data With Sparse Pattern Support [8.863778901027061]
A common approach for handling missing values in data analysis pipelines is multiple imputation via software packages. We develop a new characterization for the full data law in graphical models of missing data. We show MISPR obtains comparable results to MICE when data are MAR, and superior, less biased results when data are MNAR.
arXiv Detail & Related papers (2025-07-21T23:18:36Z)
Deep Generative Imputation Model for Missing Not At Random Data [13.56794299885683]
We exploit a deep generative imputation model, namely GNR, to process the real-world missing mechanism in the latent space. The experimental results show that our GNR surpasses state-of-the-art MNAR baselines with significant margins.
arXiv Detail & Related papers (2023-08-16T06:01:12Z)
Sufficient Identification Conditions and Semiparametric Estimation under Missing Not at Random Mechanisms [4.211128681972148]
Conducting valid statistical analyses is challenging in the presence of missing-not-at-random (MNAR) data. We consider an MNAR model that generalizes several prior popular MNAR models in two ways. We propose methods for testing the independence restrictions encoded in such models using the odds ratio as our parameter of interest.
arXiv Detail & Related papers (2023-06-10T13:46:16Z)
Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis. We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
MissDAG: Causal Discovery in the Presence of Missing Data with Continuous Additive Noise Models [78.72682320019737]
We develop a general method, which we call MissDAG, to perform causal discovery from data with incomplete observations. MissDAG maximizes the expected likelihood of the visible part of observations under the expectation-maximization framework. We demonstrate the flexibility of MissDAG for incorporating various causal discovery algorithms and its efficacy through extensive simulations and real data experiments.
arXiv Detail & Related papers (2022-05-27T09:59:46Z)
Learning Hidden Markov Models When the Locations of Missing Observations are Unknown [54.40592050737724]
We consider the general problem of learning an HMM from data with unknown missing observation locations. We provide reconstruction algorithms that do not require any assumptions about the structure of the underlying chain. We show that under proper specifications one can reconstruct the process dynamics as well as if the positions of the missing observations were known.
arXiv Detail & Related papers (2022-03-12T22:40:43Z)
Model-based Clustering with Missing Not At Random Data [0.8777702580252754]
We propose model-based clustering algorithms designed to handle very general types of missing data, including MNAR data. Several MNAR models are discussed, for which the cause of the missingness can depend on both the values of the missing variable themselves and on the class membership. We focus on a specific MNAR model, called MNARz, for which the missingness only depends on the class membership.
arXiv Detail & Related papers (2021-12-20T09:52:12Z)
MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms [82.90843777097606]
We propose a causally-aware imputation algorithm (MIRACLE) for missing data. MIRACLE iteratively refines the imputation of a baseline by simultaneously modeling the missingness generating mechanism. We conduct extensive experiments on synthetic and a variety of publicly available datasets to show that MIRACLE is able to consistently improve imputation.
arXiv Detail & Related papers (2021-11-04T22:38:18Z)
Deep Generative Pattern-Set Mixture Models for Nonignorable Missingness [0.0]
We propose a variational autoencoder architecture to model both ignorable and nonignorable missing data. Our model explicitly learns to cluster the missing data into missingness pattern sets based on the observed data and missingness masks. Our setup trades off the characteristics of ignorable and nonignorable missingness and can thus be applied to data of both types.
arXiv Detail & Related papers (2021-03-05T08:21:35Z)
Robust Finite Mixture Regression for Heterogeneous Targets [70.19798470463378]
We propose an FMR model that finds sample clusters and jointly models multiple incomplete mixed-type targets simultaneously. We provide non-asymptotic oracle performance bounds for our model under a high-dimensional learning framework. The results show that our model can achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-10-12T03:27:07Z)
Data from Model: Extracting Data from Non-robust and Robust Models [83.60161052867534]
This work explores the reverse process of generating data from a model, attempting to reveal the relationship between the data and the model. We repeat the process of Data to Model (DtM) and Data from Model (DfM) in sequence and explore the loss of feature mapping information. Our results show that the accuracy drop is limited even after multiple sequences of DtM and DfM, especially for robust models.
arXiv Detail & Related papers (2020-07-13T05:27:48Z)
VAEs in the Presence of Missing Data [6.397263087026567]
We develop a novel latent variable model of a corruption process which generates missing data, and derive a corresponding tractable evidence lower bound (ELBO). Our model is straightforward to implement, can handle both missing completely at random (MCAR) and missing not at random (MNAR) data, scales to high-dimensional inputs, and gives both the VAE encoder and decoder access to indicator variables for whether a data element is missing or not. On the MNIST and SVHN datasets we demonstrate improved marginal log-likelihood of observed data and better missing data imputation, compared to existing approaches.
arXiv Detail & Related papers (2020-06-09T14:40:00Z)

This list is automatically generated from the titles and abstracts of the papers on this site.