Model-based Clustering with Missing Not At Random Data
- URL: http://arxiv.org/abs/2112.10425v4
- Date: Fri, 22 Dec 2023 08:45:34 GMT
- Title: Model-based Clustering with Missing Not At Random Data
- Authors: Aude Sportisse (UCA, MAASAI), Matthieu Marbac (UR, ENSAI, CNRS,
CREST), Fabien Laporte (Nantes Univ, CNRS, ITX-lab), Gilles Celeux (CELESTE),
Claire Boyer (SU, LPSM (UMR\_8001), MOKAPLAN), Julie Josse (IDESP,
PREMEDICAL), Christophe Biernacki (CNRS, MODAL)
- Abstract summary: We propose model-based clustering algorithms designed to handle very general types of missing data, including MNAR data.
Several MNAR models are discussed, for which the cause of the missingness can depend on both the values of the missing variable themselves and on the class membership.
We focus on a specific MNAR model, called MNARz, for which the missingness only depends on the class membership.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Model-based unsupervised learning, like any learning task, stalls as
soon as missing data occur. This is even more true when the missing data are
informative, that is, missing not at random (MNAR). In this paper, we propose
model-based clustering algorithms designed to handle very general types of
missing data, including MNAR data. To do so, we introduce a mixture model for
different types of data (continuous, count, categorical and mixed) to jointly
model the data distribution and the MNAR mechanism, remaining vigilant to the
relative degrees of freedom of each. Several MNAR models are discussed, for
which the cause of the missingness can depend on both the values of the missing
variable themselves and on the class membership. However, we focus on a
specific MNAR model, called MNARz, for which the missingness only depends on
the class membership. We first underline its ease of estimation by showing
that statistical inference can be carried out on the data matrix concatenated
with the missing mask, finally treating it under a standard MAR mechanism.
Consequently, we propose to perform clustering using the Expectation
Maximization algorithm, specially developed for this simplified
reinterpretation. Finally, we assess the numerical performance of the proposed
methods on synthetic data as well as on the real medical registry TraumaBase.
Related papers
- Attribute-to-Delete: Machine Unlearning via Datamodel Matching [65.13151619119782]
Machine unlearning -- efficiently removing a small "forget set" of training data from a pre-trained machine learning model -- has recently attracted interest.
Recent research shows that existing machine unlearning techniques do not hold up in challenging evaluation settings.
arXiv Detail & Related papers (2024-10-30T17:20:10Z)
- Deep Generative Imputation Model for Missing Not At Random Data [13.56794299885683]
We exploit a deep generative imputation model, namely GNR, to process the real-world missing mechanism in the latent space.
The experimental results show that our GNR surpasses state-of-the-art MNAR baselines with significant margins.
arXiv Detail & Related papers (2023-08-16T06:01:12Z)
- SequenceMatch: Imitation Learning for Autoregressive Sequence Modelling with Backtracking [60.109453252858806]
A maximum-likelihood (MLE) objective does not match a downstream use-case of autoregressively generating high-quality sequences.
We formulate sequence generation as an imitation learning (IL) problem.
This allows us to minimize a variety of divergences between the distribution of sequences generated by an autoregressive model and sequences from a dataset.
Our resulting method, SequenceMatch, can be implemented without adversarial training or architectural changes.
arXiv Detail & Related papers (2023-06-08T17:59:58Z)
- Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
- MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms [82.90843777097606]
We propose a causally-aware imputation algorithm (MIRACLE) for missing data.
MIRACLE iteratively refines the imputation of a baseline by simultaneously modeling the missingness generating mechanism.
We conduct extensive experiments on synthetic and a variety of publicly available datasets to show that MIRACLE is able to consistently improve imputation.
arXiv Detail & Related papers (2021-11-04T22:38:18Z)
- Identifiable Generative Models for Missing Not at Random Data Imputation [13.790820495804567]
Many imputation methods do not take into account the missingness mechanism, resulting in biased imputation values when MNAR data is present.
In this work, we analyze the identifiability of generative models under MNAR.
We propose a practical deep generative model which can provide identifiability guarantees under mild assumptions.
arXiv Detail & Related papers (2021-10-27T18:51:38Z)
- Deep Generative Pattern-Set Mixture Models for Nonignorable Missingness [0.0]
We propose a variational autoencoder architecture to model both ignorable and nonignorable missing data.
Our model explicitly learns to cluster the missing data into missingness pattern sets based on the observed data and missingness masks.
Our setup trades off the characteristics of ignorable and nonignorable missingness and can thus be applied to data of both types.
arXiv Detail & Related papers (2021-03-05T08:21:35Z)
- Learning from missing data with the Latent Block Model [0.5735035463793007]
We propose a co-clustering model, based on the Latent Block Model, that aims to take advantage of Missing Not At Random data.
A variational expectation-maximization algorithm is derived to perform inference and a model selection criterion is presented.
arXiv Detail & Related papers (2020-10-23T08:11:43Z)
- Robust Finite Mixture Regression for Heterogeneous Targets [70.19798470463378]
We propose an FMR model that finds sample clusters and jointly models multiple incomplete mixed-type targets.
We provide non-asymptotic oracle performance bounds for our model under a high-dimensional learning framework.
The results show that our model can achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-10-12T03:27:07Z)
- Model Fusion with Kullback--Leibler Divergence [58.20269014662046]
We propose a method to fuse posterior distributions learned from heterogeneous datasets.
Our algorithm relies on a mean field assumption for both the fused model and the individual dataset posteriors.
arXiv Detail & Related papers (2020-07-13T03:27:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.