Bayesian data combination model with Gaussian process latent variable
model for mixed observed variables under NMAR missingness
- URL: http://arxiv.org/abs/2109.00462v1
- Date: Wed, 1 Sep 2021 16:09:55 GMT
- Title: Bayesian data combination model with Gaussian process latent variable
model for mixed observed variables under NMAR missingness
- Authors: Masaki Mitsuhiro, Takahiro Hoshino
- Abstract summary: It is difficult to obtain a "(quasi) single-source dataset" in which the variables of interest are simultaneously observed.
It is necessary to utilize these datasets as a single-source dataset with missing variables.
We propose a data fusion method that does not assume that datasets are homogenous.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the analysis of observational data in social sciences and businesses, it
is difficult to obtain a "(quasi) single-source dataset" in which the variables
of interest are simultaneously observed. Instead, multiple-source datasets are
typically acquired for different individuals or units. Various methods have
been proposed to investigate the relationship between the variables in each
dataset, e.g., matching and latent variable modeling. It is necessary to
utilize these datasets as a single-source dataset with missing variables.
Existing methods assume that the datasets to be integrated are acquired from
the same population or that the sampling depends on covariates. This assumption
is referred to as missing at random (MAR) in terms of missingness. However, as
will been shown in application studies, it is likely that this assumption does
not hold in actual data analysis and the results obtained may be biased. We
propose a data fusion method that does not assume that datasets are homogenous.
We use a Gaussian process latent variable model for non-MAR missing data. This
model assumes that the variables of concern and the probability of being
missing depend on latent variables. A simulation study and real-world data
analysis show that the proposed method with a missing-data mechanism and the
latent Gaussian process yields valid estimates, whereas an existing method
provides severely biased estimates. This is the first study in which non-random
assignment to datasets is considered and resolved under resonable assumptions
in data fusion problem.
Related papers
- Learning to Bound Counterfactual Inference in Structural Causal Models
from Observational and Randomised Data [64.96984404868411]
We derive a likelihood characterisation for the overall data that leads us to extend a previous EM-based algorithm.
The new algorithm learns to approximate the (unidentifiability) region of model parameters from such mixed data sources.
It delivers interval approximations to counterfactual results, which collapse to points in the identifiable case.
arXiv Detail & Related papers (2022-12-06T12:42:11Z) - Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z) - A Graphical Model for Fusing Diverse Microbiome Data [2.385985842958366]
We introduce a flexible multinomial-Gaussian generative model for jointly modeling such count data.
We present a computationally scalable variational Expectation-Maximization (EM) algorithm for inferring the latent variables and the parameters of the model.
arXiv Detail & Related papers (2022-08-21T17:54:39Z) - Equivariance Discovery by Learned Parameter-Sharing [153.41877129746223]
We study how to discover interpretable equivariances from data.
Specifically, we formulate this discovery process as an optimization problem over a model's parameter-sharing schemes.
Also, we theoretically analyze the method for Gaussian data and provide a bound on the mean squared gap between the studied discovery scheme and the oracle scheme.
arXiv Detail & Related papers (2022-04-07T17:59:19Z) - Equivariance Allows Handling Multiple Nuisance Variables When Analyzing
Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show how bringing recent results on equivariant representation learning instantiated on structured spaces together with simple use of classical results on causal inference provides an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z) - A Robust and Flexible EM Algorithm for Mixtures of Elliptical
Distributions with Missing Data [71.9573352891936]
This paper tackles the problem of missing data imputation for noisy and non-Gaussian data.
A new EM algorithm is investigated for mixtures of elliptical distributions with the property of handling potential missing data.
Experimental results on synthetic data demonstrate that the proposed algorithm is robust to outliers and can be used with non-Gaussian data.
arXiv Detail & Related papers (2022-01-28T10:01:37Z) - Multimodal Data Fusion in High-Dimensional Heterogeneous Datasets via
Generative Models [16.436293069942312]
We are interested in learning probabilistic generative models from high-dimensional heterogeneous data in an unsupervised fashion.
We propose a general framework that combines disparate data types through the exponential family of distributions.
The proposed algorithm is presented in detail for the commonly encountered heterogeneous datasets with real-valued (Gaussian) and categorical (multinomial) features.
arXiv Detail & Related papers (2021-08-27T18:10:31Z) - Variational Selective Autoencoder: Learning from Partially-Observed
Heterogeneous Data [45.23338389559936]
We propose the variational selective autoencoder (VSAE) to learn representations from partially-observed heterogeneous data.
VSAE learns the latent dependencies in heterogeneous data by modeling the joint distribution of observed data, unobserved data, and the imputation mask.
It results in a unified model for various downstream tasks including data generation and imputation.
arXiv Detail & Related papers (2021-02-25T04:39:13Z) - The UU-test for Statistical Modeling of Unimodal Data [0.20305676256390928]
We propose a technique called UU-test (Unimodal Uniform test) to decide on the unimodality of a one-dimensional dataset.
A unique feature of this approach is that in the case of unimodality, it also provides a statistical model of the data in the form of a Uniform Mixture Model.
arXiv Detail & Related papers (2020-08-28T08:34:28Z) - Tell Me Something I Don't Know: Randomization Strategies for Iterative
Data Mining [0.6100370338020054]
We consider the problem of randomizing data so that previously discovered patterns or models are taken into account.
In this paper, we consider the problem of randomizing data so that previously discovered patterns or models are taken into account.
arXiv Detail & Related papers (2020-06-16T19:20:50Z) - Stable Prediction via Leveraging Seed Variable [73.9770220107874]
Previous machine learning methods might exploit subtly spurious correlations in training data induced by non-causal variables for prediction.
We propose a conditional independence test based algorithm to separate causal variables with a seed variable as priori, and adopt them for stable prediction.
Our algorithm outperforms state-of-the-art methods for stable prediction.
arXiv Detail & Related papers (2020-06-09T06:56:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.