Zero-inflation in the Multivariate Poisson Lognormal Family
- URL: http://arxiv.org/abs/2405.14711v2
- Date: Fri, 12 Sep 2025 12:17:25 GMT
- Title: Zero-inflation in the Multivariate Poisson Lognormal Family
- Authors: Bastien Batardière, Julien Chiquet, François Gindraud, Mahendra Mariadassou,
- Abstract summary: We introduce the Zero-Inflated PLN model, adding a multivariate zero-inflated component to the model, as an additional Bernoulli latent variable.<n>We estimate model parameters using variational inference that scales up to datasets with a few thousands variables.<n>We then apply both ZIPLN and PLN to a cow microbiome dataset, containing 90.6% of zeroes.
- Score: 0.5249805590164902
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Analyzing high-dimensional count data is a challenge and statistical model-based approaches provide an adequate and efficient framework that preserves explainability. The (multivariate) Poisson-Log-Normal (PLN) model is one such model: it assumes count data are driven by an underlying structured latent Gaussian variable, so that the dependencies between counts solely stems from the latent dependencies. However PLN doesn't account for zero-inflation, a feature frequently observed in real-world datasets. Here we introduce the Zero-Inflated PLN (ZIPLN) model, adding a multivariate zero-inflated component to the model, as an additional Bernoulli latent variable. The Zero-Inflation can be fixed, site-specific, feature-specific or depends on covariates. We estimate model parameters using variational inference that scales up to datasets with a few thousands variables and compare two approximations: (i) independent Gaussian and Bernoulli variational distributions or (ii) Gaussian variational distribution conditioned on the Bernoulli one. The method is assessed on synthetic data and the efficiency of ZIPLN is established even when zero-inflation concerns up to 90% of the observed counts. We then apply both ZIPLN and PLN to a cow microbiome dataset, containing 90.6% of zeroes. Accounting for zero-inflation significantly increases log-likelihood and reduces dispersion in the latent space, thus leading to improved group discrimination.
Related papers
- Data-Driven Information-Theoretic Causal Bounds under Unmeasured Confounding [10.590231532335691]
We develop a data-driven information-theoretic framework for partial identification of conditional causal effects under unmeasured confounding.<n>Our key theoretical contribution shows that the f-divergence between the observational distribution P(Y | A = a, X = x) and the interventional distribution P(Y | do(A = a), X = x) is upper bounded by a function of the propensity score alone.<n>This result enables sharp partial identification of conditional causal effects directly from observational data, without requiring external sensitivity parameters, auxiliary variables, full structural specifications, or outcome boundedness assumptions.
arXiv Detail & Related papers (2026-01-23T20:47:48Z) - Efficient Covariance Estimation for Sparsified Functional Data [51.69796254617083]
proposed Random-knots (Random-knots-Spatial) and B-spline (Bspline-Spatial) estimators of the covariance function are computationally efficient.<n>Asymptotic pointwise of the covariance are obtained for sparsified individual trajectories under some regularity conditions.
arXiv Detail & Related papers (2025-11-23T00:50:33Z) - Spurious Correlations in High Dimensional Regression: The Roles of Regularization, Simplicity Bias and Over-Parameterization [19.261178173399784]
Learning models have been shown to rely on spurious correlations between non-predictive features and the associated labels in the training data.
We quantify the amount of spurious correlations $C$ learned via linear regression, in terms of the data covariance and the strength $lambda$ of the ridge regularization.
arXiv Detail & Related papers (2025-02-03T13:38:42Z) - Non-asymptotic bounds for forward processes in denoising diffusions: Ornstein-Uhlenbeck is hard to beat [49.1574468325115]
This paper presents explicit non-asymptotic bounds on the forward diffusion error in total variation (TV)
We parametrise multi-modal data distributions in terms of the distance $R$ to their furthest modes and consider forward diffusions with additive and multiplicative noise.
arXiv Detail & Related papers (2024-08-25T10:28:31Z) - Towards Faster Non-Asymptotic Convergence for Diffusion-Based Generative
Models [49.81937966106691]
We develop a suite of non-asymptotic theory towards understanding the data generation process of diffusion models.
In contrast to prior works, our theory is developed based on an elementary yet versatile non-asymptotic approach.
arXiv Detail & Related papers (2023-06-15T16:30:08Z) - Copula-Based Density Estimation Models for Multivariate Zero-Inflated
Continuous Data [0.0]
We propose two copula-based density estimation models that can cope with multivariate correlation among zero-inflated continuous variables.
In order to overcome the difficulty in the use of copulas due to the tied-data problem in zero-inflated data, we propose a new type of copula, rectified Gaussian copula.
arXiv Detail & Related papers (2023-04-02T13:43:37Z) - On counterfactual inference with unobserved confounding [36.18241676876348]
Given an observational study with $n$ independent but heterogeneous units, our goal is to learn the counterfactual distribution for each unit.
We introduce a convex objective that pools all $n$ samples to jointly learn all $n$ parameter vectors.
We derive sufficient conditions for compactly supported distributions to satisfy the logarithmic Sobolev inequality.
arXiv Detail & Related papers (2022-11-14T04:14:37Z) - Break The Spell Of Total Correlation In betaTCVAE [4.38301148531795]
This paper proposes a new iterative decomposition path of total correlation and explains the disentangled representation ability of VAE.
The novel model enables VAE to adjust the parameter capacity to divide dependent and independent data features flexibly.
arXiv Detail & Related papers (2022-10-17T07:16:53Z) - Learning Multivariate CDFs and Copulas using Tensor Factorization [39.24470798045442]
Learning the multivariate distribution of data is a core challenge in statistics and machine learning.
In this work, we aim to learn multivariate cumulative distribution functions (CDFs), as they can handle mixed random variables.
We show that any grid sampled version of a joint CDF of mixed random variables admits a universal representation as a naive Bayes model.
We demonstrate the superior performance of the proposed model in several synthetic and real datasets and applications including regression, sampling and data imputation.
arXiv Detail & Related papers (2022-10-13T16:18:46Z) - Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z) - Wasserstein Distributional Learning [5.830831796910439]
Wasserstein Distributional Learning (WDL) is a flexible density-on-scalar regression modeling framework.
We show that WDL better characterizes and uncovers the nonlinear dependence of the conditional densities.
We demonstrate the effectiveness of the WDL framework through simulations and real-world applications.
arXiv Detail & Related papers (2022-09-12T02:32:17Z) - $p$-Generalized Probit Regression and Scalable Maximum Likelihood
Estimation via Sketching and Coresets [74.37849422071206]
We study the $p$-generalized probit regression model, which is a generalized linear model for binary responses.
We show how the maximum likelihood estimator for $p$-generalized probit regression can be approximated efficiently up to a factor of $(1+varepsilon)$ on large data.
arXiv Detail & Related papers (2022-03-25T10:54:41Z) - Understanding the Under-Coverage Bias in Uncertainty Estimation [58.03725169462616]
quantile regression tends to emphunder-cover than the desired coverage level in reality.
We prove that quantile regression suffers from an inherent under-coverage bias.
Our theory reveals that this under-coverage bias stems from a certain high-dimensional parameter estimation error.
arXiv Detail & Related papers (2021-06-10T06:11:55Z) - Probabilistic Circuits for Variational Inference in Discrete Graphical
Models [101.28528515775842]
Inference in discrete graphical models with variational methods is difficult.
Many sampling-based methods have been proposed for estimating Evidence Lower Bound (ELBO)
We propose a new approach that leverages the tractability of probabilistic circuit models, such as Sum Product Networks (SPN)
We show that selective-SPNs are suitable as an expressive variational distribution, and prove that when the log-density of the target model is aweighted the corresponding ELBO can be computed analytically.
arXiv Detail & Related papers (2020-10-22T05:04:38Z) - Accounting for Unobserved Confounding in Domain Generalization [107.0464488046289]
This paper investigates the problem of learning robust, generalizable prediction models from a combination of datasets.
Part of the challenge of learning robust models lies in the influence of unobserved confounders.
We demonstrate the empirical performance of our approach on healthcare data from different modalities.
arXiv Detail & Related papers (2020-07-21T08:18:06Z) - Instability, Computational Efficiency and Statistical Accuracy [101.32305022521024]
We develop a framework that yields statistical accuracy based on interplay between the deterministic convergence rate of the algorithm at the population level, and its degree of (instability) when applied to an empirical object based on $n$ samples.
We provide applications of our general results to several concrete classes of models, including Gaussian mixture estimation, non-linear regression models, and informative non-response models.
arXiv Detail & Related papers (2020-05-22T22:30:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.