A new LDA formulation with covariates
- URL: http://arxiv.org/abs/2202.11527v1
- Date: Fri, 18 Feb 2022 19:58:24 GMT
- Title: A new LDA formulation with covariates
- Authors: Gilson Shimizu, Rafael Izbicki and Denis Valle
- Abstract summary: The Latent Dirichlet Allocation model is a popular method for creating mixed-membership clusters.
We propose a new formulation for the LDA model which incorporates covariates.
We use slice sampling within a Gibbs sampling algorithm to estimate model parameters.
The model is illustrated using real data sets from three different areas: text-mining of Coronavirus articles, analysis of grocery shopping baskets, and ecology of tree species on Barro Colorado Island (Panama).
- Score: 3.1690891866882236
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Latent Dirichlet Allocation (LDA) model is a popular method for creating
mixed-membership clusters. Despite having been originally developed for text
analysis, LDA has been used for a wide range of other applications. We propose
a new formulation for the LDA model which incorporates covariates. In this
model, a negative binomial regression is embedded within LDA, enabling
straightforward interpretation of the regression coefficients and the analysis
of the quantity of cluster-specific elements in each sampling unit (instead of
the analysis being focused on modeling the proportion of each cluster, as in
Structural Topic Models). We use slice sampling within a Gibbs sampling
algorithm to estimate model parameters. We use simulations to show that our
algorithm successfully retrieves the true parameter values and can predict
the abundance matrix using the information
given by the covariates. The model is illustrated using real data sets from
three different areas: text-mining of Coronavirus articles, analysis of grocery
shopping baskets, and ecology of tree species on Barro Colorado Island
(Panama). This model allows the identification of mixed-membership clusters in
discrete data and provides inference on the relationship between covariates and
the abundance of these clusters.
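The abstract does not spell out the likelihood, so the following is only a minimal sketch of one plausible reading of "a negative binomial regression embedded within LDA". It assumes that the number of elements of cluster k in sampling unit l follows a negative binomial distribution with mean exp(x_l · beta_k), and that those elements are then spread over the observed categories (words, products, species) with a multinomial draw from a cluster-specific distribution phi_k. The function name `simulate_lda_covariates`, the log link, the shared `dispersion` parameter, and all dimensions are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def simulate_lda_covariates(X, beta, phi, dispersion, rng):
    """Simulate an abundance matrix under an assumed covariate-LDA model.

    X          : (L, P) covariates for L sampling units
    beta       : (P, K) regression coefficients, one column per cluster
    phi        : (K, S) per-cluster distributions over S categories
    dispersion : negative binomial size parameter (assumed shared)
    """
    # Assumed log link: expected number of cluster-k elements in unit l.
    mu = np.exp(X @ beta)                      # shape (L, K)
    # NumPy parameterizes the negative binomial by (n, p); with
    # p = n / (n + mu) the mean of the draw equals mu.
    p = dispersion / (dispersion + mu)
    n_lk = rng.negative_binomial(dispersion, p)  # shape (L, K)

    n_units, n_clusters = n_lk.shape
    n_categories = phi.shape[1]
    y = np.zeros((n_units, n_categories), dtype=np.int64)
    for l in range(n_units):
        for k in range(n_clusters):
            # Spread the cluster-k elements of unit l over the categories.
            y[l] += rng.multinomial(n_lk[l, k], phi[k])
    return n_lk, y


rng = np.random.default_rng(0)
L, P, K, S = 50, 2, 3, 10
X = np.column_stack([np.ones(L), rng.normal(size=L)])  # intercept + 1 covariate
beta = rng.normal(scale=0.5, size=(P, K))
phi = rng.dirichlet(np.ones(S), size=K)
n_lk, y = simulate_lda_covariates(X, beta, phi, dispersion=5.0, rng=rng)
```

Under the assumed log link, exp(beta[j, k]) is the multiplicative change in the expected number of cluster-k elements for a one-unit increase in covariate j, which is one way to read the "straightforward interpretation" of coefficients that the abstract mentions. Here `y` plays the role of the observed abundance matrix, while `n_lk` would be latent in a real analysis.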
Related papers
- Induced Covariance for Causal Discovery in Linear Sparse Structures [55.2480439325792]
Causal models seek to unravel the cause-effect relationships among variables from observed data.
This paper introduces a novel causal discovery algorithm designed for settings in which variables exhibit linearly sparse relationships.
(2024-10-02T04:01:38Z)
- Finite Mixtures of Multivariate Poisson-Log Normal Factor Analyzers for Clustering Count Data [0.8499685241219366]
A class of eight parsimonious mixture models based on the mixtures of factor analyzers model is introduced.
The proposed models are explored in the context of clustering discrete data arising from RNA sequencing studies.
(2023-11-13T21:23:15Z)
- Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
(2022-10-05T09:17:27Z)
- Gaussian Process Koopman Mode Decomposition [5.888646114353371]
We propose a nonlinear probabilistic generative model of Koopman mode decomposition based on an unsupervised Gaussian process.
Applying the proposed model to both synthetic data and a real-world epidemiological dataset, we show that various analyses are available using the estimated parameters.
(2022-09-09T03:57:07Z)
- Personalized Federated Learning via Convex Clustering [72.15857783681658]
We propose a family of algorithms for personalized federated learning with locally convex user costs.
The proposed framework is based on a generalization of convex clustering in which the differences between different users' models are penalized.
(2022-02-01T19:25:31Z)
- Inverting brain grey matter models with likelihood-free inference: a tool for trustable cytoarchitecture measurements [62.997667081978825]
Characterisation of the brain grey matter cytoarchitecture with quantitative sensitivity to soma density and volume remains an unsolved challenge in dMRI.
We propose a new forward model, specifically a new system of equations, requiring a few relatively sparse b-shells.
We then apply modern tools from Bayesian analysis known as likelihood-free inference (LFI) to invert our proposed model.
(2021-11-15T09:08:27Z)
- Microbiome subcommunity learning with logistic-tree normal latent Dirichlet allocation [3.960875974762257]
Mixed-membership (MM) models have been applied to microbiome compositional data to identify latent subcommunities of microbial species.
We present a new MM model that allows variation in the composition of each subcommunity around some "centroid" composition.
(2021-09-11T22:52:12Z)
- Robust Finite Mixture Regression for Heterogeneous Targets [70.19798470463378]
We propose an FMR model that finds sample clusters and jointly models multiple incomplete mixed-type targets simultaneously.
We provide non-asymptotic oracle performance bounds for our model under a high-dimensional learning framework.
The results show that our model can achieve state-of-the-art performance.
(2020-10-12T03:27:07Z)
- Model Fusion with Kullback--Leibler Divergence [58.20269014662046]
We propose a method to fuse posterior distributions learned from heterogeneous datasets.
Our algorithm relies on a mean field assumption for both the fused model and the individual dataset posteriors.
(2020-07-13T03:27:45Z)
- Blocked Clusterwise Regression [0.0]
We generalize previous approaches to discrete unobserved heterogeneity by allowing each unit to have multiple latent variables.
We contribute to the theory of clustering with an over-specified number of clusters and derive new convergence rates for this setting.
(2020-01-29T23:29:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.