Multivariate Gaussian Topic Modelling: A novel approach to discover topics with greater semantic coherence
- URL: http://arxiv.org/abs/2503.15036v1
- Date: Wed, 19 Mar 2025 09:25:54 GMT
- Title: Multivariate Gaussian Topic Modelling: A novel approach to discover topics with greater semantic coherence
- Authors: Satyajeet Sahoo, Jhareswar Maiti, Virendra Kumar Tewari
- Abstract summary: We propose a novel Multivariate Gaussian Topic modelling (MGD) approach. The approach is first applied on a synthetic dataset to demonstrate the interpretability benefits vis-à-vis LDA. This model achieves a higher mean topic coherence of 0.436 vis-à-vis 0.294 for LDA.
- Score: 3.360457684855856
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: An important aspect of text mining involves information retrieval in the form of discovery of semantic themes (topics) from documents using topic modelling. While generative topic models like Latent Dirichlet Allocation (LDA) elegantly model topics as probability distributions and are useful in identifying latent topics from large document corpora with minimal supervision, they suffer from poor topic interpretability and reduced performance on shorter texts. Here we propose a novel Multivariate Gaussian Topic modelling (MGD) approach. In this approach, topics are represented as multivariate Gaussian distributions and documents as Gaussian mixture models. Using the EM algorithm, the constituent multivariate Gaussian distributions and their corresponding parameters are identified. Analysis of the parameters identifies the keywords with the highest variance and mean contributions to each topic, and from these keywords topic annotations are carried out. The approach is first applied on a synthetic dataset to demonstrate its interpretability benefits vis-à-vis LDA. A real-world application of this topic model is demonstrated in the analysis of risks and hazards at a petrochemical plant, by applying the model to safety incident reports to identify the major latent hazards plaguing the plant. This model achieves a higher mean topic coherence of 0.436 vis-à-vis 0.294 for LDA.
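For intuition, the sketch below shows how the modelling step described in the abstract could look in practice: word vectors are treated as draws from a mixture of multivariate Gaussians, EM (here via scikit-learn's GaussianMixture) recovers the component means and covariances, and candidate keywords are read off per component. The embedding feature space, the responsibility-based keyword ranking, and every name in the code are illustrative assumptions, not the authors' implementation; the paper itself ranks keywords by their variance and mean contributions to each topic.

```python
# A minimal sketch of the core idea in the abstract: fit multivariate
# Gaussian "topics" with EM and read keywords off the estimated parameters.
# This is NOT the authors' implementation. Assumptions: words live in a
# pre-computed embedding space, and keywords are ranked by component
# responsibility (the paper instead analyses variance and mean contributions).
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gaussian_topics(word_vectors, vocab, n_topics=5, top_k=10, seed=0):
    """Fit n_topics multivariate Gaussians to word vectors via EM and
    return the fitted model plus the top-k keywords per topic."""
    gmm = GaussianMixture(
        n_components=n_topics,
        covariance_type="full",  # full covariances = general multivariate Gaussians
        random_state=seed,
    )
    gmm.fit(word_vectors)                   # EM estimates the means and covariances
    resp = gmm.predict_proba(word_vectors)  # word-to-topic responsibilities
    topics = [
        [vocab[i] for i in np.argsort(resp[:, k])[::-1][:top_k]]
        for k in range(n_topics)
    ]
    return gmm, topics

# Toy usage: 1000 "words" with 50-dimensional embeddings.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 50))
vocab = [f"word_{i}" for i in range(1000)]
gmm, topics = fit_gaussian_topics(embeddings, vocab)
for k, words in enumerate(topics):
    print(f"topic {k}:", ", ".join(words))
```

In the paper's pipeline, the estimated component means and covariances would then be analysed to annotate each topic with human-readable labels.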
Related papers
- Reliability of Topic Modeling [0.3759936323189418]
We show that the standard practice for quantifying topic model reliability fails to capture essential aspects of the variation in two widely-used topic models.
On synthetic and real-world data, we show that McDonald's $\omega$ provides the best encapsulation of reliability.
arXiv Detail & Related papers (2024-10-30T16:42:04Z) - Investigating the Impact of Text Summarization on Topic Modeling [13.581341206178525]
In this paper, an approach is proposed that further enhances topic modeling performance by utilizing a pre-trained large language model (LLM).
Few-shot prompting is used to generate summaries of different lengths to compare their impact on topic modeling.
The proposed method yields better topic diversity and comparable coherence values compared to previous models.
arXiv Detail & Related papers (2024-09-28T19:45:45Z) - SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models [85.67096251281191]
We present an innovative approach to model fusion called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction.
SMILE allows for the upscaling of source models into an MoE model without extra data or further training.
We conduct extensive experiments across diverse scenarios, such as image classification and text generation tasks, using full fine-tuning and LoRA fine-tuning.
arXiv Detail & Related papers (2024-08-19T17:32:15Z) - Heterogeneous Multi-Task Gaussian Cox Processes [61.67344039414193]
We present a novel extension of multi-task Gaussian Cox processes for modeling heterogeneous correlated tasks jointly.
A multi-output Gaussian process (MOGP) prior over the parameters of the dedicated likelihoods for classification, regression and point process tasks can facilitate the sharing of information between heterogeneous tasks.
We derive a mean-field approximation to realize closed-form iterative updates for estimating model parameters.
arXiv Detail & Related papers (2023-08-29T15:01:01Z) - A Data-driven Latent Semantic Analysis for Automatic Text Summarization
using LDA Topic Modelling [0.0]
This study presents the Latent Dirichlet Allocation (LDA) approach to topic modelling.
The visualisation provides an overarching view of the main topics while attributing deeper meaning to the prevalence of individual topics.
The results suggest ranking terms purely by their probability of topic prevalence within the processed document.
arXiv Detail & Related papers (2022-07-23T11:04:03Z) - ER: Equivariance Regularizer for Knowledge Graph Completion [107.51609402963072]
We propose a new regularizer, namely, Equivariance Regularizer (ER)
ER can enhance the generalization ability of the model by employing the semantic equivariance between the head and tail entities.
The experimental results indicate a clear and substantial improvement over the state-of-the-art relation prediction methods.
arXiv Detail & Related papers (2022-06-24T08:18:05Z) - Topic Analysis for Text with Side Data [18.939336393665553]
We introduce a hybrid generative probabilistic model that combines a neural network with a latent topic model.
In the model, each document is modeled as a finite mixture over an underlying set of topics.
Each topic is modeled as an infinite mixture over an underlying set of topic probabilities.
arXiv Detail & Related papers (2022-03-01T22:06:30Z) - Topic Discovery via Latent Space Clustering of Pretrained Language Model
Representations [35.74225306947918]
We propose a joint latent space learning and clustering framework built upon PLM embeddings.
Our model effectively leverages the strong representation power and superb linguistic features brought by PLMs for topic discovery.
arXiv Detail & Related papers (2022-02-09T17:26:08Z) - Understanding Overparameterization in Generative Adversarial Networks [56.57403335510056]
Generative Adversarial Networks (GANs) are trained by solving non-concave min-max optimization problems.
Theory has shown the importance of gradient descent-ascent (GDA) in reaching globally optimal solutions.
We show that in an overparameterized GAN with a one-layer neural network generator and a linear discriminator, GDA converges to a global saddle point of the underlying non-concave min-max problem.
arXiv Detail & Related papers (2021-04-12T16:23:37Z) - Deep Autoencoding Topic Model with Scalable Hybrid Bayesian Inference [55.35176938713946]
We develop deep autoencoding topic model (DATM) that uses a hierarchy of gamma distributions to construct its multi-stochastic-layer generative network.
We propose a Weibull upward-downward variational encoder that deterministically propagates information upward via a deep neural network, followed by a downward generative model.
The efficacy and scalability of our models are demonstrated on both unsupervised and supervised learning tasks on big corpora.
arXiv Detail & Related papers (2020-06-15T22:22:56Z) - Bayesian Sparse Factor Analysis with Kernelized Observations [67.60224656603823]
Multi-view problems can be addressed with latent variable models.
High-dimensional and non-linear issues are traditionally handled by kernel methods.
We propose merging both approaches into a single model.
arXiv Detail & Related papers (2020-06-01T14:25:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.