Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics too!
- URL: http://arxiv.org/abs/2004.14914v2
- Date: Tue, 6 Oct 2020 19:23:46 GMT
- Title: Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics too!
- Authors: Suzanna Sia, Ayush Dalmia, Sabrina J. Mielke
- Abstract summary: We propose an alternative way to obtain topics: clustering pre-trained word embeddings while incorporating document information for weighted clustering and reranking top words.
The best performing combination for our approach performs as well as classical topic models, but with lower runtime and computational complexity.
- Score: 5.819224524813161
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Topic models are a useful analysis tool to uncover the underlying themes
within document collections. The dominant approach is to use probabilistic
topic models that posit a generative story, but in this paper we propose an
alternative way to obtain topics: clustering pre-trained word embeddings while
incorporating document information for weighted clustering and reranking top
words. We provide benchmarks for the combination of different word embeddings
and clustering algorithms, and analyse their performance under dimensionality
reduction with PCA. The best performing combination for our approach performs
as well as classical topic models, but with lower runtime and computational
complexity.
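As a rough illustration of the approach described in the abstract, here is a minimal sketch of weighted clustering and reranking over pretrained word embeddings using scikit-learn. The frequency-based sample weighting and the rerank score are plausible choices for exposition, not the paper's exact formulation:
```python
# Sketch: topics via weighted clustering of pretrained word embeddings.
# Assumes `embeddings` is a (V, d) array for a vocabulary list `vocab`, and
# `term_freq` is a (V,) array of corpus frequencies; details are assumptions.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def embedding_topics(embeddings, vocab, term_freq, k=20, n_components=50, top_n=10):
    # Optional dimensionality reduction before clustering (as analysed in the paper).
    reduced = PCA(n_components=n_components).fit_transform(embeddings)

    # Weighted k-means: frequent words pull centroids harder.
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    km.fit(reduced, sample_weight=term_freq)

    topics = []
    for c in range(k):
        idx = np.where(km.labels_ == c)[0]
        # Rerank candidate words: close to the centroid AND frequent in the corpus.
        dist = np.linalg.norm(reduced[idx] - km.cluster_centers_[c], axis=1)
        score = term_freq[idx] / (1.0 + dist)   # one plausible rerank score
        top = idx[np.argsort(-score)][:top_n]
        topics.append([vocab[i] for i in top])
    return topics
```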
Related papers
- CAST: Corpus-Aware Self-similarity Enhanced Topic modelling [16.562349140796115]
We introduce CAST: Corpus-Aware Self-similarity Enhanced Topic modelling, a novel topic modelling method.
We find self-similarity to be an effective metric to prevent functional words from acting as candidate topic words.
Our approach significantly enhances the coherence and diversity of generated topics, as well as the topic model's ability to handle noisy data.
arXiv Detail & Related papers (2024-10-19T15:27:11Z)
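A minimal sketch of the self-similarity metric CAST's summary describes: a word's self-similarity is taken here as the mean pairwise cosine similarity of its contextual embeddings across occurrences, and words below a threshold are dropped as functional. The threshold value and the filtering direction are assumptions, not CAST's implementation:
```python
# Sketch of a self-similarity filter for candidate topic words.
import numpy as np

def self_similarity(contextual_vecs):
    """contextual_vecs: (n_occurrences, d) embeddings of one word type."""
    X = contextual_vecs / np.linalg.norm(contextual_vecs, axis=1, keepdims=True)
    sims = X @ X.T
    n = len(X)
    # Mean of off-diagonal entries = average pairwise cosine similarity.
    return (sims.sum() - n) / (n * (n - 1))

def candidate_topic_words(word_to_vecs, threshold=0.5):
    # Keep words whose contextual embeddings are stable across contexts;
    # the 0.5 threshold is a hypothetical choice, not the paper's.
    return [w for w, vecs in word_to_vecs.items()
            if len(vecs) > 1 and self_similarity(np.asarray(vecs)) >= threshold]
```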
- Semantic-Driven Topic Modeling Using Transformer-Based Embeddings and Clustering Algorithms [6.349503549199403]
This study introduces an end-to-end semantic-driven topic modeling technique for topic extraction.
Our model generates document embeddings using pre-trained transformer-based language models.
Compared to ChatGPT and traditional topic modeling algorithms, our model provides more coherent and meaningful topics.
arXiv Detail & Related papers (2024-09-30T18:15:31Z)
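A generic sketch of the pipeline this summary outlines: embed documents with a pretrained transformer, cluster the embeddings, and label each cluster with its most distinctive words. The encoder name, the clustering algorithm, and the TF-IDF cluster labeling are all assumptions, not the paper's specific design:
```python
# Sketch: transformer document embeddings + clustering for topic extraction.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer

def semantic_topics(docs, k=10, top_n=10):
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice
    doc_vecs = encoder.encode(docs)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(doc_vecs)

    # Describe each cluster by the highest mean-TF-IDF terms of its documents.
    tfidf = TfidfVectorizer(stop_words="english")
    X = tfidf.fit_transform(docs)
    terms = np.array(tfidf.get_feature_names_out())
    topics = []
    for c in range(k):
        weights = np.asarray(X[labels == c].mean(axis=0)).ravel()
        topics.append(terms[np.argsort(-weights)][:top_n].tolist())
    return topics
```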
- Interactive Topic Models with Optimal Transport [75.26555710661908]
We present EdTM, an approach for label-name-supervised topic modeling.
EdTM casts topic modeling as an assignment problem while leveraging LM/LLM-based document-topic affinities.
arXiv Detail & Related papers (2024-06-28T13:57:27Z)
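One way to read "topic modeling as an assignment problem": given a document-topic affinity matrix (e.g., scored by an LM), Sinkhorn iterations yield a soft assignment plan. This is a generic optimal-transport sketch under uniform marginals, not EdTM's actual algorithm:
```python
# Sketch: soft document-topic assignment via Sinkhorn iterations.
import numpy as np

def sinkhorn_assign(affinity, eps=0.1, n_iters=100):
    """affinity: (n_docs, n_topics) scores; returns row-normalized plan."""
    K = np.exp(affinity / eps)                    # Gibbs kernel
    a = np.full(K.shape[0], 1.0 / K.shape[0])     # uniform mass over documents
    b = np.full(K.shape[1], 1.0 / K.shape[1])     # uniform mass over topics
    u = np.ones_like(a)
    for _ in range(n_iters):                      # alternating marginal scaling
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]
    return plan / plan.sum(axis=1, keepdims=True) # rows: doc-topic proportions
```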
- Improving Contextualized Topic Models with Negative Sampling [3.708656266586146]
We propose a negative sampling mechanism for a contextualized topic model to improve the quality of the generated topics.
In particular, during model training, we perturb the generated document-topic vector and use a triplet loss to encourage the document reconstructed from the correct document-topic vector to be similar to the input document, and the document reconstructed from the perturbed vector to be dissimilar.
arXiv Detail & Related papers (2023-03-27T07:28:46Z)
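A hedged PyTorch sketch of the triplet objective this summary describes; the perturbation used here (permuting the topic weights) and the margin are assumptions, not the paper's exact mechanism:
```python
# Sketch: negative sampling for a topic model via a triplet loss.
import torch
import torch.nn.functional as F

def negative_sampling_loss(decoder, doc_bow, theta, margin=1.0):
    """decoder: maps a document-topic vector to a reconstruction of the input.
    doc_bow: (batch, vocab) input documents; theta: (batch, k) topic vectors."""
    pos = decoder(theta)                           # reconstruction from correct vector
    perm = torch.randperm(theta.size(1), device=theta.device)
    neg = decoder(theta[:, perm])                  # reconstruction from perturbed vector
    # Pull the correct reconstruction toward the input, push the perturbed one away.
    return F.triplet_margin_loss(doc_bow, pos, neg, margin=margin)
```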
- Knowledge-Aware Bayesian Deep Topic Model [50.58975785318575]
We propose a Bayesian generative model for incorporating prior domain knowledge into hierarchical topic modeling.
Our proposed model efficiently integrates the prior knowledge and improves both hierarchical topic discovery and document representation.
arXiv Detail & Related papers (2022-09-20T09:16:05Z)
- Representing Mixtures of Word Embeddings with Mixtures of Topic Embeddings [46.324584649014284]
A topic model is often formulated as a generative model that explains how each word of a document is generated given a set of topics and document-specific topic proportions.
This paper introduces a new topic-modeling framework where each document is viewed as a set of word embedding vectors and each topic is modeled as an embedding vector in the same embedding space.
Embedding the words and topics in the same vector space, we define a method to measure the semantic difference between the embedding vectors of the words of a document and those of the topics, and optimize the topic embeddings to minimize the expected difference over all documents.
arXiv Detail & Related papers (2022-03-03T08:46:23Z)
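A toy sketch of measuring that difference when words and topics share one embedding space; averaging each word's distance to its closest topic embedding is one simple choice made here for illustration, whereas the paper optimizes an expected transport-style cost:
```python
# Sketch: semantic difference between a document's word embeddings
# and a set of topic embeddings living in the same space.
import numpy as np

def doc_topic_difference(word_vecs, topic_vecs):
    """word_vecs: (n_words, d); topic_vecs: (k, d)."""
    # Pairwise distances between every word and every topic embedding.
    dists = np.linalg.norm(word_vecs[:, None, :] - topic_vecs[None, :, :], axis=-1)
    # Average distance from each word to its closest topic (an assumption,
    # standing in for the paper's expected-difference objective).
    return dists.min(axis=1).mean()

# The topic embeddings would then be optimized (e.g., by gradient descent)
# to minimize this quantity in expectation over all documents.
```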
- TopicNet: Semantic Graph-Guided Topic Discovery [51.71374479354178]
Existing deep hierarchical topic models are able to extract semantically meaningful topics from a text corpus in an unsupervised manner.
We introduce TopicNet as a deep hierarchical topic model that can inject prior structural knowledge as an inductive bias to influence learning.
arXiv Detail & Related papers (2021-10-27T09:07:14Z)
- Author Clustering and Topic Estimation for Short Texts [69.54017251622211]
We propose a novel model that expands on the Latent Dirichlet Allocation by modeling strong dependence among the words in the same document.
We also simultaneously cluster users, removing the need for post-hoc cluster estimation.
Our method performs as well as, or better than, traditional approaches on problems arising in short texts.
arXiv Detail & Related papers (2021-06-15T20:55:55Z)
- Improving Neural Topic Models using Knowledge Distillation [84.66983329587073]
We use knowledge distillation to combine the best attributes of probabilistic topic models and pretrained transformers.
Our modular method can be straightforwardly applied with any neural topic model to improve topic quality.
arXiv Detail & Related papers (2020-10-05T22:49:16Z)
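A rough sketch of combining a topic model's reconstruction objective with a distillation term that pulls its word distribution toward a pretrained teacher's soft targets; the mixing weight, temperature, and teacher targets are assumptions, not the paper's exact modular objective:
```python
# Sketch: knowledge distillation into a neural topic model.
import torch
import torch.nn.functional as F

def distilled_topic_loss(student_logits, doc_bow, teacher_probs, alpha=0.5, T=2.0):
    """student_logits: (batch, vocab) logits from the topic model's decoder;
    doc_bow: (batch, vocab) input documents;
    teacher_probs: (batch, vocab) soft targets from a pretrained transformer."""
    # Standard bag-of-words reconstruction term of a neural topic model.
    recon = -(doc_bow * F.log_softmax(student_logits, dim=-1)).sum(-1).mean()
    # Temperature-scaled KL toward the teacher's distribution.
    distill = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                       teacher_probs, reduction="batchmean") * (T * T)
    return (1 - alpha) * recon + alpha * distill
```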
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.