Keyword-based Topic Modeling and Keyword Selection
- URL: http://arxiv.org/abs/2001.07866v1
- Date: Wed, 22 Jan 2020 03:41:10 GMT
- Title: Keyword-based Topic Modeling and Keyword Selection
- Authors: Xingyu Wang, Lida Zhang, Diego Klabjan
- Abstract summary: We develop a keyword-based topic model that selects a subset of keywords to be used to collect future documents.
The model is trained using a variational lower bound and stochastic gradient optimization.
We compare the keyword topic model against a benchmark model using viral predictions of tweets combined with a topic model.
- Score: 21.686391911424355
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Certain types of documents, such as tweets, are collected by
specifying a set of keywords. As topics of interest change with time, it is
beneficial to adjust the keywords dynamically. The challenge is that the
keywords must be specified before the forthcoming documents and their
underlying topics are known. The future topics should mimic past topics of
interest, yet there should be some novelty in them.
We develop a keyword-based topic model that dynamically selects a subset of
keywords to be used to collect future documents. The generative process first
selects keywords and then the underlying documents based on the specified
keywords. The model is trained by using a variational lower bound and
stochastic gradient optimization. The inference consists of finding a subset
of keywords such that, given the subset, the model predicts the underlying
topic-word matrix for the unknown forthcoming documents. We compare the keyword topic
model against a benchmark model using viral predictions of tweets combined with
a topic model. The keyword-based topic model outperforms this sophisticated
baseline model by 67%.
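The inference described in the abstract, choosing a keyword subset so that the model best predicts the topic-word matrix of forthcoming documents, can be illustrated as a greedy subset search. The candidate keywords, the coverage-style score, and the greedy strategy below are illustrative assumptions for the sketch, not the paper's actual objective or algorithm.

```python
def greedy_keyword_subset(candidates, score, k):
    """Greedily grow a keyword subset: at each step add the candidate
    whose inclusion yields the highest-scoring subset."""
    chosen = set()
    for _ in range(k):
        remaining = [c for c in candidates if c not in chosen]
        if not remaining:
            break
        best = max(remaining, key=lambda c: score(chosen | {c}))
        chosen.add(best)
    return chosen

# Toy stand-in for the trained model's predictive quality: each keyword
# "covers" some past topics, and a subset scores by how many topics it
# covers. The real model scores subsets via its variational objective.
coverage = {"ml": {0, 1}, "nlp": {1, 2}, "vision": {3}}

def toy_score(subset):
    covered = set()
    for kw in subset:
        covered |= coverage[kw]
    return len(covered)

chosen = greedy_keyword_subset(["ml", "nlp", "vision"], toy_score, k=2)
```

With k=2 the greedy search picks "ml" first (ties broken by candidate order) and then "nlp", since adding it covers a new topic.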
Related papers
- Fine-Tuning Topics through Weighting Aspect Keywords [0.48342038441006807]
Topic modeling often requires examining topics from multiple perspectives to uncover hidden patterns.
This paper presents an approach that utilizes weighted keywords from various aspects derived from domain knowledge.
Findings show that top-scoring documents are more likely to be about the same aspect of a topic.
arXiv Detail & Related papers (2025-02-12T15:31:16Z)
- CAST: Corpus-Aware Self-similarity Enhanced Topic modelling [16.562349140796115]
We introduce CAST: Corpus-Aware Self-similarity Enhanced Topic modelling, a novel topic modelling method.
We find self-similarity to be an effective metric to prevent functional words from acting as candidate topic words.
Our approach significantly enhances the coherence and diversity of generated topics, as well as the topic model's ability to handle noisy data.
arXiv Detail & Related papers (2024-10-19T15:27:11Z)
- LIST: Learning to Index Spatio-Textual Data for Embedding based Spatial Keyword Queries [53.843367588870585]
Top-k kNN spatial keyword queries (TkQs) return a list of objects based on a ranking function that considers both spatial and textual relevance.
There are two key challenges in building an effective and efficient index, i.e., the absence of high-quality labels and the unbalanced results.
We develop a novel pseudo-label generation technique to address the two challenges.
arXiv Detail & Related papers (2024-03-12T05:32:33Z)
- Retrieval is Accurate Generation [99.24267226311157]
We introduce a novel method that selects context-aware phrases from a collection of supporting documents.
Our model achieves the best performance and the lowest latency among several retrieval-augmented baselines.
arXiv Detail & Related papers (2024-02-27T14:16:19Z)
- Revisiting Automated Topic Model Evaluation with Large Language Models [82.93251466435208]
We find that large language models appropriately assess the resulting topics.
We then investigate whether we can use large language models to automatically determine the optimal number of topics.
arXiv Detail & Related papers (2023-05-20T09:42:00Z)
- CWTM: Leveraging Contextualized Word Embeddings from BERT for Neural Topic Modeling [23.323587005085564]
We introduce a novel neural topic model called the Contextualized Word Topic Model (CWTM).
CWTM integrates contextualized word embeddings from BERT.
It is capable of learning the topic vector of a document without BOW information.
It can also derive the topic vectors for individual words within a document based on their contextualized word embeddings.
arXiv Detail & Related papers (2023-05-16T10:07:33Z)
- Query Expansion Using Contextual Clue Sampling with Language Models [69.51976926838232]
We propose a combination of an effective filtering strategy and fusion of the retrieved documents based on the generation probability of each context.
Our lexical matching based approach achieves a similar top-5/top-20 retrieval accuracy and higher top-100 accuracy compared with the well-established dense retrieval model DPR.
For end-to-end QA, the reader model also benefits from our method and achieves the highest Exact-Match score against several competitive baselines.
arXiv Detail & Related papers (2022-10-13T15:18:04Z)
- Knowledge-Aware Bayesian Deep Topic Model [50.58975785318575]
We propose a Bayesian generative model for incorporating prior domain knowledge into hierarchical topic modeling.
Our proposed model efficiently integrates the prior knowledge and improves both hierarchical topic discovery and document representation.
arXiv Detail & Related papers (2022-09-20T09:16:05Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
- Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics too! [5.819224524813161]
We propose an alternative way to obtain topics: clustering pre-trained word embeddings while incorporating document information for weighted clustering and reranking top words.
The best performing combination for our approach performs as well as classical topic models, but with lower runtime and computational complexity.
arXiv Detail & Related papers (2020-04-30T16:18:18Z)
- Keyword Assisted Topic Models [0.0]
We show that providing a small number of keywords can substantially enhance the measurement performance of topic models.
KeyATM provides more interpretable results, has better document classification performance, and is less sensitive to the number of topics than the standard topic models.
arXiv Detail & Related papers (2020-04-13T14:35:28Z)
- VSEC-LDA: Boosting Topic Modeling with Embedded Vocabulary Selection [20.921010767231923]
We propose a new approach to topic modeling, termed Vocabulary-Selection-Embedded Correspondence-LDA (VSEC-LDA).
VSEC-LDA learns the latent model while simultaneously selecting most relevant words.
The selection of words is driven by an entropy-based metric that measures the relative contribution of the words to the underlying model.
arXiv Detail & Related papers (2020-01-15T22:16:24Z)
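One plausible reading of the entropy-based metric in the VSEC-LDA summary above: score each word by the entropy of its distribution over topics, so that words spread evenly across topics (function-word behavior) score high, while words concentrated in a few topics score low and are kept as informative. The function below is a generic sketch of this idea, not the paper's exact metric.

```python
import math

def topic_entropy(word_topic_counts):
    """Shannon entropy (in bits) of a word's distribution over topics.
    High entropy: the word appears evenly across topics, like a function
    word. Low entropy: the word concentrates in few topics and carries
    a strong topical signal."""
    total = sum(word_topic_counts)
    probs = [c / total for c in word_topic_counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# A word uniform over 4 topics has entropy exactly 2.0 bits;
# a word concentrated in one topic has entropy near 0.
uniform = topic_entropy([25, 25, 25, 25])       # -> 2.0
concentrated = topic_entropy([97, 1, 1, 1])     # -> low entropy
```

Ranking the vocabulary by this score and keeping low-entropy words is one simple way to realize the kind of vocabulary selection the entry describes.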
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.