VSEC-LDA: Boosting Topic Modeling with Embedded Vocabulary Selection
- URL: http://arxiv.org/abs/2001.05578v1
- Date: Wed, 15 Jan 2020 22:16:24 GMT
- Title: VSEC-LDA: Boosting Topic Modeling with Embedded Vocabulary Selection
- Authors: Yuzhen Ding, Baoxin Li
- Abstract summary: We propose a new approach to topic modeling, termed Vocabulary-Selection-Embedded Correspondence-LDA (VSEC-LDA).
VSEC-LDA learns the latent model while simultaneously selecting the most relevant words.
The selection of words is driven by an entropy-based metric that measures the relative contribution of the words to the underlying model.
- Score: 20.921010767231923
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Topic modeling has found wide application in many problems where latent
structures of the data are crucial for typical inference tasks. When applying a
topic model, a relatively standard pre-processing step is to first build a
vocabulary of frequent words. Such a general pre-processing step is often
independent of the topic modeling stage, and thus there is no guarantee that
the pre-generated vocabulary can support the inference of some optimal (or even
meaningful) topic models appropriate for a given task, especially for computer
vision applications involving "visual words". In this paper, we propose a new
approach to topic modeling, termed Vocabulary-Selection-Embedded
Correspondence-LDA (VSEC-LDA), which learns the latent model while
simultaneously selecting the most relevant words. The selection of words is driven
by an entropy-based metric that measures the relative contribution of the words
to the underlying model, and is done dynamically while the model is learned. We
present three variants of VSEC-LDA and evaluate the proposed approach with
experiments on both synthetic and real databases from different applications.
The results demonstrate the effectiveness of built-in vocabulary selection and
its importance in improving the performance of topic modeling.
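The abstract does not give the exact form of the entropy-based relevance metric, so the following is only a minimal sketch of the general idea, under assumed names (`phi` for the topic-word matrix, `keep_ratio` for the pruning threshold): words whose topic-conditional distribution is nearly uniform contribute little to discriminating topics and can be pruned.

```python
# Minimal sketch only -- not the paper's exact formulation.
import numpy as np


def word_relevance_scores(phi: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """phi: (K topics x V words) topic-word probability matrix.

    Returns one score per word; higher means the word's probability mass is
    concentrated in a few topics (low entropy across topics).
    """
    # Normalize each word's column into a distribution over topics.
    p_topic_given_word = phi / (phi.sum(axis=0, keepdims=True) + eps)
    entropy = -(p_topic_given_word * np.log(p_topic_given_word + eps)).sum(axis=0)
    max_entropy = np.log(phi.shape[0])  # entropy of a uniform distribution over K topics
    return 1.0 - entropy / max_entropy  # in [0, 1]; higher = more topic-discriminative


def select_vocabulary(phi: np.ndarray, keep_ratio: float = 0.8) -> np.ndarray:
    """Return indices of the words to keep, ranked by relevance score."""
    scores = word_relevance_scores(phi)
    n_keep = max(1, int(keep_ratio * phi.shape[1]))
    return np.argsort(scores)[::-1][:n_keep]


# Toy example: 10 topics over 500 (e.g. visual) words drawn from a Dirichlet.
rng = np.random.default_rng(0)
phi = rng.dirichlet(np.full(500, 0.1), size=10)
kept = select_vocabulary(phi, keep_ratio=0.8)
print(f"kept {kept.size} of {phi.shape[1]} words")
```

Per the abstract, the actual selection is done dynamically while the model is learned (the vocabulary is re-evaluated as the topic-word distributions are updated), rather than in a single pass as shown here.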
Related papers
- Enhancing Short-Text Topic Modeling with LLM-Driven Context Expansion and Prefix-Tuned VAEs [25.915607750636333]
We propose a novel approach that leverages large language models (LLMs) to extend short texts into more detailed sequences before applying topic modeling.
Our method significantly improves short-text topic modeling performance, as demonstrated by extensive experiments on real-world datasets with extreme data sparsity.
arXiv Detail & Related papers (2024-10-04T01:28:56Z)
- Iterative Improvement of an Additively Regularized Topic Model [0.0]
We present a method for iterative training of a topic model.
Experiments conducted on several collections of natural language texts show that the proposed ITAR model performs better than other popular topic models.
arXiv Detail & Related papers (2024-08-11T18:22:12Z)
- A Large-Scale Evaluation of Speech Foundation Models [110.95827399522204]
We establish the Speech processing Universal PERformance Benchmark (SUPERB) to study the effectiveness of the foundation model paradigm for speech.
We propose a unified multi-tasking framework to address speech processing tasks in SUPERB using a frozen foundation model followed by task-specialized, lightweight prediction heads.
arXiv Detail & Related papers (2024-04-15T00:03:16Z)
- GINopic: Topic Modeling with Graph Isomorphism Network [0.8962460460173959]
We introduce GINopic, a topic modeling framework based on graph isomorphism networks to capture the correlation between words.
We demonstrate the effectiveness of GINopic compared to existing topic models and highlight its potential for advancing topic modeling.
arXiv Detail & Related papers (2024-04-02T17:18:48Z)
- Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model [57.78191634042409]
We propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process.
Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.
arXiv Detail & Related papers (2024-02-08T16:55:21Z)
- Visualizing the Relationship Between Encoded Linguistic Information and Task Performance [53.223789395577796]
We study the dynamic relationship between the encoded linguistic information and task performance from the viewpoint of Pareto Optimality.
We conduct experiments on two popular NLP tasks, i.e., machine translation and language modeling, and investigate the relationship between several kinds of linguistic information and task performances.
Our empirical findings suggest that some syntactic information is helpful for NLP tasks whereas encoding more syntactic information does not necessarily lead to better performance.
arXiv Detail & Related papers (2022-03-29T19:03:10Z)
- Topic Discovery via Latent Space Clustering of Pretrained Language Model Representations [35.74225306947918]
We propose a joint latent space learning and clustering framework built upon PLM embeddings.
Our model effectively leverages the strong representation power and superb linguistic features brought by PLMs for topic discovery.
arXiv Detail & Related papers (2022-02-09T17:26:08Z)
- Evaluation of Audio-Visual Alignments in Visually Grounded Speech Models [2.1320960069210484]
This work studies multimodal learning in the context of visually grounded speech (VGS) models.
We introduce systematic metrics for evaluating model performance in aligning visual objects and spoken words.
We show that cross-modal attention helps the model to achieve higher semantic cross-modal retrieval performance.
arXiv Detail & Related papers (2021-07-05T12:54:05Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
- Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models [65.19308052012858]
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research.
We present VALUE, a set of meticulously designed probing tasks to decipher the inner workings of multimodal pre-training.
Key observation: pre-trained models exhibit a propensity for attending over text rather than images during inference.
arXiv Detail & Related papers (2020-05-15T01:06:54Z)
- How Far are We from Effective Context Modeling? An Exploratory Study on Semantic Parsing in Context [59.13515950353125]
We present a grammar-based decoding semantic parser and adapt typical context modeling methods on top of it.
We evaluate 13 context modeling methods on two large cross-domain datasets, and our best model achieves state-of-the-art performances.
arXiv Detail & Related papers (2020-02-03T11:28:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.