CWTM: Leveraging Contextualized Word Embeddings from BERT for Neural
Topic Modeling
- URL: http://arxiv.org/abs/2305.09329v3
- Date: Wed, 6 Mar 2024 14:56:28 GMT
- Title: CWTM: Leveraging Contextualized Word Embeddings from BERT for Neural
Topic Modeling
- Authors: Zheng Fang, Yulan He and Rob Procter
- Abstract summary: We introduce a novel neural topic model called the Contextualized Word Topic Model (CWTM).
CWTM integrates contextualized word embeddings from BERT.
It is capable of learning the topic vector of a document without BOW information.
It can also derive the topic vectors for individual words within a document based on their contextualized word embeddings.
- Score: 23.323587005085564
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most existing topic models rely on bag-of-words (BOW) representation, which
limits their ability to capture word order information and leads to challenges
with out-of-vocabulary (OOV) words in new documents. Contextualized word
embeddings, however, show superiority in word sense disambiguation and
effectively address the OOV issue. In this work, we introduce a novel neural
topic model called the Contextualized Word Topic Model (CWTM), which integrates
contextualized word embeddings from BERT. The model is capable of learning the
topic vector of a document without BOW information. In addition, it can also
derive the topic vectors for individual words within a document based on their
contextualized word embeddings. Experiments across various datasets show that
CWTM generates more coherent and meaningful topics compared to existing topic
models, while also accommodating unseen words in newly encountered documents.
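To make the mechanism concrete, below is a minimal sketch (not the authors' released implementation) of how per-word and per-document topic vectors could be derived directly from BERT contextualized embeddings rather than a bag-of-words vector; the number of topics, the linear topic projection, and the mean-pooling step are illustrative assumptions.

```python
# Illustrative sketch only: derive word-level and document-level topic vectors
# from BERT contextualized embeddings, with no bag-of-words counts involved.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class ContextualWordTopicSketch(nn.Module):
    def __init__(self, num_topics: int = 50, bert_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_name)
        # Maps each contextualized token embedding to a distribution over topics
        # (the projection and topic count are assumptions for this sketch).
        self.to_topics = nn.Linear(self.bert.config.hidden_size, num_topics)

    def forward(self, input_ids, attention_mask):
        # Contextualized word embeddings: (batch, seq_len, hidden)
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        # Per-word topic vectors via softmax over the topic dimension.
        word_topics = torch.softmax(self.to_topics(hidden), dim=-1)
        # Document topic vector: mask-aware mean over its words, so unseen
        # (out-of-vocabulary) words are handled through their embeddings.
        mask = attention_mask.unsqueeze(-1).float()
        doc_topics = (word_topics * mask).sum(dim=1) / mask.sum(dim=1)
        return word_topics, doc_topics

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["Topic models describe how documents are generated."],
                  return_tensors="pt", padding=True)
model = ContextualWordTopicSketch()
word_topics, doc_topics = model(batch["input_ids"], batch["attention_mask"])
```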
Related papers
- TopicGPT: A Prompt-based Topic Modeling Framework [77.72072691307811]
We introduce TopicGPT, a prompt-based framework that uses large language models to uncover latent topics in a text collection.
It produces topics that align better with human categorizations compared to competing methods.
Its topics are also interpretable, dispensing with ambiguous bags of words in favor of topics with natural language labels and associated free-form descriptions.
arXiv Detail & Related papers (2023-11-02T17:57:10Z) - Knowledge-Aware Bayesian Deep Topic Model [50.58975785318575]
We propose a Bayesian generative model for incorporating prior domain knowledge into hierarchical topic modeling.
Our proposed model efficiently integrates the prior knowledge and improves both hierarchical topic discovery and document representation.
arXiv Detail & Related papers (2022-09-20T09:16:05Z) - Keyword Assisted Embedded Topic Model [1.9000421840914223]
Probabilistic topic models describe how words in documents are generated via a set of latent distributions called topics.
Recently, the Embedded Topic Model (ETM) has extended LDA to utilize the semantic information in word embeddings to derive semantically richer topics.
We propose the Keyword Assisted Embedded Topic Model (KeyETM), which equips ETM with the ability to incorporate user knowledge in the form of informative topic-level priors.
arXiv Detail & Related papers (2021-11-22T07:27:17Z) - Keyphrase Extraction Using Neighborhood Knowledge Based on Word
Embeddings [17.198907789163123]
We enhance the graph-based ranking model by leveraging word embeddings as background knowledge to add semantic information to the inter-word graph.
Our approach is evaluated on established benchmark datasets and empirical results show that the word embedding neighborhood information improves the model performance.
arXiv Detail & Related papers (2021-11-13T21:48:18Z) - Neural Attention-Aware Hierarchical Topic Model [25.721713066830404]
We propose a variational autoencoder (VAE) based neural topic model that jointly reconstructs the sentence and document word counts.
Our model also features hierarchical KL divergence to leverage embeddings of each document to regularize those of their sentences.
Both quantitative and qualitative experiments have shown the efficacy of our model in 1) lowering the reconstruction errors at both the sentence and document levels, and 2) discovering more coherent topics from real-world datasets.
arXiv Detail & Related papers (2021-10-14T05:42:32Z) - More Than Words: Collocation Tokenization for Latent Dirichlet
Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z) - Accurate Word Representations with Universal Visual Guidance [55.71425503859685]
This paper proposes a visual representation method to explicitly enhance conventional word embedding with multiple-aspect senses from visual guidance.
We build a small-scale word-image dictionary from a multimodal seed dataset where each word corresponds to diverse related images.
Experiments on 12 natural language understanding and machine translation tasks further verify the effectiveness and the generalization capability of the proposed approach.
arXiv Detail & Related papers (2020-12-30T09:11:50Z) - TAN-NTM: Topic Attention Networks for Neural Topic Modeling [8.631228373008478]
We propose a novel framework, TAN-NTM, which models a document as a sequence of tokens instead of BoW at the input layer.
We apply attention on LSTM outputs to empower the model to attend on relevant words which convey topic related cues.
TAN-NTM achieves state-of-the-art results, improving over existing SOTA topic models by 9-15 percentage points on the NPMI coherence metric.
arXiv Detail & Related papers (2020-12-02T20:58:04Z) - A Neural Generative Model for Joint Learning Topics and Topic-Specific
Word Embeddings [42.87769996249732]
We propose a novel generative model to explore both local and global context for joint learning topics and topic-specific word embeddings.
The trained model maps words to topic-dependent embeddings, which naturally addresses the issue of word polysemy.
arXiv Detail & Related papers (2020-08-11T13:54:11Z) - A Survey on Contextual Embeddings [48.04732268018772]
Contextual embeddings assign each word a representation based on its context, capturing uses of words across varied contexts and encoding knowledge that transfers across languages.
We review existing contextual embedding models, cross-lingual polyglot pre-training, the application of contextual embeddings in downstream tasks, model compression, and model analyses.
arXiv Detail & Related papers (2020-03-16T15:22:22Z) - Learning to Select Bi-Aspect Information for Document-Scale Text Content
Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text for describing another recordset.
The output is a summary that accurately describes the partial content in the source recordset, written in the same style as the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)