CWTM: Leveraging Contextualized Word Embeddings from BERT for Neural
Topic Modeling
- URL: http://arxiv.org/abs/2305.09329v3
- Date: Wed, 6 Mar 2024 14:56:28 GMT
- Title: CWTM: Leveraging Contextualized Word Embeddings from BERT for Neural
Topic Modeling
- Authors: Zheng Fang, Yulan He and Rob Procter
- Abstract summary: We introduce a novel neural topic model called the Contextualized Word Topic Model (CWTM).
CWTM integrates contextualized word embeddings from BERT.
It is capable of learning the topic vector of a document without BOW information.
It can also derive the topic vectors for individual words within a document based on their contextualized word embeddings.
- Score: 23.323587005085564
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most existing topic models rely on bag-of-words (BOW) representation, which
limits their ability to capture word order information and leads to challenges
with out-of-vocabulary (OOV) words in new documents. Contextualized word
embeddings, however, show superiority in word sense disambiguation and
effectively address the OOV issue. In this work, we introduce a novel neural
topic model called the Contextualized Word Topic Model (CWTM), which integrates
contextualized word embeddings from BERT. The model is capable of learning the
topic vector of a document without BOW information. In addition, it can also
derive the topic vectors for individual words within a document based on their
contextualized word embeddings. Experiments across various datasets show that
CWTM generates more coherent and meaningful topics compared to existing topic
models, while also accommodating unseen words in newly encountered documents.
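To make the mechanism concrete, below is a minimal sketch (not the authors' released implementation) of how per-word and per-document topic vectors could be derived directly from BERT contextualized embeddings rather than a bag-of-words vector; the number of topics, the linear topic projection, and the mean-pooling step are illustrative assumptions.

```python
# Illustrative sketch only: derive word-level and document-level topic vectors
# from BERT contextualized embeddings, with no bag-of-words counts involved.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class ContextualWordTopicSketch(nn.Module):
    def __init__(self, num_topics: int = 50, bert_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(bert_name)
        # Maps each contextualized token embedding to a distribution over topics
        # (the projection and topic count are assumptions for this sketch).
        self.to_topics = nn.Linear(self.bert.config.hidden_size, num_topics)

    def forward(self, input_ids, attention_mask):
        # Contextualized word embeddings: (batch, seq_len, hidden)
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        # Per-word topic vectors via softmax over the topic dimension.
        word_topics = torch.softmax(self.to_topics(hidden), dim=-1)
        # Document topic vector: mask-aware mean over its words, so unseen
        # (out-of-vocabulary) words are handled through their embeddings.
        mask = attention_mask.unsqueeze(-1).float()
        doc_topics = (word_topics * mask).sum(dim=1) / mask.sum(dim=1)
        return word_topics, doc_topics

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["Topic models describe how documents are generated."],
                  return_tensors="pt", padding=True)
model = ContextualWordTopicSketch()
word_topics, doc_topics = model(batch["input_ids"], batch["attention_mask"])
```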
Related papers
- TopicGPT: A Prompt-based Topic Modeling Framework [77.72072691307811]
We introduce TopicGPT, a prompt-based framework that uses large language models to uncover latent topics in a text collection.
It produces topics that align better with human categorizations compared to competing methods.
Its topics are also interpretable, dispensing with ambiguous bags of words in favor of topics with natural language labels and associated free-form descriptions.
arXiv Detail & Related papers (2023-11-02T17:57:10Z) - Knowledge-Aware Bayesian Deep Topic Model [50.58975785318575]
We propose a Bayesian generative model for incorporating prior domain knowledge into hierarchical topic modeling.
Our proposed model efficiently integrates the prior knowledge and improves both hierarchical topic discovery and document representation.
arXiv Detail & Related papers (2022-09-20T09:16:05Z) - Keyword Assisted Embedded Topic Model [1.9000421840914223]
Probabilistic topic models describe how words in documents are generated via a set of latent distributions called topics.
Recently, the Embedded Topic Model (ETM) has extended LDA to utilize the semantic information in word embeddings to derive semantically richer topics.
We propose the Keyword Assisted Embedded Topic Model (KeyETM), which equips ETM with the ability to incorporate user knowledge in the form of informative topic-level priors.
arXiv Detail & Related papers (2021-11-22T07:27:17Z) - Keyphrase Extraction Using Neighborhood Knowledge Based on Word
Embeddings [17.198907789163123]
We enhance the graph-based ranking model by leveraging word embeddings as background knowledge to add semantic information to the inter-word graph.
Our approach is evaluated on established benchmark datasets and empirical results show that the word embedding neighborhood information improves the model performance.
arXiv Detail & Related papers (2021-11-13T21:48:18Z) - Neural Attention-Aware Hierarchical Topic Model [25.721713066830404]
We propose a variational autoencoder (VAE) based neural topic model that jointly reconstructs the sentence and document word counts.
Our model also features hierarchical KL divergence to leverage embeddings of each document to regularize those of their sentences.
Both quantitative and qualitative experiments have shown the efficacy of our model in 1) lowering the reconstruction errors at both the sentence and document levels, and 2) discovering more coherent topics from real-world datasets.
arXiv Detail & Related papers (2021-10-14T05:42:32Z) - More Than Words: Collocation Tokenization for Latent Dirichlet
Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z) - Accurate Word Representations with Universal Visual Guidance [55.71425503859685]
This paper proposes a visual representation method to explicitly enhance conventional word embedding with multiple-aspect senses from visual guidance.
We build a small-scale word-image dictionary from a multimodal seed dataset where each word corresponds to diverse related images.
Experiments on 12 natural language understanding and machine translation tasks further verify the effectiveness and the generalization capability of the proposed approach.
arXiv Detail & Related papers (2020-12-30T09:11:50Z) - TAN-NTM: Topic Attention Networks for Neural Topic Modeling [8.631228373008478]
We propose a novel framework, TAN-NTM, which models a document as a sequence of tokens instead of BoW at the input layer.
We apply attention on LSTM outputs to empower the model to attend on relevant words which convey topic related cues.
TAN-NTM achieves state-of-the-art results, improving over existing SOTA topic models by 9-15 percentage points on the NPMI coherence metric.
arXiv Detail & Related papers (2020-12-02T20:58:04Z) - A Neural Generative Model for Joint Learning Topics and Topic-Specific
Word Embeddings [42.87769996249732]
We propose a novel generative model to explore both local and global context for joint learning topics and topic-specific word embeddings.
The trained model maps words to topic-dependent embeddings, which naturally addresses the issue of word polysemy.
arXiv Detail & Related papers (2020-08-11T13:54:11Z) - A Survey on Contextual Embeddings [48.04732268018772]
Contextual embeddings assign each word a representation based on its context, capturing uses of words across varied contexts and encoding knowledge that transfers across languages.
We review existing contextual embedding models, cross-lingual polyglot pre-training, the application of contextual embeddings in downstream tasks, model compression, and model analyses.
arXiv Detail & Related papers (2020-03-16T15:22:22Z) - Learning to Select Bi-Aspect Information for Document-Scale Text Content
Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text for describing another recordset.
The output is a summary that accurately describes the partial content in the source recordset, written in the same style as the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)