Topic Discovery via Latent Space Clustering of Pretrained Language Model
Representations
- URL: http://arxiv.org/abs/2202.04582v1
- Date: Wed, 9 Feb 2022 17:26:08 GMT
- Title: Topic Discovery via Latent Space Clustering of Pretrained Language Model
Representations
- Authors: Yu Meng, Yunyi Zhang, Jiaxin Huang, Yu Zhang, Jiawei Han
- Abstract summary: We propose a joint latent space learning and clustering framework built upon PLM embeddings.
Our model effectively leverages the strong representation power and superb linguistic features brought by PLMs for topic discovery.
- Score: 35.74225306947918
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Topic models have been the prominent tools for automatic topic discovery from
text corpora. Despite their effectiveness, topic models suffer from several
limitations including the inability to model word ordering information in
documents, the difficulty of incorporating external linguistic knowledge, and
the lack of both accurate and efficient inference methods for approximating the
intractable posterior. Recently, pretrained language models (PLMs) have brought
astonishing performance improvements to a wide variety of tasks due to their
superior representations of text. Interestingly, there have not been standard
approaches to deploy PLMs for topic discovery as better alternatives to topic
models. In this paper, we begin by analyzing the challenges of using PLM
representations for topic discovery, and then propose a joint latent space
learning and clustering framework built upon PLM embeddings. In the latent
space, topic-word and document-topic distributions are jointly modeled so that
the discovered topics can be interpreted by coherent and distinctive terms and
meanwhile serve as meaningful summaries of the documents. Our model effectively
leverages the strong representation power and superb linguistic features
brought by PLMs for topic discovery, and is conceptually simpler than topic
models. On two benchmark datasets in different domains, our model generates
significantly more coherent and diverse topics than strong topic models, and
offers better topic-wise document representations, based on both automatic and
human evaluations.
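As a rough illustration of the general idea of clustering embeddings in a lower-dimensional latent space, the sketch below projects vectors with PCA, normalizes them onto the unit sphere, and runs spherical k-means to obtain soft topic assignments. This is not the authors' model: PCA stands in for the learned latent space, spherical k-means stands in for the paper's joint clustering objective, and the random vectors stand in for real PLM embeddings. All function names and parameters (`pca_project`, `spherical_kmeans`, `soft_assignments`, `temperature`, `demo`) are hypothetical and chosen only for this example.

```python
import numpy as np


def pca_project(x, dim):
    """Project rows of x onto the top `dim` principal directions (PCA via SVD)."""
    centered = x - x.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dim].T


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def spherical_kmeans(z, k, n_iter=50, seed=0):
    """Cluster unit vectors by cosine similarity (assumption: a stand-in for
    the paper's clustering step, not its actual objective)."""
    rng = np.random.default_rng(seed)
    z = l2_normalize(z)
    centers = z[rng.choice(len(z), size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmax(z @ centers.T, axis=1)
        for j in range(k):
            members = z[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.sum(axis=0)
        centers = l2_normalize(centers)
    labels = np.argmax(z @ centers.T, axis=1)
    return labels, centers


def soft_assignments(z, centers, temperature=0.1):
    """Softmax over cosine similarities: one distribution over topics per row."""
    logits = l2_normalize(z) @ centers.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)


def demo(seed=0):
    """Run the sketch on synthetic data: 3 groups of 40 vectors in 64 dims."""
    rng = np.random.default_rng(seed)
    means = rng.normal(size=(3, 64)) * 3.0
    x = np.vstack([m + rng.normal(size=(40, 64)) for m in means])
    z = pca_project(x, dim=8)
    labels, centers = spherical_kmeans(z, k=3)
    probs = soft_assignments(z, centers)
    return labels, probs
```

Calling `demo()` returns a hard cluster label and a soft topic distribution for each synthetic vector. In a real setting the input rows would be word or document embeddings taken from a PLM, and the top-scoring words per cluster would serve as the topic's descriptive terms.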
Related papers
- Unified Generative and Discriminative Training for Multi-modal Large Language Models [88.84491005030316]
Generative training has enabled Vision-Language Models (VLMs) to tackle various complex tasks.
Discriminative training, exemplified by models like CLIP, excels in zero-shot image-text classification and retrieval.
This paper proposes a unified approach that integrates the strengths of both paradigms.
arXiv Detail & Related papers (2024-11-01T01:51:31Z)
- Enhancing Short-Text Topic Modeling with LLM-Driven Context Expansion and Prefix-Tuned VAEs [25.915607750636333]
We propose a novel approach that leverages large language models (LLMs) to extend short texts into more detailed sequences before applying topic modeling.
Our method significantly improves short-text topic modeling performance, as demonstrated by extensive experiments on real-world datasets with extreme data sparsity.
arXiv Detail & Related papers (2024-10-04T01:28:56Z)
- Interactive Topic Models with Optimal Transport [75.26555710661908]
We present EdTM, an approach for label-name-supervised topic modeling.
EdTM casts topic modeling as an assignment problem while leveraging LM/LLM-based document-topic affinities.
arXiv Detail & Related papers (2024-06-28T13:57:27Z)
- Enhanced Short Text Modeling: Leveraging Large Language Models for Topic Refinement [7.6115889231452964]
We introduce a novel approach termed "Topic Refinement".
This approach does not directly involve itself in the initial modeling of topics but focuses on improving topics after they have been mined.
By employing prompt engineering, we direct LLMs to eliminate off-topic words within a given topic, ensuring that only contextually relevant words are preserved or substituted with ones that fit better semantically.
arXiv Detail & Related papers (2024-03-26T13:50:34Z)
- Large Language Models Offer an Alternative to the Traditional Approach of Topic Modelling [0.9095496510579351]
We investigate the untapped potential of large language models (LLMs) as an alternative for uncovering the underlying topics within extensive text corpora.
Our findings indicate that LLMs with appropriate prompts can stand out as a viable alternative, capable of generating relevant topic titles and adhering to human guidelines to refine and merge topics.
arXiv Detail & Related papers (2024-03-24T17:39:51Z)
- Prompting Large Language Models for Topic Modeling [10.31712610860913]
We propose PromptTopic, a novel topic modeling approach that harnesses the advanced language understanding of large language models (LLMs).
It involves extracting topics at the sentence level from individual documents, then aggregating and condensing these topics into a predefined quantity, ultimately providing coherent topics for texts of varying lengths.
We benchmark PromptTopic against the state-of-the-art baselines on three vastly diverse datasets, establishing its proficiency in discovering meaningful topics.
arXiv Detail & Related papers (2023-12-15T11:15:05Z)
- Let the Pretrained Language Models "Imagine" for Short Texts Topic Modeling [29.87929724277381]
In short texts, co-occurrence information is minimal, which results in feature sparsity in document representation.
Existing topic models (probabilistic or neural) mostly fail to mine patterns from them to generate coherent topics.
We extend short text into longer sequences using existing pre-trained language models (PLMs).
arXiv Detail & Related papers (2023-10-24T00:23:30Z)
- Knowledge-Aware Bayesian Deep Topic Model [50.58975785318575]
We propose a Bayesian generative model for incorporating prior domain knowledge into hierarchical topic modeling.
Our proposed model efficiently integrates the prior knowledge and improves both hierarchical topic discovery and document representation.
arXiv Detail & Related papers (2022-09-20T09:16:05Z)
- Topic-Aware Multi-turn Dialogue Modeling [91.52820664879432]
This paper presents a novel solution for multi-turn dialogue modeling, which segments and extracts topic-aware utterances in an unsupervised way.
Our topic-aware modeling is implemented by a newly proposed unsupervised topic-aware segmentation algorithm and Topic-Aware Dual-attention Matching (TADAM) Network.
arXiv Detail & Related papers (2020-09-26T08:43:06Z)
- Topic Adaptation and Prototype Encoding for Few-Shot Visual Storytelling [81.33107307509718]
We propose a topic adaptive storyteller to model the ability of inter-topic generalization.
We also propose a prototype encoding structure to model the ability of intra-topic derivation.
Experimental results show that topic adaptation and prototype encoding structure mutually bring benefit to the few-shot model.
arXiv Detail & Related papers (2020-08-11T03:55:11Z)
- How Far are We from Effective Context Modeling? An Exploratory Study on Semantic Parsing in Context [59.13515950353125]
We present a grammar-based decoding semantic parser and adapt typical context modeling methods on top of it.
We evaluate 13 context modeling methods on two large cross-domain datasets, and our best model achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-02-03T11:28:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.