ClusTop: An unsupervised and integrated text clustering and topic
extraction framework
- URL: http://arxiv.org/abs/2301.00818v1
- Date: Tue, 3 Jan 2023 03:26:26 GMT
- Title: ClusTop: An unsupervised and integrated text clustering and topic
extraction framework
- Authors: Zhongtao Chen, Chenghu Mi, Siwei Duo, Jingfei He, Yatong Zhou
- Abstract summary: We propose an unsupervised text clustering and topic extraction framework (ClusTop)
Our framework includes four components: enhanced language model training, dimensionality reduction, clustering and topic extraction.
Experiments on two datasets demonstrate the effectiveness of our framework.
- Score: 3.3073775218038883
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text clustering and topic extraction are two important tasks in text mining.
Usually, these two tasks are performed separately. For topic extraction to
facilitate clustering, we can first project texts into a topic space and then
perform a clustering algorithm to obtain clusters. To promote topic extraction
by clustering, we can first obtain clusters with a clustering algorithm and
then extract cluster-specific topics. However, this naive strategy ignores the
fact that text clustering and topic extraction are strongly correlated and
follow a chicken-and-egg relationship. Performing them separately fails to make
them mutually benefit each other to achieve the best overall performance. In
this paper, we propose an unsupervised text clustering and topic extraction
framework (ClusTop) which integrates text clustering and topic extraction into
a unified framework and can achieve high-quality clustering results and extract
topics from each cluster simultaneously. Our framework includes four
components: enhanced language model training, dimensionality reduction,
clustering and topic extraction, where the enhanced language model can be
viewed as a bridge between clustering and topic extraction. On the one hand, it
provides text embeddings with a strong cluster structure, which facilitates
effective text clustering; on the other hand, its self-attention architecture
makes it attend closely to topic-related words for topic extraction. Moreover,
the training of the enhanced language model is unsupervised. Experiments on
two datasets demonstrate the effectiveness of our
framework and provide benchmarks for different model combinations in this
framework.
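
For concreteness, here is a minimal sketch of the four-component pipeline shape. The concrete models are assumptions on our part (the paper benchmarks several combinations): TF-IDF vectors stand in for the enhanced language model, truncated SVD for dimensionality reduction, k-means for clustering, and top per-cluster TF-IDF terms for topic extraction.

```python
# Minimal ClusTop-style pipeline sketch. The concrete models are
# assumptions (the paper benchmarks several combinations); scikit-learn
# components stand in for each of the four stages.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

texts = [
    "stock markets rallied after the earnings report",
    "the central bank raised interest rates again",
    "the team won the championship final",
    "injury forces star striker to miss the season",
]

# 1) Text representation (stand-in for the enhanced language model).
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(texts)

# 2) Dimensionality reduction.
svd = TruncatedSVD(n_components=2, random_state=0)
X_low = svd.fit_transform(X)

# 3) Clustering in the reduced space.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X_low)

# 4) Topic extraction: top-weighted terms per cluster.
terms = np.array(vectorizer.get_feature_names_out())
for c in sorted(set(labels)):
    centroid = np.asarray(X[labels == c].mean(axis=0)).ravel()
    top = terms[np.argsort(centroid)[::-1][:3]]
    print(f"cluster {c}: topics = {list(top)}")
```

Because each stage only exposes an array-in, array-out interface, any component can be swapped (e.g., a transformer encoder for step 1, UMAP for step 2, a density-based clusterer for step 3), which is what benchmarking different model combinations in the framework amounts to.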
Related papers
- Contrastive Learning Subspace for Text Clustering [4.065026352441705]
We propose a novel text clustering approach called Subspace Contrastive Learning (SCL).
The proposed SCL consists of two main modules: (1) a self-expressive module that constructs virtual positive samples and (2) a contrastive learning module that further learns a discriminative subspace to capture task-specific cluster-wise relationships among texts.
Experimental results show that the proposed SCL method not only achieves superior results on multiple text clustering datasets but also has lower complexity in positive sample construction.
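For intuition only, a toy version of such an objective might pair each embedding with a "virtual positive" reconstructed from the other samples and contrast the pair in a learned subspace; the reconstruction rule, dimensions, and temperature below are all assumptions, not the authors' implementation.

```python
# Hedged sketch of an SCL-style objective: virtual positives from a
# self-expressive reconstruction, contrasted in a learned subspace.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, D, K = 8, 32, 16              # batch, embedding dim, subspace dim
Z = torch.randn(B, D)            # text embeddings from some encoder
proj = torch.nn.Linear(D, K)     # learned projection into the subspace

# Self-expressive step: reconstruct each sample from the *other* samples,
# yielding a virtual positive (coefficients here are illustrative, not learned).
C = torch.softmax(Z @ Z.T / D ** 0.5, dim=1)
C = C * (1 - torch.eye(B))       # mask self-reconstruction
C = C / C.sum(dim=1, keepdim=True)
Z_pos = C @ Z                    # virtual positive samples

# Contrastive step: InfoNCE between each sample and its virtual positive.
a = F.normalize(proj(Z), dim=1)
p = F.normalize(proj(Z_pos), dim=1)
logits = a @ p.T / 0.1           # temperature 0.1 (assumed)
loss = F.cross_entropy(logits, torch.arange(B))
print(f"SCL-style loss: {loss.item():.4f}")
```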
arXiv Detail & Related papers (2024-08-26T09:08:26Z)
- JADS: A Framework for Self-supervised Joint Aspect Discovery and Summarization [3.992091862806936]
Our solution integrates topic discovery and summarization into a single step.
Given text data, our Joint Aspect Discovery and Summarization algorithm (JADS) discovers aspects from the input.
Our proposed method achieves higher semantic alignment with the ground truth and is more factual.
arXiv Detail & Related papers (2024-05-28T23:01:57Z)
- Context-Aware Clustering using Large Language Models [20.971691166166547]
This paper introduces a novel approach to clustering entity subsets using Large Language Models (LLMs) by capturing context via a scalable inter-entity attention mechanism.
We propose CACTUS (Context-Aware ClusTering with aUgmented triplet losS) for efficient and effective supervised clustering of entity subsets.
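The augmented triplet loss itself is not detailed in this summary; as a hypothetical baseline, a plain triplet margin loss over entity embeddings, with positives drawn from the same ground-truth cluster (the setting is supervised), would look like:

```python
# Generic triplet-loss sketch for supervised entity clustering
# (illustrative baseline; not the CACTUS augmentation or attention).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
emb = torch.randn(6, 16)                  # entity embeddings from some encoder
labels = torch.tensor([0, 0, 0, 1, 1, 1])  # ground-truth clusters (supervised)

anchor = emb[labels == 0][0].unsqueeze(0)   # anchor and positive share a cluster
positive = emb[labels == 0][1].unsqueeze(0)
negative = emb[labels == 1][0].unsqueeze(0)  # negative from a different cluster

loss = F.triplet_margin_loss(anchor, positive, negative, margin=1.0)
print(f"triplet loss: {loss.item():.4f}")
```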
arXiv Detail & Related papers (2024-05-02T03:50:31Z)
- Reinforcement Graph Clustering with Unknown Cluster Number [91.4861135742095]
We propose a new deep graph clustering method termed Reinforcement Graph Clustering.
In our proposed method, cluster number determination and unsupervised representation learning are unified into a single framework.
To provide feedback for these actions, a clustering-oriented reward function is proposed that enhances cohesion within clusters and separation between clusters.
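For intuition, a reward with these two aims can be mocked up as mean within-cluster similarity minus mean between-cluster similarity; the exact formula below is our assumption, not the paper's definition.

```python
# Toy clustering reward: within-cluster cohesion minus between-cluster
# similarity (an assumed stand-in for the paper's reward function).
import numpy as np

def clustering_reward(X: np.ndarray, labels: np.ndarray) -> float:
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # cosine geometry
    S = X @ X.T                                        # pairwise similarities
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    cohesion = S[same & off_diag].mean()   # same cluster, excluding self-pairs
    separation = S[~same].mean()           # pairs in different clusters
    return float(cohesion - separation)

X = np.random.default_rng(0).normal(size=(6, 8))
labels = np.array([0, 0, 0, 1, 1, 1])
print(f"reward: {clustering_reward(X, labels):.4f}")
```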
arXiv Detail & Related papers (2023-08-13T18:12:28Z)
- CEIL: A General Classification-Enhanced Iterative Learning Framework for Text Clustering [16.08402937918212]
We propose a novel Classification-Enhanced Iterative Learning framework for short text clustering.
In each iteration, we first adopt a language model to retrieve the initial text representations.
After strict data filtering and aggregation processes, samples with clean category labels are retrieved, which serve as supervision information.
Finally, the updated language model with improved representation ability is used to enhance clustering in the next iteration.
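Schematically, one CEIL iteration can be sketched as the loop below; the encoder, clusterer, classifier, and confidence threshold are all placeholders, since the summary does not fix them.

```python
# Skeleton of a CEIL-style iterate-classify-refine loop (placeholders
# throughout; the actual models and filtering rule are not specified here).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def encode(texts):
    # Placeholder for "adopt a language model to retrieve representations".
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 32))

texts = [f"document {i}" for i in range(100)]
X = encode(texts)
for it in range(3):
    labels = KMeans(n_clusters=4, n_init=10, random_state=it).fit_predict(X)
    # Data filtering: keep only samples the classifier is confident about,
    # treating them as "clean" pseudo-labels. (The real filtering and
    # aggregation criterion is more involved.)
    clf = LogisticRegression(max_iter=200).fit(X, labels)
    conf = clf.predict_proba(X).max(axis=1)
    clean = conf > 0.9
    # In CEIL, the language model would now be fine-tuned on the clean
    # pseudo-labels, improving X for the next iteration; omitted here.
    print(f"iter {it}: {clean.sum()} high-confidence pseudo-labels")
```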
arXiv Detail & Related papers (2023-04-20T14:04:31Z)
- Deep Clustering: A Comprehensive Survey [53.387957674512585]
Clustering analysis plays an indispensable role in machine learning and data mining.
Deep clustering, which can learn clustering-friendly representations using deep neural networks, has been broadly applied in a wide range of clustering tasks.
Existing surveys of deep clustering mainly focus on single-view settings and network architectures, ignoring the complex application scenarios of clustering.
arXiv Detail & Related papers (2022-10-09T02:31:32Z)
- DeepCluE: Enhanced Image Clustering via Multi-layer Ensembles in Deep Neural Networks [53.88811980967342]
This paper presents a Deep Clustering via Ensembles (DeepCluE) approach.
It bridges the gap between deep clustering and ensemble clustering by harnessing the power of multiple layers in deep neural networks.
Experimental results on six image datasets confirm the advantages of DeepCluE over the state-of-the-art deep clustering approaches.
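A standard way to fuse clusterings from several layers, in the spirit of ensemble clustering, is a co-association matrix; the sketch below uses random stand-ins for two layers' features, and the fusion details are assumptions rather than the paper's procedure.

```python
# Ensemble-clustering sketch: fuse per-layer cluster assignments through
# a co-association matrix (the general mechanism; details are assumed).
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(0)
n = 12
layer_feats = [rng.normal(size=(n, 16)), rng.normal(size=(n, 8))]

# Cluster the features from each network layer separately, then record
# how often each pair of samples lands in the same cluster.
co = np.zeros((n, n))
for feats in layer_feats:
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(feats)
    co += (labels[:, None] == labels[None, :]).astype(float)
co /= len(layer_feats)  # fraction of layers that co-cluster each pair

# Final consensus clustering on the co-association (similarity) matrix.
final = AgglomerativeClustering(
    n_clusters=3, metric="precomputed", linkage="average"
).fit_predict(1 - co)  # convert similarity to distance
print(final)
```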
arXiv Detail & Related papers (2022-06-01T09:51:38Z)
- You Never Cluster Alone [150.94921340034688]
We extend the mainstream contrastive learning paradigm to a cluster-level scheme, where all the data subjected to the same cluster contribute to a unified representation.
We define a set of categorical variables as clustering assignment confidence, which links the instance-level learning track with the cluster-level one.
By reparametrizing the assignment variables, the resulting model, TCC, is trained end-to-end, requiring no alternating steps.
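Reparametrizing categorical assignment variables is typically done with Gumbel-softmax; the toy illustration below is our assumption about the mechanism, not the paper's exact formulation.

```python
# Gumbel-softmax reparametrization of cluster assignments (assumed
# mechanism): gradients flow through the soft sampling, which is what
# enables end-to-end training without alternating optimization steps.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3, requires_grad=True)  # per-instance cluster logits
assign = F.gumbel_softmax(logits, tau=0.5, hard=False)
print(assign.sum(dim=1))  # each row sums to 1: a valid soft assignment
```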
arXiv Detail & Related papers (2021-06-03T14:59:59Z)
- Graph Contrastive Clustering [131.67881457114316]
We propose a novel graph contrastive learning framework, apply it to the clustering task, and arrive at the Graph Contrastive Clustering (GCC) method.
Specifically, on the one hand, a graph Laplacian-based contrastive loss is proposed to learn more discriminative and clustering-friendly features.
On the other hand, a novel graph-based contrastive learning strategy is proposed to learn more compact clustering assignments.
arXiv Detail & Related papers (2021-04-03T15:32:49Z)
- Relation Clustering in Narrative Knowledge Graphs [71.98234178455398]
Relational sentences in the original text are embedded (with SBERT) and clustered in order to merge semantically similar relations.
Preliminary tests show that such clustering might successfully detect similar relations and provide valuable preprocessing for semi-supervised approaches.
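A minimal version of this step, using the sentence-transformers library (the model name and distance threshold are examples, not the authors' choices), might look like:

```python
# Minimal sketch: embed relational sentences with SBERT and merge
# similar relations by clustering (model and threshold are examples).
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

sentences = [
    "Alice is married to Bob",
    "Bob is the husband of Alice",
    "Carol works for Acme Corp",
    "Acme Corp employs Carol",
]
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(sentences, normalize_embeddings=True)

# Merge semantically similar relations; the cosine-distance threshold
# is an assumption and would need tuning on the corpus at hand.
clusters = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.6,
    metric="cosine", linkage="average",
).fit_predict(emb)
print(clusters)  # e.g., the two marriage sentences should share a cluster
```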
arXiv Detail & Related papers (2020-11-27T10:43:04Z)
- Neural Text Classification by Jointly Learning to Cluster and Align [5.969960391685054]
We extend the neural text clustering approach to text classification tasks by inducing cluster centers via a latent variable model and interacting with distributional word embeddings.
The proposed method jointly learns word clustering centroids and clustering-token alignments, achieving state-of-the-art results on multiple benchmark datasets.
arXiv Detail & Related papers (2020-11-24T16:07:18Z)