CEIL: A General Classification-Enhanced Iterative Learning Framework for
Text Clustering
- URL: http://arxiv.org/abs/2304.11061v1
- Date: Thu, 20 Apr 2023 14:04:31 GMT
- Title: CEIL: A General Classification-Enhanced Iterative Learning Framework for
Text Clustering
- Authors: Mingjun Zhao, Mengzhen Wang, Yinglong Ma, Di Niu and Haijiang Wu
- Abstract summary: We propose a novel Classification-Enhanced Iterative Learning framework for short text clustering.
In each iteration, we first adopt a language model to retrieve the initial text representations.
After strict data filtering and aggregation processes, samples with clean category labels are retrieved, which serve as supervision information.
Finally, the updated language model with improved representation ability is used to enhance clustering in the next iteration.
- Score: 16.08402937918212
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text clustering, as one of the most fundamental challenges in unsupervised
learning, aims at grouping semantically similar text segments without relying
on human annotations. With the rapid development of deep learning, deep
clustering has achieved significant advantages over traditional clustering
methods. Despite their effectiveness, most existing deep text clustering methods
rely heavily on representations pre-trained in general domains, which may not
be the most suitable solution for clustering in specific target domains. To
address this issue, we propose CEIL, a novel Classification-Enhanced Iterative
Learning framework for short text clustering, which aims to generally promote
clustering performance by introducing a classification objective to
iteratively improve feature representations. In each iteration, we first adopt
a language model to retrieve the initial text representations, from which the
clustering results are collected using our proposed Category Disentangled
Contrastive Clustering (CDCC) algorithm. After strict data filtering and
aggregation processes, samples with clean category labels are retrieved, which
serve as supervision information to update the language model with the
classification objective via a prompt learning approach. Finally, the updated
language model with improved representation ability is used to enhance
clustering in the next iteration. Extensive experiments demonstrate that the
CEIL framework significantly improves the clustering performance over
iterations, and is generally effective on various clustering algorithms.
Moreover, by incorporating CEIL on CDCC, we achieve the state-of-the-art
clustering performance on a wide range of short text clustering benchmarks,
outperforming other strong baseline methods.
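As a concrete reading of the loop described in the abstract, the sketch below shows one possible iteration in Python. It is illustrative only: the `encoder` object and its `encode_texts` and `finetune_classifier` methods are hypothetical stand-ins for the language model and the prompt-learning classification step, and plain k-means with a distance-to-centroid filter stands in for the CDCC algorithm and the paper's filtering and aggregation procedure, whose details the abstract does not spell out.

```python
# Illustrative sketch only: not the authors' implementation.
import numpy as np
from sklearn.cluster import KMeans

def ceil_style_loop(texts, encoder, num_clusters, num_iters=3, keep_ratio=0.5):
    """Iterate: encode -> cluster -> filter confident samples -> fine-tune."""
    for _ in range(num_iters):
        # 1. Text representations from the current language model.
        embeddings = encoder.encode_texts(texts)                  # (N, d) array

        # 2. Cluster the representations (CDCC in the paper; k-means here).
        km = KMeans(n_clusters=num_clusters, n_init=10).fit(embeddings)
        labels = km.labels_

        # 3. Filtering: keep the samples closest to their centroid and treat
        #    their cluster ids as "clean" pseudo category labels.
        dists = np.linalg.norm(embeddings - km.cluster_centers_[labels], axis=1)
        keep = dists <= np.quantile(dists, keep_ratio)

        # 4. Update the language model with a classification objective on the
        #    filtered subset (prompt learning in the paper; hypothetical here).
        encoder.finetune_classifier(
            [t for t, k in zip(texts, keep) if k], labels[keep])

    # Final clustering with the improved representations.
    final = encoder.encode_texts(texts)
    return KMeans(n_clusters=num_clusters, n_init=10).fit_predict(final)
```

The intended effect is that each pass tightens the representations around the pseudo categories found in the previous pass, which matches the abstract's claim of clustering performance improving over iterations.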
Related papers
- Towards Realistic Zero-Shot Classification via Self Structural Semantic
Alignment [53.2701026843921]
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification.
In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary.
We propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning.
arXiv Detail & Related papers (2023-08-24T17:56:46Z) - Reinforcement Graph Clustering with Unknown Cluster Number [91.4861135742095]
We propose a new deep graph clustering method termed Reinforcement Graph Clustering.
In our proposed method, cluster number determination and unsupervised representation learning are unified within a single framework.
To provide feedback actions, a clustering-oriented reward function is proposed to enhance cohesion within the same cluster and separation between different clusters.
arXiv Detail & Related papers (2023-08-13T18:12:28Z) - Cluster-level Group Representativity Fairness in $k$-means Clustering [3.420467786581458]
Clustering algorithms could generate clusters such that different groups are disadvantaged within different clusters.
We develop a clustering algorithm, building upon the centroid clustering paradigm pioneered by classical algorithms.
We show that our method significantly enhances cluster-level group representativity fairness with low impact on cluster coherence.
arXiv Detail & Related papers (2022-12-29T22:02:28Z) - A Proposition-Level Clustering Approach for Multi-Document Summarization [82.4616498914049]
We revisit the clustering approach, grouping together propositions for more precise information alignment.
Our method detects salient propositions, clusters them into paraphrastic clusters, and generates a representative sentence for each cluster by fusing its propositions.
Our summarization method improves over the previous state-of-the-art MDS method on the DUC 2004 and TAC 2011 datasets.
arXiv Detail & Related papers (2021-12-16T10:34:22Z) - Learning the Precise Feature for Cluster Assignment [39.320210567860485]
We propose a framework which integrates representation learning and clustering into a single pipeline for the first time.
The proposed framework exploits the powerful ability of recently developed generative models for learning intrinsic features.
Experimental results show that the performance of the proposed method is superior to, or at least comparable with, that of state-of-the-art methods.
arXiv Detail & Related papers (2021-06-11T04:08:54Z) - You Never Cluster Alone [150.94921340034688]
We extend the mainstream contrastive learning paradigm to a cluster-level scheme, where all the data assigned to the same cluster contribute to a unified representation.
We define a set of categorical variables as clustering assignment confidence, which links the instance-level learning track with the cluster-level one.
By reparametrizing the assignment variables, TCC is trained end-to-end, requiring no alternating steps.
arXiv Detail & Related papers (2021-06-03T14:59:59Z) - Graph Contrastive Clustering [131.67881457114316]
We propose a novel graph contrastive learning framework, which is then applied to the clustering task, yielding the Graph Contrastive Clustering (GCC) method.
Specifically, on the one hand, the graph Laplacian based contrastive loss is proposed to learn more discriminative and clustering-friendly features.
On the other hand, a novel graph-based contrastive learning strategy is proposed to learn more compact clustering assignments.
arXiv Detail & Related papers (2021-04-03T15:32:49Z) - CAC: A Clustering Based Framework for Classification [20.372627144885158]
We design a simple, efficient, and generic framework called Classification Aware Clustering (CAC).
Our experiments on synthetic and real benchmark datasets demonstrate the efficacy of CAC over previous methods for combined clustering and classification.
arXiv Detail & Related papers (2021-02-23T18:59:39Z) - Neural Text Classification by Jointly Learning to Cluster and Align [5.969960391685054]
We extend the neural text clustering approach to text classification tasks by inducing cluster centers via a latent variable model and interacting with distributional word embeddings.
The proposed method jointly learns word clustering centroids and clustering-token alignments, achieving state-of-the-art results on multiple benchmark datasets.
arXiv Detail & Related papers (2020-11-24T16:07:18Z) - Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z) - Enhancement of Short Text Clustering by Iterative Classification [0.0]
Iterative classification applies outlier removal to obtain outlier-free clusters.
It then trains a classification algorithm on the non-outliers based on their cluster distributions.
Repeating this process several times yields a much improved clustering of texts; a minimal sketch of this loop is given after the list.
arXiv Detail & Related papers (2020-01-31T02:12:05Z)
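The last entry above (iterative classification) is the closest precursor to CEIL's loop and can be illustrated with a short, self-contained sketch. The choices below are assumptions made for illustration, not the original paper's setup: TF-IDF features, k-means for the initial clustering, a distance-to-centroid outlier criterion, and a logistic-regression classifier.

```python
# Illustrative sketch of iterative classification with outlier removal;
# the specific components are assumptions, not the original paper's setup.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def iterative_classification(texts, num_clusters, num_iters=5, outlier_frac=0.2):
    X = TfidfVectorizer(max_features=2000).fit_transform(texts).toarray()
    labels = KMeans(n_clusters=num_clusters, n_init=10).fit_predict(X)
    for _ in range(num_iters):
        # Flag the samples farthest from their cluster centroid as outliers
        # (assumes every cluster keeps at least one member).
        centroids = np.vstack(
            [X[labels == c].mean(axis=0) for c in range(num_clusters)])
        dists = np.linalg.norm(X - centroids[labels], axis=1)
        inliers = dists <= np.quantile(dists, 1.0 - outlier_frac)

        # Train a classifier on the outlier-free clusters, then relabel all texts.
        clf = LogisticRegression(max_iter=1000).fit(X[inliers], labels[inliers])
        labels = clf.predict(X)
    return labels
```

Read against this baseline, CEIL additionally re-learns the text representations themselves from the filtered pseudo labels at every pass, rather than keeping the features fixed.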