ClusterLLM: Large Language Models as a Guide for Text Clustering
- URL: http://arxiv.org/abs/2305.14871v2
- Date: Fri, 3 Nov 2023 19:40:21 GMT
- Title: ClusterLLM: Large Language Models as a Guide for Text Clustering
- Authors: Yuwei Zhang, Zihan Wang, Jingbo Shang
- Abstract summary: We introduce ClusterLLM, a novel text clustering framework that leverages feedback from an instruction-tuned large language model, such as ChatGPT.
ClusterLLM consistently improves clustering quality, at an average cost of $0.6 per dataset.
- Score: 45.835625439515
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce ClusterLLM, a novel text clustering framework that leverages
feedback from an instruction-tuned large language model, such as ChatGPT.
Compared with traditional unsupervised methods that build upon "small"
embedders, ClusterLLM exhibits two intriguing advantages: (1) it enjoys the
emergent capabilities of LLMs even when their embeddings are inaccessible; and (2) it
understands the user's preference on clustering through textual instructions
and/or a few annotated data points. First, we prompt ChatGPT for insights on
clustering perspective by constructing hard triplet questions <does A better
correspond to B than C>, where A, B and C are similar data points that belong
to different clusters according to a small embedder. We empirically show that
this strategy is both effective for fine-tuning the small embedder and
cost-efficient for querying ChatGPT. Second, we prompt ChatGPT for help with
clustering granularity via carefully designed pairwise questions <do A and B
belong to the same category>, and tune the granularity by selecting, from the
cluster hierarchy, the level most consistent with the ChatGPT answers. Extensive experiments on
14 datasets show that ClusterLLM consistently improves clustering quality, at
an average cost of ~$0.6 per dataset. The code will be available at
https://github.com/zhang-yu-wei/ClusterLLM.
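As a concrete picture of the two prompting stages, here is a minimal Python sketch, assuming a hypothetical `ask_llm` helper in place of a real ChatGPT request; the function names and prompt wording are ours for illustration, not the released code.

```python
# Hypothetical sketch of ClusterLLM's two prompting stages; `ask_llm`
# stands in for a ChatGPT request and is not part of the released code.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.metrics.pairwise import cosine_similarity

def ask_llm(prompt: str) -> str:
    """Placeholder for an instruction-tuned LLM call (e.g. ChatGPT)."""
    raise NotImplementedError

def hard_triplet_queries(texts, embeddings, labels, n_neighbors=10):
    """Stage 1: build <does A better correspond to B than C> queries,
    where B and C are near the anchor A but fall in different clusters
    of the small embedder; LLM answers become fine-tuning triplets."""
    sim = cosine_similarity(embeddings)
    np.fill_diagonal(sim, -np.inf)
    queries = []
    for a in range(len(texts)):
        nearest = np.argsort(-sim[a])[:n_neighbors]
        same = [j for j in nearest if labels[j] == labels[a]]
        diff = [j for j in nearest if labels[j] != labels[a]]
        if same and diff:
            b, c = same[0], diff[0]
            prompt = (f"Does '{texts[a]}' better correspond to "
                      f"'{texts[b]}' than to '{texts[c]}'? Answer B or C.")
            queries.append((a, b, c, prompt))
    return queries

def choose_granularity(texts, embeddings, candidate_ks, n_pairs=32, seed=0):
    """Stage 2: ask <do A and B belong to the same category> for a few
    sampled pairs, then keep the hierarchy cut that agrees most often."""
    Z = linkage(embeddings, method="ward")
    rng = np.random.default_rng(seed)
    pairs = [tuple(rng.choice(len(texts), 2, replace=False))
             for _ in range(n_pairs)]
    answers = [ask_llm(f"Do '{texts[i]}' and '{texts[j]}' belong to the "
                       "same category? Answer yes or no.").strip() == "yes"
               for i, j in pairs]
    best_k, best_agree = None, -1
    for k in candidate_ks:
        cut = fcluster(Z, t=k, criterion="maxclust")
        agree = sum((cut[i] == cut[j]) == ans
                    for (i, j), ans in zip(pairs, answers))
        if agree > best_agree:
            best_k, best_agree = k, agree
    return best_k
```

In the paper, the triplet answers then supervise fine-tuning of the small embedder; the point of hard-triplet selection is to spend API calls only on the ambiguous cases.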
Related papers
- Text Clustering as Classification with LLMs [6.030435811868953]
This study presents a novel framework for text clustering that effectively leverages the in-context learning capacity of Large Language Models (LLMs).
Instead of fine-tuning embedders, we propose transforming text clustering into a classification task via an LLM.
Our framework has been experimentally shown to achieve performance comparable or superior to state-of-the-art clustering methods.
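A rough sketch of that recipe, reusing the hypothetical `ask_llm` placeholder from the sketch above (our illustration, not the paper's code):

```python
def ask_llm(prompt: str) -> str:  # hypothetical LLM call, as above
    raise NotImplementedError

def cluster_as_classify(texts, n_labels=10):
    """Two-step sketch: (1) the LLM proposes candidate category names
    from a sample of texts, (2) it assigns every text to one of them."""
    sample = "\n".join(texts[:30])
    labels = ask_llm(f"Propose at most {n_labels} short category names, "
                     f"one per line, for these texts:\n{sample}").splitlines()
    return labels, [ask_llm(f"Classify this text into exactly one of "
                            f"{labels}: '{t}'. Answer with the category "
                            "name only.") for t in texts]
```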
arXiv Detail & Related papers (2024-09-30T16:57:34Z)
- ABCDE: Application-Based Cluster Diff Evals [49.1574468325115]
ABCDE aims to be practical: it allows items to carry application-specific importance values, it is frugal in its use of human judgements when determining which clustering is better, and it can report metrics for arbitrary slices of items.
The approach to measuring the delta in clustering quality is novel: instead of constructing an expensive ground truth up front and evaluating each clustering against it, ABCDE samples questions for judgement on the basis of the actual diffs between the clusterings.
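A toy version of that diff-based sampling, under our own simplifying assumptions (this is not ABCDE's actual sampler):

```python
import itertools, random

def sample_diff_pairs(labels_a, labels_b, n=10, seed=0):
    """Collect item pairs that are together under one clustering but
    apart under the other, then sample a few of them for human
    judgement instead of building a full ground truth."""
    disagree = [(i, j) for i, j in
                itertools.combinations(range(len(labels_a)), 2)
                if (labels_a[i] == labels_a[j]) != (labels_b[i] == labels_b[j])]
    return random.Random(seed).sample(disagree, min(n, len(disagree)))
```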
arXiv Detail & Related papers (2024-07-31T08:29:35Z)
- Reinforcement Graph Clustering with Unknown Cluster Number [91.4861135742095]
We propose a new deep graph clustering method termed Reinforcement Graph Clustering.
In our proposed method, cluster number determination and unsupervised representation learning are unified within a single framework.
To provide feedback actions, a clustering-oriented reward function is proposed to enhance the cohesion within clusters and the separation between different clusters.
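One plausible reading of such a reward, sketched below with cosine similarities (our guess at the shape of the function, not the authors' exact definition):

```python
import numpy as np

def clustering_reward(embeddings, labels):
    """Score a partition by cohesion minus separation: mean cosine
    similarity inside clusters minus mean similarity across clusters
    (one plausible reading of a clustering-oriented reward)."""
    labels = np.asarray(labels)
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = E @ E.T
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)          # drop self-similarity
    cross = labels[:, None] != labels[None, :]
    intra = sim[same].mean() if same.any() else 0.0
    inter = sim[cross].mean() if cross.any() else 0.0
    return intra - inter
```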
arXiv Detail & Related papers (2023-08-13T18:12:28Z)
- Large Language Models Enable Few-Shot Clustering [88.06276828752553]
We show that large language models can amplify an expert's guidance to enable query-efficient, few-shot semi-supervised text clustering.
We find that incorporating LLMs in the first two stages routinely provides significant improvements in cluster quality.
arXiv Detail & Related papers (2023-07-02T09:17:11Z)
- Deep Multi-View Subspace Clustering with Anchor Graph [11.291831842959926]
We propose a novel deep multi-view subspace clustering method with anchor graph (DMCAG).
DMCAG learns embedded features for each view independently and uses them to obtain the subspace representations.
Our method achieves superior clustering performance over other state-of-the-art methods.
arXiv Detail & Related papers (2023-05-11T16:17:43Z)
- Self-supervised Contrastive Attributed Graph Clustering [110.52694943592974]
We propose a novel attributed graph clustering network, namely Self-supervised Contrastive Attributed Graph Clustering (SCAGC).
In SCAGC, a self-supervised contrastive loss is designed for node representation learning by leveraging inaccurate clustering labels.
For out-of-sample (OOS) nodes, SCAGC can directly calculate their clustering labels.
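As a hedged sketch of a contrastive loss driven by possibly inaccurate cluster labels (a generic SupCon-style form, not necessarily SCAGC's exact loss):

```python
import numpy as np

def pseudo_label_contrastive_loss(z, pseudo_labels, tau=0.5):
    """Nodes sharing a pseudo-label are pulled together, others pushed
    apart: average log-probability of positives per anchor, negated."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    logits = z @ z.T / tau
    np.fill_diagonal(logits, -np.inf)           # exclude self-pairs
    pseudo_labels = np.asarray(pseudo_labels)
    pos = pseudo_labels[:, None] == pseudo_labels[None, :]
    np.fill_diagonal(pos, False)
    log_prob = logits - np.log(np.exp(logits).sum(1, keepdims=True))
    per_anchor = (np.where(pos, log_prob, 0.0).sum(1)
                  / np.maximum(pos.sum(1), 1))
    return -per_anchor.mean()
```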
arXiv Detail & Related papers (2021-10-15T03:25:28Z)
- You Never Cluster Alone [150.94921340034688]
We extend the mainstream contrastive learning paradigm to a cluster-level scheme, where all the data assigned to the same cluster contribute to a unified representation.
We define a set of categorical variables as clustering assignment confidences, which link the instance-level learning track with the cluster-level one.
By reparametrizing the assignment variables, the resulting model, TCC, is trained end-to-end, requiring no alternating steps.
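One standard way to reparametrize categorical assignment variables for end-to-end training is a Gumbel-softmax relaxation; the sketch below shows that generic trick, which may differ from the paper's exact choice.

```python
import numpy as np

def gumbel_softmax(logits, tau=0.5, rng=None):
    """Differentiable sample from a categorical distribution: add
    Gumbel noise to the logits, then take a temperature-scaled
    softmax, the standard relaxation for end-to-end assignment."""
    rng = rng or np.random.default_rng()
    gumbel = -np.log(-np.log(rng.uniform(size=np.shape(logits))))
    y = (np.asarray(logits) + gumbel) / tau
    y = np.exp(y - y.max(axis=-1, keepdims=True))
    return y / y.sum(axis=-1, keepdims=True)
```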
arXiv Detail & Related papers (2021-06-03T14:59:59Z)
- Unsupervised Visual Representation Learning by Online Constrained K-Means [44.38989920488318]
Cluster discrimination is an effective pretext task for unsupervised representation learning.
We propose a novel clustering-based pretext task with online Constrained K-means (CoKe).
Our online assignment method has a theoretical guarantee to approach the global optimum.
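The constraint can be pictured with a toy online update: the sketch below caps cluster sizes during assignment, one crude stand-in for a size constraint; CoKe's actual formulation and its optimality guarantee are more involved.

```python
import numpy as np

def online_constrained_kmeans(X, k, max_size, epochs=5, lr=0.1, seed=0):
    """Toy online k-means whose clusters may hold at most `max_size`
    points per epoch, so no single cluster absorbs everything."""
    assert max_size * k >= len(X), "cap must leave room for all points"
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(epochs):
        counts = np.zeros(k, dtype=int)
        for x in X[rng.permutation(len(X))]:
            dist = ((centers - x) ** 2).sum(axis=1)
            dist[counts >= max_size] = np.inf    # full clusters closed
            j = int(np.argmin(dist))
            counts[j] += 1
            centers[j] += lr * (x - centers[j])  # online center update
    return centers
```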
arXiv Detail & Related papers (2021-05-24T20:38:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.