Text Clustering as Classification with LLMs
- URL: http://arxiv.org/abs/2410.00927v2
- Date: Thu, 02 Jan 2025 08:53:50 GMT
- Title: Text Clustering as Classification with LLMs
- Authors: Chen Huang, Guoxiu He
- Abstract summary: This study presents a novel framework for text clustering that effectively leverages the in-context learning capacity of Large Language Models (LLMs).
Instead of fine-tuning embedders, we propose to transform text clustering into a classification task via an LLM.
Our framework has been experimentally shown to achieve performance comparable or superior to state-of-the-art clustering methods.
- Score: 6.030435811868953
- Abstract: Text clustering remains valuable in real-world applications where manual labeling is cost-prohibitive. It facilitates efficient organization and analysis of information by grouping similar texts based on their representations. However, implementing this approach necessitates fine-tuned embedders for downstream data and sophisticated similarity metrics. To address these issues, this study presents a novel framework for text clustering that effectively leverages the in-context learning capacity of Large Language Models (LLMs). Instead of fine-tuning embedders, we propose to transform text clustering into a classification task via an LLM. First, we prompt the LLM to generate potential labels for a given dataset. Second, after merging similar labels generated by the LLM, we prompt the LLM to assign the most appropriate label to each sample in the dataset. Our framework has been experimentally shown to achieve performance comparable or superior to state-of-the-art clustering methods that employ embeddings, without requiring complex fine-tuning or clustering algorithms. Our code is publicly available at https://github.com/ECNU-Text-Computing/Text-Clustering-via-LLM.
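Below is a minimal Python sketch of the two-stage pipeline the abstract describes: generate candidate labels, merge near-duplicates, then classify each sample against the merged label set. The `llm` callable, the prompt wording, and the merging heuristic are illustrative assumptions, not the authors' exact implementation; see the linked repository for that.

```python
# Sketch of the clustering-as-classification pipeline described above.
# `llm` is a placeholder for any chat-completion call; prompts and the
# label-merging step are illustrative, not the paper's exact ones.
from typing import Callable, List

def generate_labels(texts: List[str], llm: Callable[[str], str]) -> List[str]:
    """Stage 1: ask the LLM to propose a candidate label for each sample."""
    labels = []
    for text in texts:
        prompt = f"Suggest a short topic label for this text:\n{text}\nLabel:"
        labels.append(llm(prompt).strip())
    return labels

def merge_labels(labels: List[str], llm: Callable[[str], str]) -> List[str]:
    """Collapse near-duplicate labels into a deduplicated label set."""
    unique = sorted(set(labels))
    prompt = ("Merge labels that mean the same thing and return one label "
              "per line:\n" + "\n".join(unique))
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]

def assign_labels(texts: List[str], label_set: List[str],
                  llm: Callable[[str], str]) -> List[str]:
    """Stage 2: classify each sample against the merged label set."""
    options = ", ".join(label_set)
    assignments = []
    for text in texts:
        prompt = (f"Choose the single best label from [{options}] "
                  f"for this text:\n{text}\nLabel:")
        assignments.append(llm(prompt).strip())
    return assignments

def cluster(texts: List[str], llm: Callable[[str], str]) -> List[str]:
    label_set = merge_labels(generate_labels(texts, llm), llm)
    return assign_labels(texts, label_set, llm)
```

Plugging in any chat model for `llm` yields one label string per input text, so cluster membership falls out of classification with no embedder fine-tuning or distance metric involved.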
Related papers
- Idiosyncrasies in Large Language Models [54.26923012617675]
We unveil and study idiosyncrasies in Large Language Models (LLMs).
We find that fine-tuning existing text embedding models on LLM-generated texts yields excellent classification accuracy.
We leverage LLM as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies.
arXiv Detail & Related papers (2025-02-17T18:59:02Z)
- k-LLMmeans: Summaries as Centroids for Interpretable and Scalable LLM-Based Text Clustering [0.0]
We introduce k-LLMmeans, a novel modification of the k-means clustering algorithm that utilizes LLMs to generate textual summaries as cluster centroids.
This modification preserves the properties of k-means while offering greater interpretability.
We present a case study showcasing the interpretability of evolving cluster centroids in sequential text streams.
arXiv Detail & Related papers (2025-02-12T19:50:22Z)
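A minimal sketch of the centroid update the k-LLMmeans entry above describes: standard k-means assignment, but each centroid is recomputed by summarizing the cluster's texts with an LLM and re-embedding the summary. The `embed` and `summarize` helpers are assumed stand-ins for any sentence embedder and any LLM summarization call, not the paper's code.

```python
# Sketch of k-means with LLM summaries as interpretable centroids.
import numpy as np
from typing import Callable, List

def k_llmmeans(texts: List[str], k: int,
               embed: Callable[[List[str]], np.ndarray],
               summarize: Callable[[List[str]], str],
               iters: int = 5) -> List[int]:
    X = embed(texts)                       # (n, d) embeddings of the inputs
    rng = np.random.default_rng(0)
    centroids = X[rng.choice(len(texts), size=k, replace=False)].copy()
    assign = np.zeros(len(texts), dtype=int)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        # Update step: an LLM summary of each cluster becomes the new,
        # human-readable centroid after re-embedding it.
        for j in range(k):
            members = [t for t, a in zip(texts, assign) if a == j]
            if members:  # keep the previous centroid if a cluster empties
                centroids[j] = embed([summarize(members)])[0]
    return assign.tolist()
```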
- On Unsupervised Prompt Learning for Classification with Black-box Language Models [71.60563181678323]
Large language models (LLMs) have achieved impressive success in text-formatted learning problems.
LLMs can label datasets with even better quality than skilled human annotators.
In this paper, we propose unsupervised prompt learning for classification with black-box LLMs.
arXiv Detail & Related papers (2024-10-04T03:39:28Z)
- Adaptable and Reliable Text Classification using Large Language Models [7.962669028039958]
This paper introduces an adaptable and reliable text classification paradigm that leverages Large Language Models (LLMs).
We evaluated the performance of several LLMs, machine learning algorithms, and neural network-based architectures on four diverse datasets.
It is shown that the system's performance can be further enhanced through few-shot or fine-tuning strategies.
arXiv Detail & Related papers (2024-05-17T04:05:05Z)
- PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval [76.50690734636477]
We propose PromptReps, which combines the advantages of both categories: no need for training and the ability to retrieve from the whole corpus.
The retrieval system harnesses both dense text embedding and sparse bag-of-words representations.
arXiv Detail & Related papers (2024-04-29T04:51:30Z)
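A sketch of the dual representations the PromptReps entry above mentions: prompt an LLM to compress a passage into one word, then reuse the final token's hidden state as a dense embedding and its output logits over the vocabulary as a sparse bag-of-words vector. The model choice and prompt wording here are assumptions, not the paper's exact setup.

```python
# Sketch of prompting an LLM for dense and sparse document representations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the paper uses larger instruction-tuned LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def prompt_reps(passage: str):
    prompt = f'Passage: "{passage}"\nUse one word to represent the passage: "'
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    dense = out.hidden_states[-1][0, -1]    # last token's hidden state
    sparse = torch.relu(out.logits[0, -1])  # non-negative vocabulary weights
    return dense, sparse
```

Both vectors come from a single forward pass with no training, which is what lets such a system retrieve from a whole corpus training-free.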
- Zero-Shot Topic Classification of Column Headers: Leveraging LLMs for Metadata Enrichment [0.0]
We propose a method to support metadata enrichment using topic annotations generated by three Large Language Models (LLMs): ChatGPT-3.5, GoogleBard, and GoogleGemini.
We evaluate the impact of contextual information (i.e., dataset description) on the classification outcomes.
arXiv Detail & Related papers (2024-03-01T10:01:36Z)
- Learning to Prompt with Text Only Supervision for Vision-Language Models [107.282881515667]
One branch of methods adapts CLIP by learning prompts using visual information.
An alternative approach resorts to training-free methods by generating class descriptions from large language models.
We propose to combine the strengths of both streams by learning prompts using only text data.
arXiv Detail & Related papers (2024-01-04T18:59:49Z)
- Generalized Category Discovery with Clustering Assignment Consistency [56.92546133591019]
Generalized category discovery (GCD) is a recently proposed open-world task.
We propose a co-training-based framework that encourages clustering consistency.
Our method achieves state-of-the-art performance on three generic benchmarks and three fine-grained visual recognition datasets.
arXiv Detail & Related papers (2023-10-30T00:32:47Z)
- Large Language Models Enable Few-Shot Clustering [88.06276828752553]
We show that large language models can amplify an expert's guidance to enable query-efficient, few-shot semi-supervised text clustering.
We find incorporating LLMs in the first two stages can routinely provide significant improvements in cluster quality.
arXiv Detail & Related papers (2023-07-02T09:17:11Z)
- Beyond prompting: Making Pre-trained Language Models Better Zero-shot Learners by Clustering Representations [24.3378487252621]
We show that zero-shot text classification can be improved simply by clustering texts in the embedding spaces of pre-trained language models.
Our approach achieves an average of 20% absolute improvement over prompt-based zero-shot learning.
arXiv Detail & Related papers (2022-10-29T16:01:51Z)
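A minimal sketch of the "cluster then label" zero-shot recipe from the last entry above: cluster documents in a pre-trained embedding space, then name each cluster by its similarity to embedded class names. The encoder, clustering algorithm, and cluster-to-label alignment heuristic are assumptions, not that paper's exact choices.

```python
# Sketch of zero-shot classification by clustering pre-trained embeddings.
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

def zero_shot_via_clustering(texts, class_names):
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in PLM encoder
    X = encoder.encode(texts, normalize_embeddings=True)
    km = KMeans(n_clusters=len(class_names), n_init=10).fit(X)
    # Map each cluster centroid to the nearest embedded class name.
    C = encoder.encode(class_names, normalize_embeddings=True)
    centroids = km.cluster_centers_
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    cluster_to_class = (centroids @ C.T).argmax(axis=1)
    return [class_names[cluster_to_class[c]] for c in km.labels_]
```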
This list is automatically generated from the titles and abstracts of the papers on this site.