Text Clustering as Classification with LLMs
- URL: http://arxiv.org/abs/2410.00927v2
- Date: Thu, 02 Jan 2025 08:53:50 GMT
- Title: Text Clustering as Classification with LLMs
- Authors: Chen Huang, Guoxiu He
- Abstract summary: This study presents a novel framework for text clustering that effectively leverages the in-context learning capacity of Large Language Models (LLMs).
Instead of fine-tuning embedders, we propose to transform text clustering into a classification task via an LLM.
Our framework has been experimentally shown to achieve performance comparable or superior to state-of-the-art clustering methods.
- Score: 6.030435811868953
- Abstract: Text clustering remains valuable in real-world applications where manual labeling is cost-prohibitive. It facilitates efficient organization and analysis of information by grouping similar texts based on their representations. However, implementing this approach necessitates fine-tuned embedders for downstream data and sophisticated similarity metrics. To address these issues, this study presents a novel framework for text clustering that effectively leverages the in-context learning capacity of Large Language Models (LLMs). Instead of fine-tuning embedders, we propose to transform text clustering into a classification task via an LLM. First, we prompt the LLM to generate potential labels for a given dataset. Second, after merging similar labels generated by the LLM, we prompt the LLM to assign the most appropriate label to each sample in the dataset. Our framework has been experimentally shown to achieve performance comparable or superior to state-of-the-art clustering methods that employ embeddings, without requiring complex fine-tuning or clustering algorithms. Our code is publicly available at https://github.com/ECNU-Text-Computing/Text-Clustering-via-LLM.
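Below is a minimal Python sketch of the two-stage pipeline the abstract describes: generate candidate labels, merge near-duplicates, then classify each sample against the merged label set. The `llm` callable, the prompt wording, and the merging heuristic are illustrative assumptions, not the authors' exact implementation; see the linked repository for that.

```python
# Sketch of the clustering-as-classification pipeline described above.
# `llm` is a placeholder for any chat-completion call; prompts and the
# label-merging step are illustrative, not the paper's exact ones.
from typing import Callable, List

def generate_labels(texts: List[str], llm: Callable[[str], str]) -> List[str]:
    """Stage 1: ask the LLM to propose a candidate label for each sample."""
    labels = []
    for text in texts:
        prompt = f"Suggest a short topic label for this text:\n{text}\nLabel:"
        labels.append(llm(prompt).strip())
    return labels

def merge_labels(labels: List[str], llm: Callable[[str], str]) -> List[str]:
    """Collapse near-duplicate labels into a deduplicated label set."""
    unique = sorted(set(labels))
    prompt = ("Merge labels that mean the same thing and return one label "
              "per line:\n" + "\n".join(unique))
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]

def assign_labels(texts: List[str], label_set: List[str],
                  llm: Callable[[str], str]) -> List[str]:
    """Stage 2: classify each sample against the merged label set."""
    options = ", ".join(label_set)
    assignments = []
    for text in texts:
        prompt = (f"Choose the single best label from [{options}] "
                  f"for this text:\n{text}\nLabel:")
        assignments.append(llm(prompt).strip())
    return assignments

def cluster(texts: List[str], llm: Callable[[str], str]) -> List[str]:
    label_set = merge_labels(generate_labels(texts, llm), llm)
    return assign_labels(texts, label_set, llm)
```

Plugging in any chat model for `llm` yields one label string per input text, so cluster membership falls out of classification with no embedder fine-tuning or distance metric involved.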
Related papers
- Idiosyncrasies in Large Language Models [54.26923012617675]
We unveil and study idiosyncrasies in Large Language Models (LLMs).
We find that fine-tuning existing text embedding models on LLM-generated texts yields excellent classification accuracy.
We leverage LLM as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies.
arXiv Detail & Related papers (2025-02-17T18:59:02Z)
- k-LLMmeans: Summaries as Centroids for Interpretable and Scalable LLM-Based Text Clustering [0.0]
We introduce k-LLMmeans, a novel modification of the k-means clustering algorithm that utilizes LLMs to generate textual summaries as cluster centroids.
This modification preserves the properties of k-means while offering greater interpretability.
We present a case study showcasing the interpretability of evolving cluster centroids in sequential text streams.
arXiv Detail & Related papers (2025-02-12T19:50:22Z)
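A minimal sketch of the centroid update the k-LLMmeans entry above describes: standard k-means assignment, but each centroid is recomputed by summarizing the cluster's texts with an LLM and re-embedding the summary. The `embed` and `summarize` helpers are assumed stand-ins for any sentence embedder and any LLM summarization call, not the paper's code.

```python
# Sketch of k-means with LLM summaries as interpretable centroids.
import numpy as np
from typing import Callable, List

def k_llmmeans(texts: List[str], k: int,
               embed: Callable[[List[str]], np.ndarray],
               summarize: Callable[[List[str]], str],
               iters: int = 5) -> List[int]:
    X = embed(texts)                       # (n, d) embeddings of the inputs
    rng = np.random.default_rng(0)
    centroids = X[rng.choice(len(texts), size=k, replace=False)].copy()
    assign = np.zeros(len(texts), dtype=int)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        # Update step: an LLM summary of each cluster becomes the new,
        # human-readable centroid after re-embedding it.
        for j in range(k):
            members = [t for t, a in zip(texts, assign) if a == j]
            if members:  # keep the previous centroid if a cluster empties
                centroids[j] = embed([summarize(members)])[0]
    return assign.tolist()
```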
- On Unsupervised Prompt Learning for Classification with Black-box Language Models [71.60563181678323]
Large language models (LLMs) have achieved impressive success in text-formatted learning problems.
LLMs can label datasets with even better quality than skilled human annotators.
In this paper, we propose unsupervised prompt learning for classification with black-box LLMs.
arXiv Detail & Related papers (2024-10-04T03:39:28Z)
- Adaptable and Reliable Text Classification using Large Language Models [7.962669028039958]
This paper introduces an adaptable and reliable text classification paradigm that leverages Large Language Models (LLMs).
We evaluated the performance of several LLMs, machine learning algorithms, and neural network-based architectures on four diverse datasets.
It is shown that the system's performance can be further enhanced through few-shot or fine-tuning strategies.
arXiv Detail & Related papers (2024-05-17T04:05:05Z)
- PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval [76.50690734636477]
We propose PromptReps, which combines the advantages of both categories: no need for training and the ability to retrieve from the whole corpus.
The retrieval system harnesses both dense text embedding and sparse bag-of-words representations.
arXiv Detail & Related papers (2024-04-29T04:51:30Z)
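A sketch of the dual representations the PromptReps entry above mentions: prompt an LLM to compress a passage into one word, then reuse the final token's hidden state as a dense embedding and its output logits over the vocabulary as a sparse bag-of-words vector. The model choice and prompt wording here are assumptions, not the paper's exact setup.

```python
# Sketch of prompting an LLM for dense and sparse document representations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the paper uses larger instruction-tuned LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def prompt_reps(passage: str):
    prompt = f'Passage: "{passage}"\nUse one word to represent the passage: "'
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    dense = out.hidden_states[-1][0, -1]    # last token's hidden state
    sparse = torch.relu(out.logits[0, -1])  # non-negative vocabulary weights
    return dense, sparse
```

Both vectors come from a single forward pass with no training, which is what lets such a system retrieve from a whole corpus training-free.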
- Zero-Shot Topic Classification of Column Headers: Leveraging LLMs for Metadata Enrichment [0.0]
We propose a method to support metadata enrichment using topic annotations generated by three Large Language Models (LLMs): ChatGPT-3.5, GoogleBard, and GoogleGemini.
We evaluate the impact of contextual information (i.e., dataset description) on the classification outcomes.
arXiv Detail & Related papers (2024-03-01T10:01:36Z)
- Learning to Prompt with Text Only Supervision for Vision-Language Models [107.282881515667]
One branch of methods adapts CLIP by learning prompts using visual information.
An alternative approach resorts to training-free methods by generating class descriptions from large language models.
We propose to combine the strengths of both streams by learning prompts using only text data.
arXiv Detail & Related papers (2024-01-04T18:59:49Z)
- Generalized Category Discovery with Clustering Assignment Consistency [56.92546133591019]
Generalized category discovery (GCD) is a recently proposed open-world task.
We propose a co-training-based framework that encourages clustering consistency.
Our method achieves state-of-the-art performance on three generic benchmarks and three fine-grained visual recognition datasets.
arXiv Detail & Related papers (2023-10-30T00:32:47Z)
- Large Language Models Enable Few-Shot Clustering [88.06276828752553]
We show that large language models can amplify an expert's guidance to enable query-efficient, few-shot semi-supervised text clustering.
We find incorporating LLMs in the first two stages can routinely provide significant improvements in cluster quality.
arXiv Detail & Related papers (2023-07-02T09:17:11Z)
- Beyond prompting: Making Pre-trained Language Models Better Zero-shot Learners by Clustering Representations [24.3378487252621]
We show that zero-shot text classification can be improved simply by clustering texts in the embedding spaces of pre-trained language models.
Our approach achieves an average of 20% absolute improvement over prompt-based zero-shot learning.
arXiv Detail & Related papers (2022-10-29T16:01:51Z)
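A minimal sketch of the "cluster then label" zero-shot recipe from the last entry above: cluster documents in a pre-trained embedding space, then name each cluster by its similarity to embedded class names. The encoder, clustering algorithm, and cluster-to-label alignment heuristic are assumptions, not that paper's exact choices.

```python
# Sketch of zero-shot classification by clustering pre-trained embeddings.
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

def zero_shot_via_clustering(texts, class_names):
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in PLM encoder
    X = encoder.encode(texts, normalize_embeddings=True)
    km = KMeans(n_clusters=len(class_names), n_init=10).fit(X)
    # Map each cluster centroid to the nearest embedded class name.
    C = encoder.encode(class_names, normalize_embeddings=True)
    centroids = km.cluster_centers_
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    cluster_to_class = (centroids @ C.T).argmax(axis=1)
    return [class_names[cluster_to_class[c]] for c in km.labels_]
```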
This list is automatically generated from the titles and abstracts of the papers on this site.