Cost-Effective Text Clustering with Large Language Models
- URL: http://arxiv.org/abs/2504.15640v1
- Date: Tue, 22 Apr 2025 06:57:49 GMT
- Title: Cost-Effective Text Clustering with Large Language Models
- Authors: Hongtao Wang, Taiyan Zhang, Renchi Yang, Jianliang Xu
- Abstract summary: This paper proposes TECL, a cost-effective framework that taps into the feedback from large language models for accurate text clustering. Under the hood, TECL adopts our EdgeLLM or TriangleLLM to construct must-link/cannot-link constraints for text pairs. Our experiments on multiple benchmark datasets show that TECL consistently and considerably outperforms existing solutions in unsupervised text clustering.
- Score: 15.179854529085544
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text clustering aims to automatically partition a collection of text documents into distinct clusters based on linguistic features. In the literature, this task is usually framed either as metric clustering over text embeddings from pre-trained encoders or as graph clustering over pairwise similarities obtained from an oracle, e.g., a large ML model. Recently, large language models (LLMs) have brought significant advances to this field by offering contextualized text embeddings and highly accurate similarity scores, but they also introduce substantial computational and/or financial overhead due to the many API-based queries or inference calls they require. In response, this paper proposes TECL, a cost-effective framework that taps into LLM feedback for accurate text clustering within a limited budget of LLM queries. Under the hood, TECL adopts our EdgeLLM or TriangleLLM to construct must-link/cannot-link constraints for text pairs, and then uses these constraints as supervision signals for our weighted constrained clustering approach to generate clusters. In particular, EdgeLLM (resp. TriangleLLM) identifies informative text pairs (resp. triplets) for querying LLMs via carefully designed greedy algorithms and extracts accurate pairwise constraints through carefully crafted prompting techniques. Our experiments on multiple benchmark datasets show that TECL consistently and considerably outperforms existing solutions in unsupervised text clustering under the same LLM query cost.
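The abstract describes a three-step pipeline: select informative text pairs under a query budget, ask an LLM whether each pair must or cannot share a cluster, and feed the resulting constraints into a constrained clustering routine. The following is a minimal sketch of that pipeline under stated assumptions, not the TECL implementation: the similarity-ambiguity heuristic for pair selection, the `query_llm_pairwise` oracle, and the COP-KMeans-style assignment rule are all illustrative stand-ins.

```python
# Minimal sketch of an LLM-guided constrained clustering pipeline in the spirit
# of the abstract above. NOT the TECL implementation: pair selection, the LLM
# oracle, and the COP-KMeans-style assignment rule are illustrative assumptions.
import numpy as np

def select_ambiguous_pairs(emb, budget):
    """Greedy stand-in for informative pair selection: prefer pairs whose
    cosine similarity is closest to the corpus mean, i.e. the least certain."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = len(emb)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    mean_sim = float(np.mean([sims[i, j] for i, j in pairs]))
    pairs.sort(key=lambda p: abs(sims[p] - mean_sim))
    return pairs[:budget]

def query_llm_pairwise(doc_a, doc_b):
    """Single LLM query returning True (must-link) or False (cannot-link).
    Placeholder: plug in an actual API call with a pairwise prompt."""
    raise NotImplementedError

def constrained_kmeans(emb, k, must, cannot, iters=30, seed=0):
    """COP-KMeans-style clustering: when assigning a point, skip any centroid
    that would violate a constraint with an already-assigned point; if no
    centroid is feasible, fall back to the nearest one."""
    rng = np.random.default_rng(seed)
    centers = emb[rng.choice(len(emb), k, replace=False)].astype(float)
    labels = np.full(len(emb), -1)
    for _ in range(iters):
        labels[:] = -1
        for i in range(len(emb)):
            order = np.argsort(np.linalg.norm(centers - emb[i], axis=1))
            feasible = [c for c in order
                        if all(labels[j] == c for j in must.get(i, []) if labels[j] >= 0)
                        and all(labels[j] != c for j in cannot.get(i, []) if labels[j] >= 0)]
            labels[i] = feasible[0] if feasible else order[0]
        for c in range(k):  # recompute centroids, leaving empty clusters in place
            if np.any(labels == c):
                centers[c] = emb[labels == c].mean(axis=0)
    return labels
```

Under a fixed query budget, the pieces would be wired together roughly as follows: select a budgeted set of ambiguous pairs, label each pair with one LLM call, record the answers in symmetric must/cannot dictionaries (each pair stored in both directions), and pass them to `constrained_kmeans` together with the embeddings and the target number of clusters.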
Related papers
- Idiosyncrasies in Large Language Models [54.26923012617675]
We unveil and study idiosyncrasies in Large Language Models (LLMs).
We find that fine-tuning existing text embedding models on LLM-generated texts yields excellent classification accuracy.
We leverage LLM as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies.
arXiv Detail & Related papers (2025-02-17T18:59:02Z) - k-LLMmeans: Summaries as Centroids for Interpretable and Scalable LLM-Based Text Clustering [0.0]
We introduce k-LLMmeans, a novel modification of the k-means clustering algorithm that utilizes LLMs to generate textual summaries as cluster centroids.
This modification preserves the properties of k-means while offering greater interpretability; a rough sketch of the idea appears after this related-papers list.
We present a case study showcasing the interpretability of evolving cluster centroids in sequential text streams.
arXiv Detail & Related papers (2025-02-12T19:50:22Z) - TableTime: Reformulating Time Series Classification as Training-Free Table Understanding with Large Language Models [14.880203496664963]
Large language models (LLMs) have demonstrated their effectiveness in multivariate time series classification (MTSC). Existing methods directly encode time series into the latent space of LLMs from scratch to align with the LLMs' semantic space. We propose TableTime, which reformulates MTSC as a table understanding task.
arXiv Detail & Related papers (2024-11-24T07:02:32Z) - LLM$\times$MapReduce: Simplified Long-Sequence Processing using Large Language Models [73.13933847198395]
We propose a training-free framework for processing long texts, utilizing a divide-and-conquer strategy to achieve comprehensive document understanding.
The proposed LLM$\times$MapReduce framework splits the entire document into several chunks for LLMs to read and then aggregates the intermediate answers to produce the final output.
arXiv Detail & Related papers (2024-10-12T03:13:44Z) - Text Clustering as Classification with LLMs [6.030435811868953]
This study presents a novel framework for text clustering that effectively leverages the in-context learning capacity of Large Language Models (LLMs). Instead of fine-tuning embedders, we propose to transform text clustering into a classification task via LLMs. Our framework has been experimentally shown to achieve comparable or superior performance to state-of-the-art clustering methods.
arXiv Detail & Related papers (2024-09-30T16:57:34Z) - TACT: Advancing Complex Aggregative Reasoning with Information Extraction Tools [51.576974932743596]
Large Language Models (LLMs) often do not perform well on queries that require the aggregation of information across texts.
TACT contains challenging instructions that demand stitching information scattered across one or more texts.
We construct this dataset by leveraging an existing dataset of texts and their associated tables.
We demonstrate that all contemporary LLMs perform poorly on this dataset, achieving an accuracy below 38%.
arXiv Detail & Related papers (2024-06-05T20:32:56Z) - Human-interpretable clustering of short-text using large language models [0.0]
This work shows that large language models (LLMs) can overcome the limitations of traditional clustering approaches.
The resulting clusters are found to be more distinctive and more human-interpretable.
arXiv Detail & Related papers (2024-05-12T12:55:40Z) - Context-Aware Clustering using Large Language Models [20.971691166166547]
We propose CACTUS (Context-Aware ClusTering with aUgmented triplet losS) for efficient and effective supervised clustering of entity subsets.
This paper introduces a novel approach towards clustering entity subsets using Large Language Models (LLMs) by capturing context via a scalable inter-entity attention mechanism.
arXiv Detail & Related papers (2024-05-02T03:50:31Z) - Text Clustering with Large Language Model Embeddings [0.0]
The effectiveness of text clustering largely depends on the selection of textual embeddings and clustering algorithms. Recent advancements in large language models (LLMs) have the potential to enhance this task. Findings indicate that LLM embeddings are superior at capturing subtleties in structured language.
arXiv Detail & Related papers (2024-03-22T11:08:48Z) - FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models [79.62191017182518]
FollowBench is a multi-level, fine-grained constraints-following benchmark for large language models.
We introduce a Multi-level mechanism that incrementally adds a single constraint to the initial instruction at each successive level.
By evaluating 13 popular LLMs on FollowBench, we highlight the weaknesses of LLMs in instruction following and point towards potential avenues for future work.
arXiv Detail & Related papers (2023-10-31T12:32:38Z) - Evaluating, Understanding, and Improving Constrained Text Generation for Large Language Models [49.74036826946397]
This study investigates constrained text generation for large language models (LLMs).
Our research mainly focuses on mainstream open-source LLMs, categorizing constraints into lexical, structural, and relation-based types.
Results illuminate LLMs' capacities and deficiencies in incorporating constraints and provide insights for future developments in constrained text generation.
arXiv Detail & Related papers (2023-10-25T03:58:49Z) - Large Language Models Enable Few-Shot Clustering [88.06276828752553]
We show that large language models can amplify an expert's guidance to enable query-efficient, few-shot semi-supervised text clustering.
We find incorporating LLMs in the first two stages can routinely provide significant improvements in cluster quality.
arXiv Detail & Related papers (2023-07-02T09:17:11Z)
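As referenced in the k-LLMmeans entry above, that paper's central idea is to replace numeric centroids with LLM-written cluster summaries. The sketch below illustrates one plausible reading of that idea under stated assumptions; the `embed()` and `summarize()` helpers are hypothetical placeholders, not the paper's actual interface, and the loop structure is an assumption.

```python
# Rough sketch of the "summaries as centroids" idea described for k-LLMmeans.
# embed() and summarize() are hypothetical placeholders, not the paper's API.
import numpy as np

def embed(texts):
    """Return one embedding vector per text (any sentence encoder)."""
    raise NotImplementedError

def summarize(texts):
    """One LLM call, e.g. 'Write a short summary of the common topic of these texts.'"""
    raise NotImplementedError

def k_llm_means(docs, k, rounds=3, seed=0):
    emb = np.asarray(embed(docs), dtype=float)
    rng = np.random.default_rng(seed)
    centers = emb[rng.choice(len(docs), k, replace=False)]
    summaries = [None] * k
    for _ in range(rounds):
        # k-means-style assignment to the nearest centroid
        dists = np.linalg.norm(emb[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # replace each centroid with the embedding of an LLM summary of its members
        for c in range(k):
            members = [docs[i] for i in np.where(labels == c)[0]]
            if members:  # this sketch simply skips empty clusters
                summaries[c] = summarize(members)
                centers[c] = np.asarray(embed([summaries[c]]), dtype=float)[0]
    return labels, summaries
```

The interpretability claim follows from the return value: each cluster is described by its latest LLM-written summary rather than by an opaque embedding vector.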