Cequel: Cost-Effective Querying of Large Language Models for Text Clustering
- URL: http://arxiv.org/abs/2504.15640v2
- Date: Thu, 21 Aug 2025 06:36:11 GMT
- Title: Cequel: Cost-Effective Querying of Large Language Models for Text Clustering
- Authors: Hongtao Wang, Taiyan Zhang, Renchi Yang, Jianliang Xu
- Abstract summary: Text clustering aims to automatically partition a collection of documents into coherent groups based on their linguistic features. Recent advances in large language models (LLMs) have significantly improved this field by providing high-quality contextualized embeddings. We propose Cequel, a cost-effective framework that achieves accurate text clustering under a limited budget of LLM queries.
- Score: 15.179854529085544
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text clustering aims to automatically partition a collection of documents into coherent groups based on their linguistic features. In the literature, this task is formulated either as metric clustering over pre-trained text embeddings or as graph clustering based on pairwise similarities derived from an oracle, e.g., a large machine learning model. Recent advances in large language models (LLMs) have significantly improved this field by providing high-quality contextualized embeddings and accurate semantic similarity estimates. However, leveraging LLMs at scale introduces substantial computational and financial costs due to the large number of required API queries or inference calls. To address this issue, we propose Cequel, a cost-effective framework that achieves accurate text clustering under a limited budget of LLM queries. At its core, Cequel constructs must-link and cannot-link constraints by selectively querying LLMs on informative text pairs or triplets, identified via our proposed algorithms, EdgeLLM and TriangleLLM. These constraints are then utilized in a weighted constrained clustering algorithm to form high-quality clusters. Specifically, EdgeLLM and TriangleLLM employ carefully designed greedy selection strategies and prompting techniques to identify and extract informative constraints efficiently. Experiments on multiple benchmark datasets demonstrate that Cequel consistently outperforms existing methods in unsupervised text clustering under the same query budget.
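The pair-querying idea at the core of Cequel can be sketched as follows. This is a minimal illustration, not the paper's implementation: `llm_same_cluster` is a hypothetical stand-in for a real LLM API call (faked with a keyword check so the sketch runs offline), and the random pair selection replaces EdgeLLM's greedy strategy, which the abstract does not spell out.

```python
import random

def llm_same_cluster(a, b):
    """Hypothetical stand-in for one LLM query: does the model judge
    texts a and b to belong to the same cluster? Faked here with a
    keyword check purely so the sketch runs without an API."""
    topic = lambda t: next((w for w in ("sports", "science") if w in t), None)
    return topic(a) == topic(b)

def build_constraints(texts, budget):
    """Spend a fixed query budget on text pairs and collect must-link /
    cannot-link constraints (the role EdgeLLM plays in Cequel). Real
    Cequel selects informative pairs greedily; the random choice here
    is an illustrative simplification."""
    pairs = [(i, j) for i in range(len(texts)) for j in range(i + 1, len(texts))]
    rng = random.Random(0)  # fixed seed for reproducibility
    rng.shuffle(pairs)
    must, cannot = [], []
    for i, j in pairs[:budget]:
        (must if llm_same_cluster(texts[i], texts[j]) else cannot).append((i, j))
    return must, cannot

texts = ["sports news today", "a science discovery",
         "sports match results", "science journal issue"]
must, cannot = build_constraints(texts, budget=3)
```

In the full framework, the collected constraints would then seed a weighted constrained clustering algorithm rather than being used directly as labels.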
Related papers
- Optimized Algorithms for Text Clustering with LLM-Generated Constraints [9.075693512125042]
Many researchers have incorporated background knowledge, typically in the form of must-link and cannot-link constraints, to guide the clustering process. With the recent advent of large language models (LLMs), there is growing interest in improving clustering quality. We propose a novel constraint-generation approach that reduces resource consumption by generating constraint sets rather than using traditional pairwise constraints.
arXiv Detail & Related papers (2026-01-16T09:26:37Z) - ClusterFusion: Hybrid Clustering with Embedding Guidance and LLM Adaptation [52.794544682493814]
Large language models (LLMs) provide strong contextual reasoning, yet prior work mainly uses them as auxiliary modules to refine embeddings or adjust cluster boundaries. We propose ClusterFusion, a hybrid framework that treats the LLM as the clustering core, guided by lightweight embedding methods. Experiments on three public benchmarks and two new domain-specific datasets demonstrate that ClusterFusion achieves state-of-the-art performance on standard tasks.
arXiv Detail & Related papers (2025-12-04T00:49:43Z) - ESMC: MLLM-Based Embedding Selection for Explainable Multiple Clustering [79.69917150582633]
Multi-modal large language models (MLLMs) can be leveraged to achieve user-driven clustering. Our method first discovers that MLLMs' hidden states of text tokens are strongly related to the corresponding features. We also employ a lightweight clustering head augmented with pseudo-label learning, significantly enhancing clustering accuracy.
arXiv Detail & Related papers (2025-11-30T04:36:51Z) - LLM-MemCluster: Empowering Large Language Models with Dynamic Memory for Text Clustering [52.41664454251679]
Large Language Models (LLMs) are reshaping unsupervised learning by offering an unprecedented ability to perform text clustering. Existing methods often rely on complex pipelines with external modules, sacrificing a truly end-to-end approach. We introduce LLM-MemCluster, a novel framework that reconceptualizes clustering as a fully LLM-native task.
arXiv Detail & Related papers (2025-11-19T13:22:08Z) - In-Context Clustering with Large Language Models [50.25868718329313]
ICC captures complex relationships among inputs through an attention mechanism. We show that pretrained LLMs exhibit impressive zero-shot clustering capabilities on text-encoded numeric data. Our work extends in-context learning to an unsupervised setting, showcasing the effectiveness and flexibility of LLMs for clustering.
arXiv Detail & Related papers (2025-10-09T17:07:55Z) - An Enhanced Model-based Approach for Short Text Clustering [58.60681789677676]
Short text clustering has become increasingly important with the popularity of social media like Twitter, Google+, and Facebook. Existing methods can be broadly categorized into two paradigms: topic model-based approaches and deep representation learning-based approaches. We propose a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model (GSDMM), which effectively handles the sparsity and high dimensionality of short texts. Based on several aspects of GSDMM that warrant further refinement, we propose an improved approach, GSDMM+, designed to further optimize its performance.
arXiv Detail & Related papers (2025-07-18T10:07:42Z) - LLM-Symbolic Integration for Robust Temporal Tabular Reasoning [69.27153114778748]
We introduce TempTabQA-C, a synthetic dataset designed for systematic and controlled evaluations. This structured approach allows Large Language Models (LLMs) to generate and execute SQL queries, enhancing generalization and mitigating biases.
arXiv Detail & Related papers (2025-06-06T05:14:04Z) - Large Language Model Meets Constraint Propagation [42.54350568915404]
GenCP combines predictions with Constraint Programming (CP) reasoning to generate fluent text. We improve GenCP by integrating Masked Language Models (MLMs) for domain generation, which allows constraint propagation. Our evaluation on COLLIE benchmarks demonstrates that incorporating domain preview via bidirectional calls significantly improves GenCP's performance.
arXiv Detail & Related papers (2025-05-29T21:18:12Z) - Idiosyncrasies in Large Language Models [54.26923012617675]
We unveil and study idiosyncrasies in Large Language Models (LLMs).
We find that fine-tuning existing text embedding models on LLM-generated texts yields excellent classification accuracy.
We leverage LLM as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies.
arXiv Detail & Related papers (2025-02-17T18:59:02Z) - k-LLMmeans: Summaries as Centroids for Interpretable and Scalable LLM-Based Text Clustering [0.0]
We introduce k-LLMmeans, a novel modification of the k-means clustering algorithm that utilizes LLMs to generate textual summaries as cluster centroids.
This modification preserves the properties of k-means while offering greater interpretability.
We present a case study showcasing the interpretability of evolving cluster centroids in sequential text streams.
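The centroid-as-summary loop of k-LLMmeans can be sketched with a toy stand-in. In this illustration the cluster medoid under Jaccard similarity plays the role of the LLM-written summary; that substitution, along with the Jaccard measure itself, is an assumption made purely so the example runs without a model.

```python
def jaccard(a, b):
    """Jaccard similarity between the word sets of two texts."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def k_textmeans(texts, k, iters=5):
    """Toy sketch of the k-LLMmeans idea: k-means-style iterations
    where each centroid is itself a text. A real implementation would
    ask an LLM to summarize each cluster; here the 'summary' is simply
    the cluster medoid under Jaccard similarity."""
    centroids = texts[:k]  # initialize centroids from the first k texts
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assignment step: each text joins its most similar centroid
        clusters = [[] for _ in range(k)]
        for t in texts:
            best = max(range(k), key=lambda c: jaccard(t, centroids[c]))
            clusters[best].append(t)
        # update step: medoid stands in for an LLM-written summary
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = max(members,
                                   key=lambda m: sum(jaccard(m, o) for o in members))
    return centroids, clusters

texts = ["red apple fruit", "green apple fruit", "fast red car", "slow blue car"]
centroids, clusters = k_textmeans(texts, k=2)
```

Because the centroid is always a real (or, in the paper, LLM-generated) text, the final centroids double as human-readable cluster descriptions.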
arXiv Detail & Related papers (2025-02-12T19:50:22Z) - TableTime: Reformulating Time Series Classification as Training-Free Table Understanding with Large Language Models [14.880203496664963]
Large language models (LLMs) have demonstrated their effectiveness in multivariate time series classification (MTSC). Existing methods directly encode time series as embeddings within the latent space of LLMs, trained from scratch to align with the semantic space of LLMs. We propose TableTime, which reformulates MTSC as a table understanding task.
arXiv Detail & Related papers (2024-11-24T07:02:32Z) - LLM$\times$MapReduce: Simplified Long-Sequence Processing using Large Language Models [73.13933847198395]
We propose a training-free framework for processing long texts, utilizing a divide-and-conquer strategy to achieve comprehensive document understanding.
The proposed LLM$\times$MapReduce framework splits the entire document into several chunks for LLMs to read and then aggregates the intermediate answers to produce the final output.
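The divide-and-conquer pattern described above can be sketched minimally as follows. `llm_call` is a hypothetical stand-in for a model query (it just counts a keyword so the example runs offline), and summing the partial answers replaces the framework's actual aggregation step.

```python
def llm_call(prompt):
    """Hypothetical stand-in for an LLM API call; here it simply counts
    occurrences of the word 'cluster' so the sketch is runnable."""
    return prompt.count("cluster")

def map_reduce_read(document, chunk_size=100):
    """Divide-and-conquer sketch in the spirit of LLM x MapReduce:
    split a long document into chunks short enough for the model,
    'map' an LLM call over each chunk, then 'reduce' the partial
    answers. Summation is an illustrative simplification of the
    framework's aggregation."""
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    partial = [llm_call(c) for c in chunks]  # map step
    return sum(partial)                      # reduce step
```

Note that fixed-width character chunking can split a word across a boundary; a real pipeline would chunk on sentence or paragraph boundaries.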
arXiv Detail & Related papers (2024-10-12T03:13:44Z) - Text Clustering as Classification with LLMs [6.030435811868953]
This study presents a novel framework for text clustering that effectively leverages the in-context learning capacity of Large Language Models (LLMs). Instead of fine-tuning embedders, we propose to transform text clustering into a classification task via LLMs. Our framework is experimentally shown to achieve comparable or superior performance to state-of-the-art clustering methods.
arXiv Detail & Related papers (2024-09-30T16:57:34Z) - TACT: Advancing Complex Aggregative Reasoning with Information Extraction Tools [51.576974932743596]
Large Language Models (LLMs) often do not perform well on queries that require the aggregation of information across texts.
TACT contains challenging instructions that demand stitching information scattered across one or more texts.
We construct this dataset by leveraging an existing dataset of texts and their associated tables.
We demonstrate that all contemporary LLMs perform poorly on this dataset, achieving an accuracy below 38%.
arXiv Detail & Related papers (2024-06-05T20:32:56Z) - Human-interpretable clustering of short-text using large language models [0.0]
This work shows that large language models (LLMs) can overcome the limitations of traditional clustering approaches.
The resulting clusters are found to be more distinctive and more human-interpretable.
arXiv Detail & Related papers (2024-05-12T12:55:40Z) - Context-Aware Clustering using Large Language Models [20.971691166166547]
We propose CACTUS (Context-Aware ClusTering with aUgmented triplet losS) for efficient and effective supervised clustering of entity subsets.
This paper introduces a novel approach towards clustering entity subsets using Large Language Models (LLMs) by capturing context via a scalable inter-entity attention mechanism.
arXiv Detail & Related papers (2024-05-02T03:50:31Z) - Text Clustering with Large Language Model Embeddings [0.0]
The effectiveness of text clustering largely depends on the selection of textual embeddings and clustering algorithms. Recent advancements in large language models (LLMs) have the potential to enhance this task. Findings indicate that LLM embeddings are superior at capturing subtleties in structured language.
arXiv Detail & Related papers (2024-03-22T11:08:48Z) - FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models [79.62191017182518]
FollowBench is a multi-level fine-grained constraints-following benchmark for Large Language Models.
We introduce a Multi-level mechanism that incrementally adds a single constraint to the initial instruction at each increased level.
By evaluating 13 popular LLMs on FollowBench, we highlight the weaknesses of LLMs in instruction following and point towards potential avenues for future work.
arXiv Detail & Related papers (2023-10-31T12:32:38Z) - Evaluating, Understanding, and Improving Constrained Text Generation for Large Language Models [49.74036826946397]
This study investigates constrained text generation for large language models (LLMs).
Our research mainly focuses on mainstream open-source LLMs, categorizing constraints into lexical, structural, and relation-based types.
Results illuminate LLMs' capacity and deficiency to incorporate constraints and provide insights for future developments in constrained text generation.
arXiv Detail & Related papers (2023-10-25T03:58:49Z) - Large Language Models Enable Few-Shot Clustering [88.06276828752553]
We show that large language models can amplify an expert's guidance to enable query-efficient, few-shot semi-supervised text clustering.
We find incorporating LLMs in the first two stages can routinely provide significant improvements in cluster quality.
arXiv Detail & Related papers (2023-07-02T09:17:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.