Optimized Algorithms for Text Clustering with LLM-Generated Constraints
- URL: http://arxiv.org/abs/2601.11118v1
- Date: Fri, 16 Jan 2026 09:26:37 GMT
- Title: Optimized Algorithms for Text Clustering with LLM-Generated Constraints
- Authors: Chaoqi Jia, Weihong Wu, Longkun Guo, Zhigang Lu, Chao Chen, Kok-Leong Ong,
- Abstract summary: Many researchers have incorporated background knowledge, typically in the form of must-link and cannot-link constraints, to guide the clustering process. With the recent advent of large language models (LLMs), there is growing interest in improving clustering quality. We propose a novel constraint-generation approach that reduces resource consumption by generating constraint sets rather than using traditional pairwise constraints.
- Score: 9.075693512125042
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Clustering is a fundamental tool that has garnered significant interest across a wide range of applications including text analysis. To improve clustering accuracy, many researchers have incorporated background knowledge, typically in the form of must-link and cannot-link constraints, to guide the clustering process. With the recent advent of large language models (LLMs), there is growing interest in improving clustering quality through LLM-based automatic constraint generation. In this paper, we propose a novel constraint-generation approach that reduces resource consumption by generating constraint sets rather than using traditional pairwise constraints. This approach improves both query efficiency and constraint accuracy compared to state-of-the-art methods. We further introduce a constrained clustering algorithm tailored to the characteristics of LLM-generated constraints. Our method incorporates a confidence threshold and a penalty mechanism to address potentially inaccurate constraints. We evaluate our approach on five text datasets, considering both the cost of constraint generation and the overall clustering performance. The results show that our method achieves clustering accuracy comparable to the state-of-the-art algorithms while reducing the number of LLM queries by more than 20 times.
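The abstract mentions a confidence threshold and a penalty mechanism for handling potentially inaccurate LLM-generated constraint sets, but this listing gives no further detail. The following is only a minimal sketch of how such a soft-constraint assignment step could look, not the authors' algorithm. The function `penalized_assign`, the `(members, confidence)` constraint format, and the default `threshold` and `penalty` values are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]
# Hypothetical format: (indices of the points in the set, LLM confidence in [0, 1]).
ConstraintSet = Tuple[List[int], float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def penalized_assign(
    points: List[Point],
    centers: List[Point],
    must_sets: List[ConstraintSet],
    cannot_sets: List[ConstraintSet],
    threshold: float = 0.7,  # assumed value, not from the paper
    penalty: float = 1.0,    # assumed value, not from the paper
) -> List[int]:
    """Greedy assignment: distance cost plus a soft penalty per violated constraint."""
    # Confidence threshold: ignore constraint sets the LLM was unsure about.
    must = [members for members, conf in must_sets if conf >= threshold]
    cannot = [members for members, conf in cannot_sets if conf >= threshold]

    labels = [-1] * len(points)  # -1 means "not assigned yet"
    for i, p in enumerate(points):
        best_k, best_cost = 0, float("inf")
        for k, center in enumerate(centers):
            cost = sq_dist(p, center)
            # Must-link set: pay for each assigned member that sits in another cluster.
            for members in must:
                if i in members:
                    cost += penalty * sum(
                        1 for j in members if j != i and labels[j] not in (-1, k)
                    )
            # Cannot-link set: pay for each assigned member already in cluster k.
            for members in cannot:
                if i in members:
                    cost += penalty * sum(
                        1 for j in members if j != i and labels[j] == k
                    )
            if cost < best_cost:
                best_k, best_cost = k, cost
        labels[i] = best_k
    return labels


points = [(0.0, 0.0), (0.1, 0.0), (0.6, 0.0), (1.0, 0.0)]
centers = [(0.0, 0.0), (1.0, 0.0)]
unconstrained = penalized_assign(points, centers, [], [])
with_must_link = penalized_assign(points, centers, [([0, 1, 2], 0.9)], [])
```

In this sketch a violated constraint raises the assignment cost instead of forbidding the assignment, so a single wrong constraint can be overridden when the distance evidence is strong enough. Constraint sets below the threshold are dropped entirely.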
Related papers
- LLM-MemCluster: Empowering Large Language Models with Dynamic Memory for Text Clustering [52.41664454251679]
Large Language Models (LLMs) are reshaping unsupervised learning by offering an unprecedented ability to perform text clustering. Existing methods often rely on complex pipelines with external modules, sacrificing a truly end-to-end approach. We introduce LLM-MemCluster, a novel framework that reconceptualizes clustering as a fully LLM-native task.
arXiv Detail & Related papers (2025-11-19T13:22:08Z)
- In-Context Clustering with Large Language Models [50.25868718329313]
ICC captures complex relationships among inputs through an attention mechanism. We show that pretrained LLMs exhibit impressive zero-shot clustering capabilities on text-encoded numeric data. Our work extends in-context learning to an unsupervised setting, showcasing the effectiveness and flexibility of LLMs for clustering.
arXiv Detail & Related papers (2025-10-09T17:07:55Z)
- Cequel: Cost-Effective Querying of Large Language Models for Text Clustering [15.179854529085544]
Text clustering aims to automatically partition a collection of documents into coherent groups based on their linguistic features. Recent advances in large language models (LLMs) have significantly improved this field by providing high-quality contextualized embeddings. We propose Cequel, a cost-effective framework that achieves accurate text clustering under a limited budget of LLM queries.
arXiv Detail & Related papers (2025-04-22T06:57:49Z)
- CLCR: Contrastive Learning-based Constraint Reordering for Efficient MILP Solving [34.127805466651864]
Constraint ordering plays a critical role in the efficiency of Mixed-Integer Linear Programming (MILP) solvers. This paper introduces CLCR (Contrastive Learning-based Constraint Reordering), a novel framework that systematically optimizes constraint ordering to accelerate MILP solving. Experiments on benchmarks show CLCR reduces solving time by 30% and LP iterations by 25% on average, without sacrificing solution accuracy.
arXiv Detail & Related papers (2025-03-23T05:01:43Z)
- Revisiting Self-Supervised Heterogeneous Graph Learning from Spectral Clustering Perspective [52.662463893268225]
Self-supervised heterogeneous graph learning (SHGL) has shown promising potential in diverse scenarios. Existing SHGL methods encounter two significant limitations. We introduce a novel framework enhanced by rank and dual consistency constraints.
arXiv Detail & Related papers (2024-12-01T09:33:20Z)
- HAFLQ: Heterogeneous Adaptive Federated LoRA Fine-tuned LLM with Quantization [55.972018549438964]
Federated fine-tuning of pre-trained Large Language Models (LLMs) enables task-specific adaptation across diverse datasets while preserving privacy. We propose HAFLQ (Heterogeneous Adaptive Federated Low-Rank Adaptation Fine-tuned LLM with Quantization), a novel framework for efficient and scalable fine-tuning of LLMs in heterogeneous environments. Experimental results on the text classification task demonstrate that HAFLQ reduces memory usage by 31%, lowers communication cost by 49%, improves accuracy by 50%, and achieves faster convergence compared to the baseline method.
arXiv Detail & Related papers (2024-11-10T19:59:54Z)
- Text Clustering as Classification with LLMs [9.128151647718251]
We propose a novel framework that reframes text clustering as a classification task by harnessing the in-context learning capabilities of Large Language Models. By leveraging the advanced natural language understanding and generalization capabilities of LLMs, the proposed approach enables effective clustering with minimal human intervention. Experimental results on diverse datasets demonstrate that our framework achieves comparable or superior performance to state-of-the-art embedding-based clustering techniques.
arXiv Detail & Related papers (2024-09-30T16:57:34Z)
- Context-Aware Clustering using Large Language Models [20.971691166166547]
We propose CACTUS (Context-Aware ClusTering with aUgmented triplet losS) for efficient and effective supervised clustering of entity subsets.
This paper introduces a novel approach towards clustering entity subsets using Large Language Models (LLMs) by capturing context via a scalable inter-entity attention mechanism.
arXiv Detail & Related papers (2024-05-02T03:50:31Z)
- Large Language Models Enable Few-Shot Clustering [88.06276828752553]
We show that large language models can amplify an expert's guidance to enable query-efficient, few-shot semi-supervised text clustering.
We find incorporating LLMs in the first two stages can routinely provide significant improvements in cluster quality.
arXiv Detail & Related papers (2023-07-02T09:17:11Z)
- An Exact Algorithm for Semi-supervised Minimum Sum-of-Squares Clustering [0.5801044612920815]
We present a new branch-and-bound algorithm for semi-supervised MSSC.
Background knowledge is incorporated as pairwise must-link and cannot-link constraints.
For the first time, the proposed global optimization algorithm efficiently solves real-world instances with up to 800 data points.
arXiv Detail & Related papers (2021-11-30T17:08:53Z)
- Meta Clustering Learning for Large-scale Unsupervised Person Re-identification [124.54749810371986]
We propose a "small data for big task" paradigm dubbed Meta Clustering Learning (MCL).
MCL only pseudo-labels a subset of the entire unlabeled data via clustering to save computing for the first-phase training.
Our method significantly saves computational cost while achieving a comparable or even better performance compared to prior works.
arXiv Detail & Related papers (2021-11-19T04:10:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.