Related papers: Text clustering with LLM embeddings

Text clustering with LLM embeddings

URL: http://arxiv.org/abs/2403.15112v3
Date: Thu, 30 May 2024 15:17:55 GMT
Title: Text clustering with LLM embeddings
Authors: Alina Petukhova, João P. Matos-Carvalho, Nuno Fachada,
Abstract summary: We investigate how different textual embeddings and clustering algorithms affect how text datasets are clustered. Findings reveal that LLM embeddings excel at capturing subtleties in structured language, while BERT leads the lightweight options in performance.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Text clustering is an important approach for organising the growing amount of digital content, helping to structure and find hidden patterns in uncategorised data. However, the effectiveness of text clustering heavily relies on the choice of textual embeddings and clustering algorithms. We argue that recent advances in large language models (LLMs) can potentially improve this task. In this research, we investigated how different textual embeddings -- particularly those used in LLMs -- and clustering algorithms affect how text datasets are clustered. A series of experiments were conducted to assess how embeddings influence clustering results, the role played by dimensionality reduction through summarisation, and model size adjustment. Findings reveal that LLM embeddings excel at capturing subtleties in structured language, while BERT leads the lightweight options in performance. In addition, we observe that increasing model dimensionality and employing summarization techniques do not consistently lead to improvements in clustering efficiency, suggesting that these strategies require careful analysis to use in real-life models. These results highlight a complex balance between the need for refined text representation and computational feasibility in text clustering applications. This study extends traditional text clustering frameworks by incorporating embeddings from LLMs, providing a path for improved methodologies, while informing new avenues for future research in various types of textual analysis.

Related papers

Improving Large Vision-Language Models' Understanding for Field Data [62.917026891829025]
We introduce FieldLVLM, a framework designed to improve large vision-language models' understanding of field data.<n>FieldLVLM consists of two main components: a field-aware language generation strategy and a data-compressed multimodal model tuning.<n> Experimental results on newly proposed benchmark datasets demonstrate that FieldLVLM significantly outperforms existing methods in tasks involving scientific field data.
arXiv Detail & Related papers (2025-07-24T11:28:53Z)
An Enhanced Model-based Approach for Short Text Clustering [58.60681789677676]
Short text clustering has become increasingly important with the popularity of social media like Twitter, Google+, and Facebook.<n>Existing methods can be broadly categorized into two paradigms: topic model-based approaches and deep representation learning-based approaches.<n>We propose a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model (GSDMM), which effectively handles the sparsity and high dimensionality of short texts.<n>Based on several aspects of GSDMM that warrant further refinement, we propose an improved approach, GSDMM+, designed to further optimize its performance.
arXiv Detail & Related papers (2025-07-18T10:07:42Z)
An Improved Deep Learning Model for Word Embeddings Based Clustering for Large Text Datasets [0.0]
We present an improved clustering technique for large textual datasets by leveraging fine-tuned word embeddings. We show significant improvements in clustering metrics such as silhouette score, purity, and adjusted rand index (ARI) The proposed technique will help to bridge the gap between semantic understanding and statistical robustness for large-scale text-mining tasks.
arXiv Detail & Related papers (2025-02-22T08:28:41Z)
Evaluating LLM Prompts for Data Augmentation in Multi-label Classification of Ecological Texts [1.565361244756411]
Large language models (LLMs) play a crucial role in natural language processing (NLP) tasks. This study applied prompt-based data augmentation to detect mentions of green practices in Russian social media.
arXiv Detail & Related papers (2024-11-22T12:37:41Z)
ERASMO: Leveraging Large Language Models for Enhanced Clustering Segmentation [0.0]
Cluster analysis plays a crucial role in various domains and applications, such as customer segmentation in marketing. This study introduces ERASMO, a framework designed to fine-tune a pretrained language model on textually encoded data.
arXiv Detail & Related papers (2024-10-01T00:37:16Z)
Text Clustering as Classification with LLMs [6.030435811868953]
This study presents a novel framework for text clustering that effectively leverages the in-context learning capacity of Large Language Models (LLMs) Instead of fine-tuning embedders, we propose to transform the text clustering into a classification task via LLM. Our framework has been experimentally proven to achieve comparable or superior performance to state-of-the-art clustering methods.
arXiv Detail & Related papers (2024-09-30T16:57:34Z)
Towards a Unified View of Preference Learning for Large Language Models: A Survey [88.66719962576005]
Large Language Models (LLMs) exhibit remarkably powerful capabilities. One of the crucial factors to achieve success is aligning the LLM's output with human preferences. We decompose all the strategies in preference learning into four components: model, data, feedback, and algorithm.
arXiv Detail & Related papers (2024-09-04T15:11:55Z)
Context-Aware Clustering using Large Language Models [20.971691166166547]
We propose CACTUS (Context-Aware ClusTering with aUgmented triplet losS) for efficient and effective supervised clustering of entity subsets. This paper introduces a novel approach towards clustering entity subsets using Large Language Models (LLMs) by capturing context via a scalable inter-entity attention mechanism.
arXiv Detail & Related papers (2024-05-02T03:50:31Z)
Contextualization Distillation from Large Language Model for Knowledge Graph Completion [51.126166442122546]
We introduce the Contextualization Distillation strategy, a plug-in-and-play approach compatible with both discriminative and generative KGC frameworks. Our method begins by instructing large language models to transform compact, structural triplets into context-rich segments. Comprehensive evaluations across diverse datasets and KGC techniques highlight the efficacy and adaptability of our approach.
arXiv Detail & Related papers (2024-01-28T08:56:49Z)
Incremental hierarchical text clustering methods: a review [49.32130498861987]
This study aims to analyze various hierarchical and incremental clustering techniques. The main contribution of this research is the organization and comparison of the techniques used by studies published between 2010 and 2018 that aimed to texts documents clustering.
arXiv Detail & Related papers (2023-12-12T22:27:29Z)
Large Language Models can Contrastively Refine their Generation for Better Sentence Representation Learning [57.74233319453229]
Large language models (LLMs) have emerged as a groundbreaking technology and their unparalleled text generation capabilities have sparked interest in their application to the fundamental sentence representation learning task. We propose MultiCSR, a multi-level contrastive sentence representation learning framework that decomposes the process of prompting LLMs to generate a corpus. Our experiments reveal that MultiCSR enables a less advanced LLM to surpass the performance of ChatGPT, while applying it to ChatGPT achieves better state-of-the-art results.
arXiv Detail & Related papers (2023-10-17T03:21:43Z)
CEIL: A General Classification-Enhanced Iterative Learning Framework for Text Clustering [16.08402937918212]
We propose a novel Classification-Enhanced Iterative Learning framework for short text clustering. In each iteration, we first adopt a language model to retrieve the initial text representations. After strict data filtering and aggregation processes, samples with clean category labels are retrieved, which serve as supervision information. Finally, the updated language model with improved representation ability is used to enhance clustering in the next iteration.
arXiv Detail & Related papers (2023-04-20T14:04:31Z)
Unified Multi-View Orthonormal Non-Negative Graph Based Clustering Framework [74.25493157757943]
We formulate a novel clustering model, which exploits the non-negative feature property and incorporates the multi-view information into a unified joint learning framework. We also explore, for the first time, the multi-model non-negative graph-based approach to clustering data based on deep features.
arXiv Detail & Related papers (2022-11-03T08:18:27Z)
SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation. Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z)

This list is automatically generated from the titles and abstracts of the papers in this site.