Text Clustering with LLM Embeddings
- URL: http://arxiv.org/abs/2403.15112v4
- Date: Fri, 9 Aug 2024 16:57:09 GMT
- Title: Text Clustering with LLM Embeddings
- Authors: Alina Petukhova, João P. Matos-Carvalho, Nuno Fachada
- Abstract summary: The effectiveness of text clustering largely depends on the selection of textual embeddings and clustering algorithms.
Recent advancements in large language models (LLMs) have the potential to enhance this task.
Findings indicate that LLM embeddings are superior at capturing subtleties in structured language.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text clustering is an important method for organising the increasing volume of digital content, aiding in the structuring and discovery of hidden patterns in uncategorised data. The effectiveness of text clustering largely depends on the selection of textual embeddings and clustering algorithms. This study argues that recent advancements in large language models (LLMs) have the potential to enhance this task. The research investigates how different textual embeddings, particularly those utilised in LLMs, and various clustering algorithms influence the clustering of text datasets. A series of experiments were conducted to evaluate the impact of embeddings on clustering results, the role of dimensionality reduction through summarisation, and the adjustment of model size. The findings indicate that LLM embeddings are superior at capturing subtleties in structured language. OpenAI's GPT-3.5 Turbo model yields better results in three out of five clustering metrics across most tested datasets. Most LLM embeddings show improvements in cluster purity and provide a more informative silhouette score, reflecting a refined structural understanding of text data compared to traditional methods. Among the more lightweight models, BERT demonstrates leading performance. Additionally, it was observed that increasing model dimensionality and employing summarisation techniques do not consistently enhance clustering efficiency, suggesting that these strategies require careful consideration for practical application. These results highlight a complex balance between the need for refined text representation and computational feasibility in text clustering applications. This study extends traditional text clustering frameworks by integrating embeddings from LLMs, offering improved methodologies and suggesting new avenues for future research in various types of textual analysis.
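As a concrete illustration of the pipeline the abstract describes, the sketch below embeds a small corpus, clusters the vectors with k-means, and scores the result with silhouette and cluster purity. The stand-in embedding model, the toy texts, and the purity helper are assumptions for illustration, not the paper's actual setup.

```python
# Minimal sketch of the evaluation loop: embed texts, cluster, score.
# The embedder here is an illustrative stand-in; the paper compares
# several LLM-based and traditional embedding models.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sentence_transformers import SentenceTransformer  # assumed stand-in embedder

def purity(labels_true, labels_pred):
    """Fraction of samples whose cluster's majority class matches their own label."""
    correct = 0
    for c in np.unique(labels_pred):
        members = labels_true[labels_pred == c]
        correct += np.bincount(members).max()
    return correct / len(labels_true)

texts = ["interest rates rise again", "team wins championship",
         "central bank policy shift", "star striker scores twice"]
labels_true = np.array([0, 1, 0, 1])  # toy gold topics

X = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)
labels_pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("silhouette:", silhouette_score(X, labels_pred))
print("purity:", purity(labels_true, labels_pred))
```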
Related papers
- ERASMO: Leveraging Large Language Models for Enhanced Clustering Segmentation [0.0]
Cluster analysis plays a crucial role in various domains and applications, such as customer segmentation in marketing.
This study introduces ERASMO, a framework designed to fine-tune a pretrained language model on textually encoded data.
arXiv Detail & Related papers (2024-10-01T00:37:16Z)
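ERASMO's central step, encoding structured records as text before fine-tuning a language model, can be pictured with a minimal sketch; the serialization format and field names below are assumptions, not the paper's exact scheme.

```python
# Illustrative textual encoding of tabular records, in the spirit of
# ERASMO's "textually encoded data" (the format here is an assumption).
def encode_record(record: dict) -> str:
    return ", ".join(f"{col} is {val}" for col, val in record.items())

customers = [
    {"age": 34, "plan": "premium", "monthly_spend": 80},
    {"age": 21, "plan": "basic", "monthly_spend": 15},
]
corpus = [encode_record(r) for r in customers]
# corpus[0] == "age is 34, plan is premium, monthly_spend is 80"
# These strings would then be used to fine-tune a pretrained LM, and the
# resulting embeddings fed to a standard clustering algorithm.
```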
- Text Clustering as Classification with LLMs [6.030435811868953]
This study presents a framework for text clustering that leverages the in-context learning capacity of Large Language Models (LLMs).
Instead of fine-tuning embedders, we propose to transform text clustering into a classification task via an LLM.
Our framework has been experimentally proven to achieve comparable or superior performance to state-of-the-art clustering methods.
arXiv Detail & Related papers (2024-09-30T16:57:34Z)
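The transformation this paper proposes can be pictured as a two-stage prompt workflow: propose labels from a sample of the corpus, then classify every text against them. The prompts and the `complete` placeholder below are assumptions, not the authors' implementation.

```python
# Sketch of clustering-as-classification: first ask an LLM to propose
# candidate labels, then classify each text against those labels.
def complete(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")  # placeholder

def cluster_as_classification(texts, n_labels=5, sample_size=20):
    sample = "\n".join(texts[:sample_size])
    labels_raw = complete(
        f"Propose {n_labels} short topic labels, one per line, "
        f"for these texts:\n{sample}"
    )
    labels = [l.strip() for l in labels_raw.splitlines() if l.strip()]
    assignments = {}
    for t in texts:
        assignments[t] = complete(
            f"Labels: {', '.join(labels)}\n"
            f"Text: {t}\n"
            f"Answer with the single best label."
        ).strip()
    return labels, assignments
```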
- Towards a Unified View of Preference Learning for Large Language Models: A Survey [88.66719962576005]
Large Language Models (LLMs) exhibit remarkably powerful capabilities.
One of the crucial factors to achieve success is aligning the LLM's output with human preferences.
We decompose all the strategies in preference learning into four components: model, data, feedback, and algorithm.
arXiv Detail & Related papers (2024-09-04T15:11:55Z)
- NeurCAM: Interpretable Neural Clustering via Additive Models [3.4437947384641037]
Interpretable clustering algorithms aim to group similar data points while explaining the obtained groups.
We introduce the Neural Clustering Additive Model (NeurCAM), a novel approach to the interpretable clustering problem.
Our approach significantly outperforms other interpretable clustering approaches when clustering text data.
arXiv Detail & Related papers (2024-08-23T20:32:57Z)
- Context-Aware Clustering using Large Language Models [20.971691166166547]
This paper introduces a novel approach to clustering entity subsets using Large Language Models (LLMs) by capturing context via a scalable inter-entity attention mechanism.
We propose CACTUS (Context-Aware ClusTering with aUgmented triplet losS) for efficient and effective supervised clustering of entity subsets.
arXiv Detail & Related papers (2024-05-02T03:50:31Z)
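The triplet loss named in the CACTUS acronym can be illustrated with the standard margin-based form; the augmentation the paper adds is not reproduced here, so this is a generic sketch.

```python
# Generic margin-based triplet loss over entity embeddings, a sketch of the
# kind of objective CACTUS's name refers to (its augmentation is omitted).
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    d_pos = np.linalg.norm(anchor - positive)   # same-cluster distance
    d_neg = np.linalg.norm(anchor - negative)   # cross-cluster distance
    return max(0.0, d_pos - d_neg + margin)     # zero once d_neg >> d_pos

a = np.array([0.1, 0.9]); p = np.array([0.2, 0.8]); n = np.array([0.9, 0.1])
print(triplet_loss(a, p, n))  # small: anchor is near positive, far from negative
```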
- Contextualization Distillation from Large Language Model for Knowledge Graph Completion [51.126166442122546]
We introduce the Contextualization Distillation strategy, a plug-and-play approach compatible with both discriminative and generative KGC frameworks.
Our method begins by instructing large language models to transform compact, structural triplets into context-rich segments.
Comprehensive evaluations across diverse datasets and KGC techniques highlight the efficacy and adaptability of our approach.
arXiv Detail & Related papers (2024-01-28T08:56:49Z)
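The first step of this strategy, turning a compact triplet into a context-rich segment, amounts to prompt construction; the prompt wording and the `complete` placeholder below are assumptions.

```python
# Sketch of turning a structural KG triplet into a context-rich segment by
# instructing an LLM, per the strategy described above.
def complete(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")  # placeholder

def contextualize(head: str, relation: str, tail: str) -> str:
    prompt = (
        f"Write a short, factual paragraph that explains the relationship "
        f"in the triplet ({head}, {relation}, {tail})."
    )
    return complete(prompt)

# contextualize("Marie Curie", "award_received", "Nobel Prize in Physics")
# -> a passage a KGC model can train on alongside the raw triplet.
```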
- Incremental hierarchical text clustering methods: a review [49.32130498861987]
This study aims to analyze various hierarchical and incremental clustering techniques.
The main contribution of this research is the organization and comparison of the techniques used in studies published between 2010 and 2018 that addressed the clustering of text documents.
arXiv Detail & Related papers (2023-12-12T22:27:29Z)
- Large Language Models can Contrastively Refine their Generation for Better Sentence Representation Learning [57.74233319453229]
Large language models (LLMs) have emerged as a groundbreaking technology, and their unparalleled text generation capabilities have sparked interest in their application to the fundamental task of sentence representation learning.
We propose MultiCSR, a multi-level contrastive sentence representation learning framework that decomposes the process of prompting LLMs to generate a corpus.
Our experiments reveal that MultiCSR enables a less advanced LLM to surpass the performance of ChatGPT, while applying it to ChatGPT itself achieves new state-of-the-art results.
arXiv Detail & Related papers (2023-10-17T03:21:43Z)
- CEIL: A General Classification-Enhanced Iterative Learning Framework for Text Clustering [16.08402937918212]
We propose a novel Classification-Enhanced Iterative Learning framework for short text clustering.
In each iteration, we first adopt a language model to retrieve the initial text representations.
After strict data filtering and aggregation processes, samples with clean category labels are retrieved, which serve as supervision information.
Finally, the updated language model with improved representation ability is used to enhance clustering in the next iteration.
arXiv Detail & Related papers (2023-04-20T14:04:31Z)
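CEIL's iteration (represent, cluster, filter confident samples, retrain) can be sketched as a loop; the confidence rule and the `encode`/`fine_tune` callables below are simplified assumptions, not the paper's modules.

```python
# Sketch of CEIL-style iterative learning: embed, cluster, keep only
# high-confidence samples as pseudo-labels, fine-tune the encoder, repeat.
import numpy as np
from sklearn.cluster import KMeans

def ceil_style_loop(texts, encode, fine_tune, n_clusters, n_iters=3, keep=0.5):
    for _ in range(n_iters):
        X = encode(texts)                          # current representations
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(X)
        dists = km.transform(X).min(axis=1)        # distance to own centroid
        cutoff = np.quantile(dists, keep)          # keep the closest fraction
        confident = [(t, l) for t, l, d in
                     zip(texts, km.labels_, dists) if d <= cutoff]
        encode = fine_tune(encode, confident)      # supervise the encoder
    return km.labels_
```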
- Unified Multi-View Orthonormal Non-Negative Graph Based Clustering Framework [74.25493157757943]
We formulate a novel clustering model, which exploits the non-negative feature property and incorporates the multi-view information into a unified joint learning framework.
We also explore, for the first time, the multi-model non-negative graph-based approach to clustering data based on deep features.
arXiv Detail & Related papers (2022-11-03T08:18:27Z)
- SDA: Improving Text Generation with Self Data Augmentation [88.24594090105899]
We propose to improve the standard maximum likelihood estimation (MLE) paradigm by incorporating a self-imitation-learning phase for automatic data augmentation.
Unlike most existing sentence-level augmentation strategies, our method is more general and could be easily adapted to any MLE-based training procedure.
arXiv Detail & Related papers (2021-01-02T01:15:57Z)