Related papers: Information-Theoretic Generative Clustering of Documents

Information-Theoretic Generative Clustering of Documents

URL: http://arxiv.org/abs/2412.13534v1
Date: Wed, 18 Dec 2024 06:21:21 GMT
Title: Information-Theoretic Generative Clustering of Documents
Authors: Xin Du, Kumiko Tanaka-Ishii,
Abstract summary: We present generative clustering (GC) for clustering a set of documents, $mathrmX$.<n>Because large language models (LLMs) provide probability distributions, the similarity between two documents can be rigorously defined.<n>We show GC achieves the state-of-the-art performance, outperforming any previous clustering method often by a large margin.
Score: 24.56214029342293
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present {\em generative clustering} (GC) for clustering a set of documents, $\mathrm{X}$, by using texts $\mathrm{Y}$ generated by large language models (LLMs) instead of by clustering the original documents $\mathrm{X}$. Because LLMs provide probability distributions, the similarity between two documents can be rigorously defined in an information-theoretic manner by the KL divergence. We also propose a natural, novel clustering algorithm by using importance sampling. We show that GC achieves the state-of-the-art performance, outperforming any previous clustering method often by a large margin. Furthermore, we show an application to generative document retrieval in which documents are indexed via hierarchical clustering and our method improves the retrieval accuracy.

Related papers

An Enhanced Model-based Approach for Short Text Clustering [58.60681789677676]
Short text clustering has become increasingly important with the popularity of social media like Twitter, Google+, and Facebook.<n>Existing methods can be broadly categorized into two paradigms: topic model-based approaches and deep representation learning-based approaches.<n>We propose a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model (GSDMM), which effectively handles the sparsity and high dimensionality of short texts.<n>Based on several aspects of GSDMM that warrant further refinement, we propose an improved approach, GSDMM+, designed to further optimize its performance.
arXiv Detail & Related papers (2025-07-18T10:07:42Z)
Hierarchical clustering with maximum density paths and mixture models [39.42511559155036]
Hierarchical clustering is an effective and interpretable technique for analyzing structure in data. It is particularly helpful in settings where the exact number of clusters is unknown, and provides a robust framework for exploring complex datasets. Our method addresses this limitation by leveraging a two-stage approach, first employing a Gaussian or Student's t mixture model to overcluster the data, and then hierarchically merging clusters based on the induced density landscape. This approach yields state-of-the-art clustering performance while also providing a meaningful hierarchy, making it a valuable tool for exploratory data analysis.
arXiv Detail & Related papers (2025-03-19T15:37:51Z)
k-LLMmeans: Summaries as Centroids for Interpretable and Scalable LLM-Based Text Clustering [0.0]
We introduce k-LLMmeans, a novel modification of the k-means clustering algorithm that utilizes LLMs to generate textual summaries as cluster centroids. This modification preserves the properties of k-means while offering greater interpretability. We present a case study showcasing the interpretability of evolving cluster centroids in sequential text streams.
arXiv Detail & Related papers (2025-02-12T19:50:22Z)
Graph-Convolutional Networks: Named Entity Recognition and Large Language Model Embedding in Document Clustering [9.929301228994095]
This paper proposes a novel approach that integrates Named Entity Recognition (NER) and Large Language Models (LLMs) embeddings within a graph-based framework for document clustering. The method builds a graph with nodes representing documents and edges weighted by named entity similarity, optimized using a graph-convolutional network (GCN) Experimental results indicate that our approach outperforms conventional co-occurrence-based methods in clustering, notably for documents rich in named entities.
arXiv Detail & Related papers (2024-12-19T14:03:22Z)
Generative Dense Retrieval: Memory Can Be a Burden [16.964086245755798]
Generative Retrieval (GR) autoregressively decodes relevant document identifiers given a query. Dense Retrieval (DR) is introduced to conduct fine-grained intra-cluster matching from clusters to relevant documents. DR obtains an average of 3.0 R@100 improvement on NQ dataset under multiple settings.
arXiv Detail & Related papers (2024-01-19T04:24:07Z)
Generalized Category Discovery with Clustering Assignment Consistency [56.92546133591019]
Generalized category discovery (GCD) is a recently proposed open-world task. We propose a co-training-based framework that encourages clustering consistency. Our method achieves state-of-the-art performance on three generic benchmarks and three fine-grained visual recognition datasets.
arXiv Detail & Related papers (2023-10-30T00:32:47Z)
Large Language Models Enable Few-Shot Clustering [88.06276828752553]
We show that large language models can amplify an expert's guidance to enable query-efficient, few-shot semi-supervised text clustering. We find incorporating LLMs in the first two stages can routinely provide significant improvements in cluster quality.
arXiv Detail & Related papers (2023-07-02T09:17:11Z)
Instance-Optimal Cluster Recovery in the Labeled Stochastic Block Model [79.46465138631592]
We devise an efficient algorithm that recovers clusters using the observed labels. We present Instance-Adaptive Clustering (IAC), the first algorithm whose performance matches these lower bounds both in expectation and with high probability.
arXiv Detail & Related papers (2023-06-18T08:46:06Z)
Implicit Sample Extension for Unsupervised Person Re-Identification [97.46045935897608]
Clustering sometimes mixes different true identities together or splits the same identity into two or more sub clusters. We propose an Implicit Sample Extension (OurWholeMethod) method to generate what we call support samples around the cluster boundaries. Experiments demonstrate that the proposed method is effective and achieves state-of-the-art performance for unsupervised person Re-ID.
arXiv Detail & Related papers (2022-04-14T11:41:48Z)
A Proposition-Level Clustering Approach for Multi-Document Summarization [82.4616498914049]
We revisit the clustering approach, grouping together propositions for more precise information alignment. Our method detects salient propositions, clusters them into paraphrastic clusters, and generates a representative sentence for each cluster by fusing its propositions. Our summarization method improves over the previous state-of-the-art MDS method in the DUC 2004 and TAC 2011 datasets.
arXiv Detail & Related papers (2021-12-16T10:34:22Z)
Top-Down Deep Clustering with Multi-generator GANs [0.0]
Deep clustering (DC) learns embedding spaces that are optimal for cluster analysis. We propose HC-MGAN, a new technique based on GANs with multiple generators (MGANs) Our method is inspired by the observation that each generator of a MGAN tends to generate data that correlates with a sub-region of the real data distribution.
arXiv Detail & Related papers (2021-12-06T22:53:12Z)
Vec2GC -- A Graph Based Clustering Method for Text Representations [0.0]
Vec2GC is an end-to-end pipeline to cluster terms or documents for any given text corpus. Vec2GC clustering algorithm is a density based approach, that supports hierarchical clustering as well.
arXiv Detail & Related papers (2021-04-15T12:52:30Z)
Event-Driven News Stream Clustering using Entity-Aware Contextual Embeddings [14.225334321146779]
We propose a method for online news stream clustering that is a variant of the non-parametric streaming K-means algorithm. Our model uses a combination of sparse and dense document representations, aggregates document-cluster similarity along these multiple representations. We show that the use of a suitable fine-tuning objective and external knowledge in pre-trained transformer models yields significant improvements in the effectiveness of contextual embeddings.
arXiv Detail & Related papers (2021-01-26T19:58:30Z)
Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed. We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z)

This list is automatically generated from the titles and abstracts of the papers in this site.