Information-Theoretic Generative Clustering of Documents
- URL: http://arxiv.org/abs/2412.13534v1
- Date: Wed, 18 Dec 2024 06:21:21 GMT
- Title: Information-Theoretic Generative Clustering of Documents
- Authors: Xin Du, Kumiko Tanaka-Ishii
- Abstract summary: We present generative clustering (GC) for clustering a set of documents, $\mathrm{X}$.
Because large language models (LLMs) provide probability distributions, the similarity between two documents can be rigorously defined.
We show that GC achieves state-of-the-art performance, outperforming any previous clustering method, often by a large margin.
- Score: 24.56214029342293
- Abstract: We present generative clustering (GC) for clustering a set of documents, $\mathrm{X}$, by using texts $\mathrm{Y}$ generated by large language models (LLMs) instead of by clustering the original documents $\mathrm{X}$. Because LLMs provide probability distributions, the similarity between two documents can be rigorously defined in an information-theoretic manner by the KL divergence. We also propose a natural, novel clustering algorithm that uses importance sampling. We show that GC achieves state-of-the-art performance, outperforming any previous clustering method, often by a large margin. Furthermore, we show an application to generative document retrieval, in which documents are indexed via hierarchical clustering and our method improves the retrieval accuracy.
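The abstract mentions a KL divergence between documents and an importance-sampling algorithm but gives no formulas. The following is only a minimal sketch of the general idea, not the authors' algorithm: it uses a generic self-normalized importance-sampling estimator of KL(p_i || p_j), and a toy categorical distribution stands in for the LLM probabilities p(Y | x). The function names, the toy matrix `P`, and the choice of proposal `q` (the average of the document distributions) are all assumptions made for illustration.

```python
import numpy as np

def kl_importance_sampling(logp_i, logp_j, logq):
    """Self-normalized importance-sampling estimate of KL(p_i || p_j).

    All inputs are log-probabilities evaluated on samples drawn from a
    proposal q (hypothetical setup; in practice these would come from an LLM).
    """
    log_w = logp_i - logq          # unnormalized log importance weights p_i / q
    log_w -= log_w.max()           # subtract the max for numerical stability
    w = np.exp(log_w)
    w /= w.sum()                   # self-normalize the weights
    return float(np.sum(w * (logp_i - logp_j)))

def exact_kl(p, r):
    """Exact KL divergence between two strictly positive categorical distributions."""
    return float(np.sum(p * (np.log(p) - np.log(r))))

# Toy stand-in: 3 "documents", each inducing a distribution over 5 "texts".
P = np.array([
    [0.5, 0.2, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.2, 0.3, 0.3],
    [0.2, 0.2, 0.2, 0.2, 0.2],
])
q = P.mean(axis=0)                              # assumed proposal: average over documents

rng = np.random.default_rng(0)
samples = rng.choice(5, size=50_000, p=q)       # shared samples y_m ~ q
logP = np.log(P[:, samples])                    # log p(y_m | x_i), shape (3, 50000)
logq = np.log(q[samples])

est = kl_importance_sampling(logP[0], logP[1], logq)
ref = exact_kl(P[0], P[1])
print(f"estimated KL: {est:.4f}, exact KL: {ref:.4f}")
```

Because all documents share the same samples from `q`, the same log-probability matrix can be reused for every pair, which is the practical appeal of sampling from a common proposal in this sketch.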
Related papers
- k-LLMmeans: Summaries as Centroids for Interpretable and Scalable LLM-Based Text Clustering [0.0]
We introduce k-LLMmeans, a novel modification of the k-means clustering algorithm that utilizes LLMs to generate textual summaries as cluster centroids.
This modification preserves the properties of k-means while offering greater interpretability.
We present a case study showcasing the interpretability of evolving cluster centroids in sequential text streams.
arXiv Detail & Related papers (2025-02-12T19:50:22Z)
- Graph-Convolutional Networks: Named Entity Recognition and Large Language Model Embedding in Document Clustering [9.929301228994095]
This paper proposes a novel approach that integrates Named Entity Recognition (NER) and Large Language Models (LLMs) embeddings within a graph-based framework for document clustering.
The method builds a graph with nodes representing documents and edges weighted by named entity similarity, optimized using a graph-convolutional network (GCN).
Experimental results indicate that our approach outperforms conventional co-occurrence-based methods in clustering, notably for documents rich in named entities.
arXiv Detail & Related papers (2024-12-19T14:03:22Z)
- Generative Dense Retrieval: Memory Can Be a Burden [16.964086245755798]
Generative Retrieval (GR) autoregressively decodes relevant document identifiers given a query.
Dense Retrieval (DR) is introduced to conduct fine-grained intra-cluster matching from clusters to relevant documents.
DR obtains an average improvement of 3.0 R@100 on the NQ dataset under multiple settings.
arXiv Detail & Related papers (2024-01-19T04:24:07Z)
- Generalized Category Discovery with Clustering Assignment Consistency [56.92546133591019]
Generalized category discovery (GCD) is a recently proposed open-world task.
We propose a co-training-based framework that encourages clustering consistency.
Our method achieves state-of-the-art performance on three generic benchmarks and three fine-grained visual recognition datasets.
arXiv Detail & Related papers (2023-10-30T00:32:47Z)
- Large Language Models Enable Few-Shot Clustering [88.06276828752553]
We show that large language models can amplify an expert's guidance to enable query-efficient, few-shot semi-supervised text clustering.
We find incorporating LLMs in the first two stages can routinely provide significant improvements in cluster quality.
arXiv Detail & Related papers (2023-07-02T09:17:11Z)
- Revisiting Instance-Optimal Cluster Recovery in the Labeled Stochastic Block Model [69.15976031704687]
We propose IAC (Instance-Adaptive Clustering), the first algorithm whose performance matches the instance-specific lower bounds both in expectation and with high probability.
IAC maintains an overall computational complexity of $\mathcal{O}(n\,\text{polylog}(n))$, making it scalable and practical for large-scale problems.
arXiv Detail & Related papers (2023-06-18T08:46:06Z)
- Implicit Sample Extension for Unsupervised Person Re-Identification [97.46045935897608]
Clustering sometimes mixes different true identities together or splits the same identity into two or more sub-clusters.
We propose an Implicit Sample Extension method to generate what we call support samples around the cluster boundaries.
Experiments demonstrate that the proposed method is effective and achieves state-of-the-art performance for unsupervised person Re-ID.
arXiv Detail & Related papers (2022-04-14T11:41:48Z)
- A Proposition-Level Clustering Approach for Multi-Document Summarization [82.4616498914049]
We revisit the clustering approach, grouping together propositions for more precise information alignment.
Our method detects salient propositions, clusters them into paraphrastic clusters, and generates a representative sentence for each cluster by fusing its propositions.
Our summarization method improves over the previous state-of-the-art MDS method on the DUC 2004 and TAC 2011 datasets.
arXiv Detail & Related papers (2021-12-16T10:34:22Z)
- Top-Down Deep Clustering with Multi-generator GANs [0.0]
Deep clustering (DC) learns embedding spaces that are optimal for cluster analysis.
We propose HC-MGAN, a new technique based on GANs with multiple generators (MGANs).
Our method is inspired by the observation that each generator of an MGAN tends to generate data that correlates with a sub-region of the real data distribution.
arXiv Detail & Related papers (2021-12-06T22:53:12Z)
- Vec2GC -- A Graph Based Clustering Method for Text Representations [0.0]
Vec2GC is an end-to-end pipeline to cluster terms or documents for any given text corpus.
The Vec2GC clustering algorithm is a density-based approach that also supports hierarchical clustering.
arXiv Detail & Related papers (2021-04-15T12:52:30Z)
- Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.