Related papers: Hybrid Multisource Feature Fusion for the Text Clustering

Hybrid Multisource Feature Fusion for the Text Clustering

URL: http://arxiv.org/abs/2108.10926v1
Date: Tue, 24 Aug 2021 19:32:09 GMT
Title: Hybrid Multisource Feature Fusion for the Text Clustering
Authors: Jiaxuan Chen and Shenglin Gui
Abstract summary: The text clustering technique is an unsupervised text mining method which are used to partition a huge amount of text documents into groups. We present a hybrid multisource feature fusion (HMFF) framework comprising three components, feature representation of multimodel, mutual similarity matrices and feature fusion. Our HMFF framework outperforms other recently published algorithms on 7 of 11 public benchmark datasets and has the leading performance on the rest 4 benchmark datasets as well.
Score: 5.5586788751870175
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The text clustering technique is an unsupervised text mining method which are used to partition a huge amount of text documents into groups. It has been reported that text clustering algorithms are hard to achieve better performance than supervised methods and their clustering performance is highly dependent on the picked text features. Currently, there are many different types of text feature generation algorithms, each of which extracts text features from some specific aspects, such as VSM and distributed word embedding, thus seeking a new way of obtaining features as complete as possible from the corpus is the key to enhance the clustering effects. In this paper, we present a hybrid multisource feature fusion (HMFF) framework comprising three components, feature representation of multimodel, mutual similarity matrices and feature fusion, in which we construct mutual similarity matrices for each feature source and fuse discriminative features from mutual similarity matrices by reducing dimensionality to generate HMFF features, then k-means clustering algorithm could be configured to partition input samples into groups. The experimental tests show our HMFF framework outperforms other recently published algorithms on 7 of 11 public benchmark datasets and has the leading performance on the rest 4 benchmark datasets as well. At last, we compare HMFF framework with those competitors on a COVID-19 dataset from the wild with the unknown cluster count, which shows the clusters generated by HMFF framework partition those similar samples much closer.

Related papers

Multi-Vector Index Compression in Any Modality [73.7330345057813]
Late interaction has emerged as a dominant paradigm for information retrieval in text, images, visual documents, and videos.<n>We introduce four approaches for index compression: sequence resizing, memory tokens, hierarchical pooling, and a novel attention-guided clustering (AGC)<n>AGC uses an attention-guided mechanism to identify the most semantically salient regions of a document as cluster centroids and to weight token aggregation.
arXiv Detail & Related papers (2026-02-24T18:57:33Z)
ClusterFusion: Hybrid Clustering with Embedding Guidance and LLM Adaptation [52.794544682493814]
Large language models (LLMs) provide strong contextual reasoning, yet prior work mainly uses them as auxiliary modules to refine embeddings or adjust cluster boundaries.<n>We propose ClusterFusion, a hybrid framework that treats the LLM as the clustering core, guided by lightweight embedding methods.<n> Experiments on three public benchmarks and two new domain-specific datasets demonstrate that ClusterFusion achieves state-of-the-art performance on standard tasks.
arXiv Detail & Related papers (2025-12-04T00:49:43Z)
ESMC: MLLM-Based Embedding Selection for Explainable Multiple Clustering [79.69917150582633]
Multi-modal large language models (MLLMs) can be leveraged to achieve user-driven clustering.<n>Our method first discovers that MLLMs' hidden states of text tokens are strongly related to the corresponding features.<n>We also employ a lightweight clustering head augmented with pseudo-label learning, significantly enhancing clustering accuracy.
arXiv Detail & Related papers (2025-11-30T04:36:51Z)
In-Context Clustering with Large Language Models [50.25868718329313]
ICC captures complex relationships among inputs through an attention mechanism.<n>We show that pretrained LLMs exhibit impressive zero-shot clustering capabilities on text-encoded numeric data.<n>Our work extends in-context learning to an unsupervised setting, showcasing the effectiveness and flexibility of LLMs for clustering.
arXiv Detail & Related papers (2025-10-09T17:07:55Z)
An Enhanced Model-based Approach for Short Text Clustering [58.60681789677676]
Short text clustering has become increasingly important with the popularity of social media like Twitter, Google+, and Facebook.<n>Existing methods can be broadly categorized into two paradigms: topic model-based approaches and deep representation learning-based approaches.<n>We propose a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model (GSDMM), which effectively handles the sparsity and high dimensionality of short texts.<n>Based on several aspects of GSDMM that warrant further refinement, we propose an improved approach, GSDMM+, designed to further optimize its performance.
arXiv Detail & Related papers (2025-07-18T10:07:42Z)
Cequel: Cost-Effective Querying of Large Language Models for Text Clustering [15.179854529085544]
Text clustering aims to automatically partition a collection of documents into coherent groups based on their linguistic features.<n>Recent advances in large language models (LLMs) have significantly improved this field by providing high-quality contextualized embeddings.<n>We propose Cequel, a cost-effective framework that achieves accurate text clustering under a limited budget of LLM queries.
arXiv Detail & Related papers (2025-04-22T06:57:49Z)
k-LLMmeans: Summaries as Centroids for Interpretable and Scalable LLM-Based Text Clustering [0.0]
We introduce k-LLMmeans, a novel modification of the k-means clustering algorithm that utilizes LLMs to generate textual summaries as cluster centroids. This modification preserves the properties of k-means while offering greater interpretability. We present a case study showcasing the interpretability of evolving cluster centroids in sequential text streams.
arXiv Detail & Related papers (2025-02-12T19:50:22Z)
Text Clustering with LLM Embeddings [0.0]
The effectiveness of text clustering largely depends on the selection of textual embeddings and clustering algorithms. Recent advancements in large language models (LLMs) have the potential to enhance this task. Findings indicate that LLM embeddings are superior at capturing subtleties in structured language.
arXiv Detail & Related papers (2024-03-22T11:08:48Z)
Mixture of multilayer stochastic block models for multiview clustering [0.0]
We propose an original method for aggregating multiple clustering coming from different sources of information. The identifiability of the model parameters is established and a variational Bayesian EM algorithm is proposed for the estimation of these parameters. The method is utilized to analyze global food trading networks, leading to structures of interest.
arXiv Detail & Related papers (2024-01-09T17:15:47Z)
Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic Segmentation [59.37587762543934]
This paper studies the problem of weakly open-vocabulary semantic segmentation (WOVSS) Existing methods suffer from a granularity inconsistency regarding the usage of group tokens. We propose the prototypical guidance network (PGSeg) that incorporates multi-modal regularization.
arXiv Detail & Related papers (2023-10-29T13:18:00Z)
M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition [80.21796574234287]
M$3$Net is a matching-based framework for few-shot fine-grained (FS-FG) action recognition. It incorporates textitmulti-view encoding, textitmulti-view matching, and textitmulti-view fusion to facilitate embedding encoding, similarity matching, and decision making. Explainable visualizations and experimental results demonstrate the superiority of M$3$Net in capturing fine-grained action details.
arXiv Detail & Related papers (2023-08-06T09:15:14Z)
Influence of various text embeddings on clustering performance in NLP [0.0]
A clustering approach can be used to relabel the correct star ratings by grouping the text reviews into individual groups. In this work, we explore the task of choosing different text embeddings to represent these reviews and also explore the impact the embedding choice has on the performance of various classes of clustering algorithms.
arXiv Detail & Related papers (2023-05-04T20:53:19Z)
ClusTop: An unsupervised and integrated text clustering and topic extraction framework [3.3073775218038883]
We propose an unsupervised text clustering and topic extraction framework (ClusTop) Our framework includes four components: enhanced language model training, dimensionality reduction, clustering and topic extraction. Experiments on two datasets demonstrate the effectiveness of our framework.
arXiv Detail & Related papers (2023-01-03T03:26:26Z)
A framework for benchmarking clustering algorithms [2.900810893770134]
Clustering algorithms can be tested on a variety of benchmark problems. Many research papers and graduate theses consider only a small number of datasets. We have developed a framework whose aim is to introduce a consistent methodology for testing clustering algorithms.
arXiv Detail & Related papers (2022-09-20T06:10:41Z)
Mixture Model Auto-Encoders: Deep Clustering through Dictionary Learning [72.9458277424712]
Mixture Model Auto-Encoders (MixMate) is a novel architecture that clusters data by performing inference on a generative model. We show that MixMate achieves competitive performance compared to state-of-the-art deep clustering algorithms.
arXiv Detail & Related papers (2021-10-10T02:30:31Z)
Biclustering with Alternating K-Means [5.089110111757978]
We provide a new formulation of the biclustering problem based on the idea of minimizing the empirical clustering risk. We propose a simple and novel algorithm that finds a local minimum by alternating the use of an adapted version of the k-means clustering algorithm between columns and rows. The results demonstrate that our algorithm is able to detect meaningful structures in the data and outperform other competing biclustering methods in various settings and situations.
arXiv Detail & Related papers (2020-09-09T20:15:24Z)
Unsupervised Multi-view Clustering by Squeezing Hybrid Knowledge from Cross View and Each View [68.88732535086338]
This paper proposes a new multi-view clustering method, low-rank subspace multi-view clustering based on adaptive graph regularization. Experimental results for five widely used multi-view benchmarks show that our proposed algorithm surpasses other state-of-the-art methods by a clear margin.
arXiv Detail & Related papers (2020-08-23T08:25:06Z)
New advances in enumerative biclustering algorithms with online partitioning [80.22629846165306]
This paper further extends RIn-Close_CVC, a biclustering algorithm capable of performing an efficient, complete, correct and non-redundant enumeration of maximal biclusters with constant values on columns in numerical datasets. The improved algorithm is called RIn-Close_CVC3, keeps those attractive properties of RIn-Close_CVC, and is characterized by: a drastic reduction in memory usage; a consistent gain in runtime.
arXiv Detail & Related papers (2020-03-07T14:54:26Z)
Conjoined Dirichlet Process [63.89763375457853]
We develop a novel, non-parametric probabilistic biclustering method based on Dirichlet processes to identify biclusters with strong co-occurrence in both rows and columns. We apply our method to two different applications, text mining and gene expression analysis, and demonstrate that our method improves bicluster extraction in many settings compared to existing approaches.
arXiv Detail & Related papers (2020-02-08T19:41:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.