Hybrid Multisource Feature Fusion for the Text Clustering
- URL: http://arxiv.org/abs/2108.10926v1
- Date: Tue, 24 Aug 2021 19:32:09 GMT
- Title: Hybrid Multisource Feature Fusion for the Text Clustering
- Authors: Jiaxuan Chen and Shenglin Gui
- Abstract summary: Text clustering is an unsupervised text mining technique used to partition a large number of text documents into groups.
We present a hybrid multisource feature fusion (HMFF) framework comprising three components: multimodel feature representation, mutual similarity matrices, and feature fusion.
Our HMFF framework outperforms other recently published algorithms on 7 of 11 public benchmark datasets and has leading performance on the remaining 4 benchmark datasets as well.
- Score: 5.5586788751870175
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text clustering is an unsupervised text mining technique used to
partition a large number of text documents into groups. It has been reported
that text clustering algorithms find it hard to outperform supervised methods
and that their clustering performance depends heavily on the chosen text
features. Many types of text feature generation algorithms currently exist,
each of which extracts text features from specific aspects, such as VSM and
distributed word embedding; finding a way to obtain features that are as
complete as possible from the corpus is therefore the key to improving
clustering quality. In this paper, we present a hybrid multisource feature
fusion (HMFF) framework comprising three components: multimodel feature
representation, mutual similarity matrices, and feature fusion. We construct
mutual similarity matrices for each feature source and fuse discriminative
features from these matrices by reducing dimensionality to generate HMFF
features; the k-means clustering algorithm can then be configured to partition
the input samples into groups. Experiments show that our HMFF framework
outperforms other recently published algorithms on 7 of 11 public benchmark
datasets and has leading performance on the remaining 4 benchmark datasets as
well. Finally, we compare the HMFF framework with those competitors on a
real-world COVID-19 dataset with an unknown cluster count, which shows that
the clusters generated by the HMFF framework group similar samples much more
closely.
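
The pipeline described in the abstract (several feature sources, one similarity matrix per source, fusion by dimensionality reduction, then k-means) can be illustrated with a minimal sketch. This is not the authors' implementation: the two feature sources (word-level TF-IDF and character trigram counts), the use of cosine similarity, the SVD-based reduction, and all function names such as `hmff_sketch` are assumptions chosen only to keep the example self-contained with NumPy.

```python
import numpy as np

def tfidf_features(docs):
    """Assumed source 1: word-level TF-IDF vectors (a simple VSM)."""
    tokens = [d.lower().split() for d in docs]
    vocab = sorted({w for ws in tokens for w in ws})
    index = {w: i for i, w in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for row, ws in enumerate(tokens):
        for w in ws:
            tf[row, index[w]] += 1
    df = (tf > 0).sum(axis=0)
    idf = np.log((1 + len(docs)) / (1 + df)) + 1  # smoothed IDF
    return tf * idf

def char_ngram_features(docs, n=3):
    """Assumed source 2: character n-gram counts, a stand-in for embeddings."""
    grams = [[d.lower()[i:i + n] for i in range(len(d) - n + 1)] for d in docs]
    vocab = sorted({g for gs in grams for g in gs})
    index = {g: i for i, g in enumerate(vocab)}
    counts = np.zeros((len(docs), len(vocab)))
    for row, gs in enumerate(grams):
        for g in gs:
            counts[row, index[g]] += 1
    return counts

def cosine_similarity_matrix(features):
    """Document-by-document cosine similarity for one feature source."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)
    return unit @ unit.T

def fuse(similarities, dim):
    """Concatenate the similarity matrices and reduce them with an SVD."""
    stacked = np.hstack(similarities)            # shape (n_docs, n_docs * n_sources)
    centered = stacked - stacked.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :dim] * s[:dim]                  # PCA-style projection

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm with randomly chosen initial centers."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def hmff_sketch(docs, k, dim):
    """Hypothetical end-to-end helper: features -> similarities -> fusion -> k-means."""
    sources = [tfidf_features(docs), char_ngram_features(docs)]
    similarities = [cosine_similarity_matrix(x) for x in sources]
    fused = fuse(similarities, dim)
    return kmeans(fused, k), fused

if __name__ == "__main__":
    docs = [
        "the cat sat on the mat",
        "a cat and a kitten played",
        "kittens and cats sleep a lot",
        "stock markets fell sharply today",
        "investors sold stocks as markets dropped",
        "the market rally lifted stock prices",
    ]
    labels, fused = hmff_sketch(docs, k=2, dim=2)
    print(labels)
```

Each row of the fused matrix describes a document by its similarity to every other document under each source, so the reduction step operates on relational information rather than on raw term vectors. Other feature sources, similarity measures, or reduction methods could be substituted without changing the overall shape of the sketch.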
Related papers
- Text Clustering with LLM Embeddings [0.0]
The effectiveness of text clustering largely depends on the selection of textual embeddings and clustering algorithms.
Recent advancements in large language models (LLMs) have the potential to enhance this task.
Findings indicate that LLM embeddings are superior at capturing subtleties in structured language.
arXiv Detail & Related papers (2024-03-22T11:08:48Z)
- Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic Segmentation [59.37587762543934]
This paper studies the problem of weakly open-vocabulary semantic segmentation (WOVSS).
Existing methods suffer from a granularity inconsistency regarding the usage of group tokens.
We propose the prototypical guidance network (PGSeg) that incorporates multi-modal regularization.
arXiv Detail & Related papers (2023-10-29T13:18:00Z)
- M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition [80.21796574234287]
M$^3$Net is a matching-based framework for few-shot fine-grained (FS-FG) action recognition.
It incorporates multi-view encoding, multi-view matching, and multi-view fusion to facilitate embedding encoding, similarity matching, and decision making.
Explainable visualizations and experimental results demonstrate the superiority of M$^3$Net in capturing fine-grained action details.
arXiv Detail & Related papers (2023-08-06T09:15:14Z)
- Influence of various text embeddings on clustering performance in NLP [0.0]
A clustering approach can be used to relabel the correct star ratings by grouping the text reviews into individual groups.
In this work, we explore the choice of text embeddings used to represent these reviews and the impact that choice has on the performance of various classes of clustering algorithms.
arXiv Detail & Related papers (2023-05-04T20:53:19Z)
- ClusTop: An unsupervised and integrated text clustering and topic extraction framework [3.3073775218038883]
We propose an unsupervised text clustering and topic extraction framework (ClusTop).
Our framework includes four components: enhanced language model training, dimensionality reduction, clustering and topic extraction.
Experiments on two datasets demonstrate the effectiveness of our framework.
arXiv Detail & Related papers (2023-01-03T03:26:26Z)
- A framework for benchmarking clustering algorithms [2.900810893770134]
Clustering algorithms can be tested on a variety of benchmark problems.
Many research papers and graduate theses consider only a small number of datasets.
We have developed a framework whose aim is to introduce a consistent methodology for testing clustering algorithms.
arXiv Detail & Related papers (2022-09-20T06:10:41Z)
- Mixture Model Auto-Encoders: Deep Clustering through Dictionary Learning [72.9458277424712]
Mixture Model Auto-Encoders (MixMate) is a novel architecture that clusters data by performing inference on a generative model.
We show that MixMate achieves competitive performance compared to state-of-the-art deep clustering algorithms.
arXiv Detail & Related papers (2021-10-10T02:30:31Z)
- Biclustering with Alternating K-Means [5.089110111757978]
We provide a new formulation of the biclustering problem based on the idea of minimizing the empirical clustering risk.
We propose a simple and novel algorithm that finds a local minimum by alternating the use of an adapted version of the k-means clustering algorithm between columns and rows.
The results demonstrate that our algorithm is able to detect meaningful structures in the data and outperform other competing biclustering methods in various settings and situations.
arXiv Detail & Related papers (2020-09-09T20:15:24Z)
- Unsupervised Multi-view Clustering by Squeezing Hybrid Knowledge from Cross View and Each View [68.88732535086338]
This paper proposes a new multi-view clustering method, low-rank subspace multi-view clustering based on adaptive graph regularization.
Experimental results for five widely used multi-view benchmarks show that our proposed algorithm surpasses other state-of-the-art methods by a clear margin.
arXiv Detail & Related papers (2020-08-23T08:25:06Z)
- New advances in enumerative biclustering algorithms with online partitioning [80.22629846165306]
This paper further extends RIn-Close_CVC, a biclustering algorithm capable of performing an efficient, complete, correct and non-redundant enumeration of maximal biclusters with constant values on columns in numerical datasets.
The improved algorithm, called RIn-Close_CVC3, keeps those attractive properties of RIn-Close_CVC and is characterized by a drastic reduction in memory usage and a consistent gain in runtime.
arXiv Detail & Related papers (2020-03-07T14:54:26Z)
- Conjoined Dirichlet Process [63.89763375457853]
We develop a novel, non-parametric probabilistic biclustering method based on Dirichlet processes to identify biclusters with strong co-occurrence in both rows and columns.
We apply our method to two different applications, text mining and gene expression analysis, and demonstrate that our method improves bicluster extraction in many settings compared to existing approaches.
arXiv Detail & Related papers (2020-02-08T19:41:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.