Word Embeddings and Validity Indexes in Fuzzy Clustering
- URL: http://arxiv.org/abs/2205.06802v1
- Date: Tue, 26 Apr 2022 18:08:19 GMT
- Title: Word Embeddings and Validity Indexes in Fuzzy Clustering
- Authors: Danial Toufani-Movaghar, Mohammad-Reza Feizi-Derakhshi
- Abstract summary: We perform a fuzzy-based analysis of various vector representations of words, i.e., word embeddings.
We use two popular fuzzy clustering algorithms on count-based word embeddings, with different methods and dimensionality.
We evaluate the experimental results with various clustering validity indexes to compare the accuracy of different algorithm variations with different embeddings.
- Score: 5.063728016437489
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: In the new era of internet systems and applications, detecting
distinct topics in huge amounts of text has gained a lot of attention.
Topic detection methods use a numerical representation of text, called
embeddings, to imitate human-judged semantic similarity between words. In this
study, we perform a fuzzy-based analysis of various vector representations of
words, i.e., word embeddings. We also introduce new fuzzy clustering methods
based on a hybrid implementation of fuzzy clustering with an evolutionary
algorithm named Forest Optimization. We use two popular fuzzy clustering
algorithms on count-based word embeddings, with different methods and
dimensionality. Words about COVID from a Kaggle dataset are gathered, converted
into vectors, and clustered. The results indicate that fuzzy clustering
algorithms are very sensitive to high-dimensional data, and parameter tuning
can dramatically change their performance. We evaluate the experimental results
with various clustering validity indexes to compare the accuracy of different
algorithm variations with different embeddings.
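The cluster-then-validate workflow described in the abstract can be illustrated with a minimal sketch. It assumes standard fuzzy c-means and two classical validity indexes (partition coefficient and partition entropy); the listing does not say which algorithms, embeddings, or indexes the paper actually uses, and the Forest Optimization hybrid is not reproduced. The function names, parameters, and the synthetic 2-D points standing in for word vectors are all illustrative assumptions.

```python
import numpy as np


def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Standard fuzzy c-means (an assumed stand-in, not the paper's method).

    Returns (centers, U), where U[i, j] is the membership of point i in cluster j.
    """
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)  # memberships of each point sum to 1
    for _ in range(n_iter):
        W = U ** m  # fuzzified memberships
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Euclidean distance from every point to every center
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)  # guard against division by zero
        inv = dist ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U


def partition_coefficient(U):
    """Mean squared membership; ranges from 1/c (fully fuzzy) to 1 (crisp)."""
    return float(np.mean(np.sum(U ** 2, axis=1)))


def partition_entropy(U):
    """Mean membership entropy; ranges from 0 (crisp) to log(c) (fully fuzzy)."""
    safe_U = np.fmax(U, 1e-12)  # avoid log(0)
    return float(-np.mean(np.sum(U * np.log(safe_U), axis=1)))


# Synthetic stand-in for word vectors: two well-separated 2-D blobs (hypothetical data).
data_rng = np.random.default_rng(42)
X = np.vstack([
    data_rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    data_rng.normal(loc=10.0, scale=0.5, size=(50, 2)),
])

centers, U = fuzzy_c_means(X, n_clusters=2)
pc = partition_coefficient(U)
pe = partition_entropy(U)
print("partition coefficient:", pc)
print("partition entropy:", pe)
```

A partition coefficient near 1 and a partition entropy near 0 indicate a nearly crisp partition. Repeating the run with a different fuzzifier `m` or a different input dimensionality is one simple way to probe the parameter sensitivity the abstract mentions.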
Related papers
- Lightweight Conceptual Dictionary Learning for Text Classification Using Information Compression [15.460141768587663]
We propose a lightweight supervised dictionary learning framework for text classification based on data compression and representation.
We evaluate our algorithm's information-theoretic performance using information bottleneck principles and introduce the information plane area rank (IPAR) as a novel metric to quantify the information-theoretic performance.
arXiv Detail & Related papers (2024-04-28T10:11:52Z)
- A Process for Topic Modelling Via Word Embeddings [0.0]
This work combines algorithms based on word embeddings, dimensionality reduction, and clustering.
The objective is to obtain topics from a set of unclassified texts.
arXiv Detail & Related papers (2023-10-06T15:10:35Z)
- An Efficient Algorithm for Clustered Multi-Task Compressive Sensing [60.70532293880842]
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks.
The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions.
We propose a new algorithm that substantially accelerates model inference by avoiding the need to explicitly compute these covariance matrices.
arXiv Detail & Related papers (2023-09-30T15:57:14Z)
- Rethinking k-means from manifold learning perspective [122.38667613245151]
We present a new clustering algorithm which directly detects clusters of data without mean estimation.
Specifically, we construct a distance matrix between data points using a Butterworth filter.
To fully exploit the complementary information embedded in different views, we leverage the tensor Schatten p-norm regularization.
arXiv Detail & Related papers (2023-05-12T03:01:41Z)
- Clustering Plotted Data by Image Segmentation [12.443102864446223]
Clustering algorithms are one of the main analytical methods to detect patterns in unlabeled data.
In this paper, we present a wholly different way of clustering points in 2-dimensional space, inspired by how humans cluster data.
Our approach, Visual Clustering, has several advantages over traditional clustering algorithms.
arXiv Detail & Related papers (2021-10-06T06:19:30Z)
- Estimating leverage scores via rank revealing methods and randomization [50.591267188664666]
We study algorithms for estimating the statistical leverage scores of rectangular dense or sparse matrices of arbitrary rank.
Our approach is based on combining rank revealing methods with compositions of dense and sparse randomized dimensionality reduction transforms.
arXiv Detail & Related papers (2021-05-23T19:21:55Z)
- Fuzzy clustering algorithms with distance metric learning and entropy regularization [0.0]
This paper proposes fuzzy clustering algorithms based on Euclidean, City-block and Mahalanobis distances and entropy regularization.
Several experiments on synthetic and real datasets, including its application to noisy image texture segmentation, demonstrate the usefulness of these adaptive clustering methods.
arXiv Detail & Related papers (2021-02-18T18:19:04Z)
- Similarity-based Distance for Categorical Clustering using Space Structure [5.543220407902113]
We have proposed a novel distance metric, similarity-based distance (SBD), to find the distance between objects of categorical data.
Our proposed distance (SBD) significantly outperforms the existing algorithms like k-modes or other SBC type algorithms when used on categorical datasets.
arXiv Detail & Related papers (2020-11-19T15:18:26Z)
- Accelerating Text Mining Using Domain-Specific Stop Word Lists [57.76576681191192]
We present a novel approach for the automatic extraction of domain-specific words called the hyperplane-based approach.
The hyperplane-based approach can significantly reduce text dimensionality by eliminating irrelevant features.
Results indicate that the hyperplane-based approach can reduce the dimensionality of the corpus by 90% and outperforms mutual information.
arXiv Detail & Related papers (2020-11-18T17:42:32Z)
- Anchor & Transform: Learning Sparse Embeddings for Large Vocabularies [60.285091454321055]
We design a simple and efficient embedding algorithm that learns a small set of anchor embeddings and a sparse transformation matrix.
On text classification, language modeling, and movie recommendation benchmarks, we show that ANT is particularly suitable for large vocabulary sizes.
arXiv Detail & Related papers (2020-03-18T13:07:51Z)
- Conjoined Dirichlet Process [63.89763375457853]
We develop a novel, non-parametric probabilistic biclustering method based on Dirichlet processes to identify biclusters with strong co-occurrence in both rows and columns.
We apply our method to two different applications, text mining and gene expression analysis, and demonstrate that our method improves bicluster extraction in many settings compared to existing approaches.
arXiv Detail & Related papers (2020-02-08T19:41:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.