Related papers: Word Embeddings and Validity Indexes in Fuzzy Clustering

URL: http://arxiv.org/abs/2205.06802v1
Date: Tue, 26 Apr 2022 18:08:19 GMT
Title: Word Embeddings and Validity Indexes in Fuzzy Clustering
Authors: Danial Toufani-Movaghar, Mohammad-Reza Feizi-Derakhshi
Abstract summary: We perform a fuzzy-based analysis of various vector representations of words, i.e., word embeddings. We use two popular fuzzy clustering algorithms on count-based word embeddings, with different methods and dimensionality. We evaluate the results of the experiments with various clustering validity indexes to compare the accuracy of different algorithm variations with different embeddings.
Score: 5.063728016437489
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: In the new era of internet systems and applications, the task of detecting distinct topics from huge amounts of text has gained a lot of attention. These methods use a representation of text in a numerical format, called embeddings, to imitate human-based semantic similarity between words. In this study, we perform a fuzzy-based analysis of various vector representations of words, i.e., word embeddings. We also introduce new fuzzy clustering methods based on a hybrid implementation of fuzzy clustering methods with an evolutionary algorithm named Forest Optimization. We use two popular fuzzy clustering algorithms on count-based word embeddings, with different methods and dimensionality. Words about COVID from a Kaggle dataset were gathered, converted into vectors, and clustered. The results indicate that fuzzy clustering algorithms are very sensitive to high-dimensional data, and parameter tuning can dramatically change their performance. We evaluate the results of the experiments with various clustering validity indexes to compare the accuracy of different algorithm variations with different embeddings.
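
Illustrative sketch: the abstract does not name the two fuzzy clustering algorithms or the validity indexes it uses, so the code below is only a minimal example of the general idea, not the authors' method. It assumes standard fuzzy c-means with fuzzifier m = 2 and the partition coefficient as the validity index. The toy 2-D points stand in for word vectors, and the names fuzzy_c_means, memberships and partition_coefficient are hypothetical.

```python
import math
import random


def memberships(points, centers, m):
    """Fuzzy membership of every point in every cluster (each row sums to 1)."""
    power = -2.0 / (m - 1.0)
    rows = []
    for p in points:
        # Guard against a zero distance when a point coincides with a center.
        inv = [max(math.dist(p, c), 1e-12) ** power for c in centers]
        total = sum(inv)
        rows.append([v / total for v in inv])
    return rows


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Standard fuzzy c-means (assumed here); returns (centers, memberships)."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Initial centers: a seeded random sample of the data points.
    centers = [list(p) for p in rng.sample(list(points), n_clusters)]
    u = memberships(points, centers, m)
    for _ in range(n_iter):
        # Each center is the mean of all points weighted by membership ** m.
        centers = []
        for j in range(n_clusters):
            w = [row[j] ** m for row in u]
            total = sum(w)
            centers.append(
                [sum(wi * p[d] for wi, p in zip(w, points)) / total for d in range(dim)]
            )
        new_u = memberships(points, centers, m)
        shift = max(
            abs(a - b) for r_new, r_old in zip(new_u, u) for a, b in zip(r_new, r_old)
        )
        u = new_u
        if shift < tol:  # memberships have stopped changing
            break
    return centers, u


def partition_coefficient(u):
    """Validity index in [1/c, 1]; values near 1 indicate a crisp partition."""
    return sum(v * v for row in u for v in row) / len(u)


# Toy data: two well-separated groups standing in for word vectors.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, u = fuzzy_c_means(points, n_clusters=2)
pc = partition_coefficient(u)
labels = [row.index(max(row)) for row in u]
print(labels, round(pc, 3))
```

Changing the fuzzifier m or the number of clusters and recomputing the partition coefficient gives a rough feel for the parameter sensitivity the abstract reports, although real embeddings have far more dimensions than this toy example.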

Related papers

An Enhanced Model-based Approach for Short Text Clustering [58.60681789677676]
Short text clustering has become increasingly important with the popularity of social media like Twitter, Google+, and Facebook. Existing methods can be broadly categorized into two paradigms: topic model-based approaches and deep representation learning-based approaches. We propose a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model (GSDMM), which effectively handles the sparsity and high dimensionality of short texts. Based on several aspects of GSDMM that warrant further refinement, we propose an improved approach, GSDMM+, designed to further optimize its performance.
arXiv Detail & Related papers (2025-07-18T10:07:42Z)
Evolutionary Algorithms Approach For Search Based On Semantic Document Similarity [0.0]
We develop clustering, recommendation, and question-and-answering systems using various text representation techniques. We show that Universal Sentence Encoder (USE) vectors capture the semantic similarity of text. A transfer learning technique is used to apply Genetic Algorithm (GA) and Differential Evolution (DE) algorithms to search for and retrieve the top N relevant documents.
arXiv Detail & Related papers (2025-02-20T18:56:52Z)
Lightweight Conceptual Dictionary Learning for Text Classification Using Information Compression [15.460141768587663]
We propose a lightweight supervised dictionary learning framework for text classification based on data compression and representation. We evaluate our algorithm's information-theoretic performance using information bottleneck principles and introduce the information plane area rank (IPAR) as a novel metric to quantify the information-theoretic performance.
arXiv Detail & Related papers (2024-04-28T10:11:52Z)
A Process for Topic Modelling Via Word Embeddings [0.0]
This work combines algorithms based on word embeddings, dimensionality reduction, and clustering. The objective is to obtain topics from a set of unclassified texts.
arXiv Detail & Related papers (2023-10-06T15:10:35Z)
An Efficient Algorithm for Clustered Multi-Task Compressive Sensing [60.70532293880842]
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks. The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions. We propose a new algorithm that substantially accelerates model inference by avoiding the need to explicitly compute these covariance matrices.
arXiv Detail & Related papers (2023-09-30T15:57:14Z)
Rethinking k-means from manifold learning perspective [122.38667613245151]
We present a new clustering algorithm which directly detects clusters of data without mean estimation. Specifically, we construct a distance matrix between data points using a Butterworth filter. To better exploit the complementary information embedded in different views, we leverage the tensor Schatten p-norm regularization.
arXiv Detail & Related papers (2023-05-12T03:01:41Z)
Information Retrieval in long documents: Word clustering approach for improving Semantics [0.0]
We propose an alternative to deep neural networks for semantic information retrieval for the case of long documents. This new approach exploiting clustering techniques takes into account the meaning of words in Information Retrieval systems targeting long as well as short documents.
arXiv Detail & Related papers (2023-02-20T18:32:57Z)
Clustering Plotted Data by Image Segmentation [12.443102864446223]
Clustering algorithms are one of the main analytical methods to detect patterns in unlabeled data. In this paper, we present a wholly different way of clustering points in 2-dimensional space, inspired by how humans cluster data. Our approach, Visual Clustering, has several advantages over traditional clustering algorithms.
arXiv Detail & Related papers (2021-10-06T06:19:30Z)
Estimating leverage scores via rank revealing methods and randomization [50.591267188664666]
We study algorithms for estimating the statistical leverage scores of rectangular dense or sparse matrices of arbitrary rank. Our approach is based on combining rank revealing methods with compositions of dense and sparse randomized dimensionality reduction transforms.
arXiv Detail & Related papers (2021-05-23T19:21:55Z)
Fuzzy clustering algorithms with distance metric learning and entropy regularization [0.0]
This paper proposes fuzzy clustering algorithms based on Euclidean, City-block and Mahalanobis distances and entropy regularization. Several experiments on synthetic and real datasets, including its application to noisy image texture segmentation, demonstrate the usefulness of these adaptive clustering methods.
arXiv Detail & Related papers (2021-02-18T18:19:04Z)
Similarity-based Distance for Categorical Clustering using Space Structure [5.543220407902113]
We have proposed a novel distance metric, similarity-based distance (SBD) to find the distance between objects of categorical data. Our proposed distance (SBD) significantly outperforms the existing algorithms like k-modes or other SBC type algorithms when used on categorical datasets.
arXiv Detail & Related papers (2020-11-19T15:18:26Z)
Accelerating Text Mining Using Domain-Specific Stop Word Lists [57.76576681191192]
We present a novel approach for the automatic extraction of domain-specific words called the hyperplane-based approach. The hyperplane-based approach can significantly reduce text dimensionality by eliminating irrelevant features. Results indicate that the hyperplane-based approach can reduce the dimensionality of the corpus by 90% and outperforms mutual information.
arXiv Detail & Related papers (2020-11-18T17:42:32Z)
Anchor & Transform: Learning Sparse Embeddings for Large Vocabularies [60.285091454321055]
We design a simple and efficient embedding algorithm that learns a small set of anchor embeddings and a sparse transformation matrix. On text classification, language modeling, and movie recommendation benchmarks, we show that ANT is particularly suitable for large vocabulary sizes.
arXiv Detail & Related papers (2020-03-18T13:07:51Z)
Conjoined Dirichlet Process [63.89763375457853]
We develop a novel, non-parametric probabilistic biclustering method based on Dirichlet processes to identify biclusters with strong co-occurrence in both rows and columns. We apply our method to two different applications, text mining and gene expression analysis, and demonstrate that our method improves bicluster extraction in many settings compared to existing approaches.
arXiv Detail & Related papers (2020-02-08T19:41:23Z)

This list is automatically generated from the titles and abstracts of the papers on this site.