Cube Sampled K-Prototype Clustering for Featured Data
- URL: http://arxiv.org/abs/2108.10262v1
- Date: Mon, 23 Aug 2021 15:59:14 GMT
- Title: Cube Sampled K-Prototype Clustering for Featured Data
- Authors: Seemandhar Jain, Aditya A. Shastri, Kapil Ahuja, Yann Busnel, and
Navneet Pratap Singh
- Abstract summary: Cube sampling is used because of its accurate sample selection.
Experiments on multiple datasets from the UCI repository demonstrate that the cube sampled K-Prototype algorithm gives the best clustering accuracy.
- Score: 3.232625980782303
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Clustering large amounts of data is becoming increasingly important
nowadays. Due to the large size of the data, clustering algorithms often take
too much time. Sampling the data before clustering is commonly used to reduce
this time. In this work, we propose a probabilistic sampling technique called
cube sampling along with K-Prototype clustering. Cube sampling is used because
of its accurate sample selection. K-Prototype is the most frequently used
clustering algorithm when the data is both numerical and categorical (very
common today). The novelty of this work is in obtaining the crucial
inclusion probabilities for cube sampling using Principal Component Analysis
(PCA).
Experiments on multiple datasets from the UCI repository demonstrate that
the cube sampled K-Prototype algorithm gives the best clustering accuracy among
other similarly sampled popular clustering algorithms (K-Means, Hierarchical
Clustering (HC), Spectral Clustering (SC)). When compared with unsampled
K-Prototype, K-Means, HC and SC, it still has the best accuracy, with the added
advantage of reduced computational complexity (due to the reduced data size).
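A minimal end-to-end sketch of the pipeline described above, hedged on several points: the mapping from PCA scores to inclusion probabilities is assumed, Poisson sampling stands in for full cube sampling (whose flight and landing phases are omitted), and K-Prototype comes from the third-party kmodes package.
```python
# Hedged sketch: PCA-derived inclusion probabilities, unequal-probability
# sampling, then K-Prototype clustering on the sample.
import numpy as np
from sklearn.decomposition import PCA
from kmodes.kprototypes import KPrototypes  # third-party `kmodes` package

def pca_inclusion_probs(X_num, sample_size):
    # Assumption: probabilities proportional to |first-PC score|;
    # the paper's exact PCA-to-probability mapping may differ.
    scores = np.abs(PCA(n_components=1).fit_transform(X_num)).ravel()
    return np.clip(scores / scores.sum() * sample_size, 0.0, 1.0)

def poisson_sample(probs, rng):
    # Simplification: true cube sampling also balances auxiliary totals
    # (flight + landing phases); Poisson sampling only honours the
    # per-unit inclusion probabilities.
    return np.where(rng.random(len(probs)) < probs)[0]

rng = np.random.default_rng(0)
X_num = rng.normal(size=(1000, 4))                      # numerical features
X_cat = rng.integers(0, 3, size=(1000, 2)).astype(str)  # categorical features
X = np.column_stack([X_num, X_cat])                     # mixed-type matrix

idx = poisson_sample(pca_inclusion_probs(X_num, sample_size=200), rng)
kp = KPrototypes(n_clusters=3, init='Cao', random_state=0)
labels = kp.fit_predict(X[idx], categorical=[4, 5])     # columns 4,5 categorical
```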
Related papers
- Fast Clustering of Categorical Big Data [1.8416014644193066]
The K-Modes algorithm, developed for clustering categorical data, suffers from unreliable performance in both clustering quality and clustering efficiency.
We investigate Bisecting K-Modes (BK-Modes), a successive bisecting process for finding clusters, examining how good the cluster centers produced by the bisecting process are.
Experimental results indicate good performance of BK-Modes in both clustering quality and efficiency for large datasets.
arXiv Detail & Related papers (2025-02-10T22:19:08Z)
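A sketch of the bisecting idea behind BK-Modes, assuming the third-party kmodes package and a largest-cluster-first splitting rule (the paper's actual split-selection criterion may differ):
```python
# Bisecting K-Modes sketch: repeatedly split the largest cluster in two.
import numpy as np
from kmodes.kmodes import KModes

def bk_modes(X, n_clusters, random_state=0):
    clusters = [np.arange(len(X))]               # start from one cluster
    while len(clusters) < n_clusters:
        # Assumption: split the largest cluster; quality-based choices
        # are also possible.
        i = max(range(len(clusters)), key=lambda j: len(clusters[j]))
        idx = clusters.pop(i)
        sub = KModes(n_clusters=2, init='Huang',
                     random_state=random_state).fit_predict(X[idx])
        clusters += [idx[sub == 0], idx[sub == 1]]
    labels = np.empty(len(X), dtype=int)
    for c, idx in enumerate(clusters):
        labels[idx] = c
    return labels

X = np.random.default_rng(0).integers(0, 4, size=(300, 6)).astype(str)
print(np.bincount(bk_modes(X, n_clusters=4)))    # cluster sizes
```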
- Self-Supervised Graph Embedding Clustering [70.36328717683297]
The K-means one-step dimensionality reduction clustering method has made some progress in addressing the curse of dimensionality in clustering tasks.
We propose a unified framework that integrates manifold learning with K-means, resulting in the self-supervised graph embedding framework.
arXiv Detail & Related papers (2024-09-24T08:59:51Z)
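The paper's unified framework is only summarized above; as a generic illustration of the manifold-learning-plus-K-means pattern it integrates (not the authors' method), one can embed first and cluster second:
```python
# Generic manifold embedding + k-means illustration (not the paper's model).
from sklearn.datasets import make_moons
from sklearn.manifold import SpectralEmbedding
from sklearn.cluster import KMeans

X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)
Z = SpectralEmbedding(n_components=2, n_neighbors=10).fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
```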
- Determining the Optimal Number of Clusters for Time Series Datasets with Symbolic Pattern Forest [0.0]
The problem of calculating the optimal number of clusters (say k) is one of the significant challenges for such methods.
In this work, we extended the Symbolic Pattern Forest algorithm to determine the optimal number of clusters for the time series datasets.
We tested our approach on the UCR archive datasets, and our experimental results so far showed significant improvement over the baseline.
arXiv Detail & Related papers (2023-10-01T23:33:37Z)
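The Symbolic Pattern Forest extension itself is not reproduced here; a hedged baseline for the underlying problem, choosing k by silhouette score, looks like this:
```python
# Baseline for picking the number of clusters k via silhouette score.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def best_k(X, k_range=range(2, 9)):
    scores = {k: silhouette_score(
                  X, KMeans(n_clusters=k, n_init=10,
                            random_state=0).fit_predict(X))
              for k in k_range}
    return max(scores, key=scores.get)           # highest silhouette wins

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
print(best_k(X))   # typically recovers 4 on well-separated blobs
```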
- Superclustering by finding statistically significant separable groups of optimal gaussian clusters [0.0]
The paper presents an algorithm for clustering a dataset by grouping Gaussian clusters that are optimal from the point of view of the BIC criterion into superclusters.
An essential advantage of the algorithm is its ability to predict the correct supercluster for new data based on an already trained clusterer.
arXiv Detail & Related papers (2023-09-05T23:49:46Z)
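A minimal sketch of the building block named in the title, selecting the BIC-optimal number of Gaussian clusters with scikit-learn (the grouping of those clusters into superclusters is not reproduced):
```python
# Select the number of Gaussian clusters by BIC; lower BIC is better.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.5, size=(100, 2)) for m in (0, 3, 6)])

gmms = {k: GaussianMixture(n_components=k, random_state=0).fit(X)
        for k in range(1, 8)}
best = min(gmms, key=lambda k: gmms[k].bic(X))
print("BIC-optimal number of Gaussian clusters:", best)
```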
- Data Aggregation for Hierarchical Clustering [0.3626013617212666]
BETULA is a numerically stable version of the well-known BIRCH data aggregation algorithm.
It can be used to make HAC viable on systems with constrained resources, with only small losses in clustering quality.
arXiv Detail & Related papers (2023-09-05T19:39:43Z) - Rethinking k-means from manifold learning perspective [122.38667613245151]
- Rethinking k-means from manifold learning perspective [122.38667613245151]
We present a new clustering algorithm which directly detects clusters of data without mean estimation.
Specifically, we construct the distance matrix between data points using a Butterworth filter.
To fully exploit the complementary information embedded in different views, we leverage tensor Schatten p-norm regularization.
arXiv Detail & Related papers (2023-05-12T03:01:41Z) - K-Splits: Improved K-Means Clustering Algorithm to Automatically Detect
the Number of Clusters [0.12313056815753944]
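One plausible reading of the distance-matrix construction (an assumption, since only the summary is available): pass pairwise distances through a Butterworth-style response to get an affinity, then cluster it spectrally. The cut-off d0, the filter order, and the multi-view Schatten p-norm part are not taken from the paper.
```python
# Butterworth-style affinity from pairwise distances, clustered spectrally.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
D = cdist(X, X)
d0 = np.median(D)                        # assumed cut-off distance
A = 1.0 / np.sqrt(1.0 + (D / d0) ** 4)   # order-2 Butterworth magnitude
labels = SpectralClustering(n_clusters=3, affinity='precomputed',
                            random_state=0).fit_predict(A)
```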
- K-Splits: Improved K-Means Clustering Algorithm to Automatically Detect the Number of Clusters [0.12313056815753944]
This paper introduces k-splits, an improved hierarchical algorithm based on k-means to cluster data without prior knowledge of the number of clusters.
Accuracy and speed are two main advantages of the proposed method.
arXiv Detail & Related papers (2021-10-09T23:02:57Z) - Robust Trimmed k-means [70.88503833248159]
- Robust Trimmed k-means [70.88503833248159]
We propose Robust Trimmed k-means (RTKM), which simultaneously identifies outliers and clusters points.
We show RTKM performs competitively with other methods on single-membership data with outliers and multi-membership data without outliers.
arXiv Detail & Related papers (2021-08-16T15:49:40Z) - Determinantal consensus clustering [77.34726150561087]
- Determinantal consensus clustering [77.34726150561087]
We propose the use of determinantal point processes (DPP) for the random restart of clustering algorithms.
DPPs favor diversity of the center points within subsets.
We show through simulations that, contrary to DPP, uniform random selection of centers fails both to ensure diversity and to obtain a good coverage of all data facets.
arXiv Detail & Related papers (2021-02-07T23:48:24Z) - Computationally efficient sparse clustering [67.95910835079825]
- Computationally efficient sparse clustering [67.95910835079825]
We provide a finite sample analysis of a new clustering algorithm based on PCA.
We show that it achieves the minimax optimal misclustering rate in the regime $\|\theta\| \to \infty$.
arXiv Detail & Related papers (2020-05-21T17:51:30Z) - Clustering Binary Data by Application of Combinatorial Optimization
Heuristics [52.77024349608834]
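The paper's specific estimator and its sparse variant are not reproduced; the basic PCA-then-cluster pattern it analyzes can be sketched as:
```python
# PCA projection followed by k-means in the reduced space.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=500, n_features=50, centers=3, random_state=0)
Z = PCA(n_components=2).fit_transform(X)   # denoising projection
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z)
```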
- Clustering Binary Data by Application of Combinatorial Optimization Heuristics [52.77024349608834]
We study clustering methods for binary data, first defining aggregation criteria that measure the compactness of clusters.
Five new and original methods are introduced, using neighborhood and population behavior optimization metaheuristics.
On a set of 16 data tables generated by a quasi-Monte Carlo experiment, a comparison is performed for one of the aggregation criteria using L1 dissimilarity, against hierarchical clustering and a version of k-means: partitioning around medoids (PAM).
arXiv Detail & Related papers (2020-01-06T23:33:31Z)
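The five metaheuristics are not reproduced; a sketch of the PAM baseline under L1 dissimilarity on binary data (a simple swap-until-no-improvement variant) is below.
```python
# PAM (k-medoids) with L1 dissimilarity on binary data.
import numpy as np
from scipy.spatial.distance import cdist

def pam_l1(X, k, seed=0):
    rng = np.random.default_rng(seed)
    D = cdist(X, X, 'cityblock')                 # L1 dissimilarity
    medoids = list(rng.choice(len(X), size=k, replace=False))
    cost = D[:, medoids].min(1).sum()
    improved = True
    while improved:                              # swap phase
        improved = False
        for mi in range(k):
            for cand in range(len(X)):
                if cand in medoids:
                    continue
                trial = medoids[:mi] + [cand] + medoids[mi + 1:]
                c = D[:, trial].min(1).sum()
                if c < cost:
                    medoids, cost, improved = trial, c, True
    return np.asarray(medoids), D[:, medoids].argmin(1)

X = (np.random.default_rng(0).random((200, 20)) < 0.3).astype(int)
medoids, labels = pam_l1(X, k=3)
```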
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.