K-Means Clustering With Incomplete Data with the Use of Mahalanobis Distances
- URL: http://arxiv.org/abs/2411.00870v1
- Date: Thu, 31 Oct 2024 00:05:09 GMT
- Title: K-Means Clustering With Incomplete Data with the Use of Mahalanobis Distances
- Authors: Lovis Kwasi Armah, Igor Melnykov,
- Abstract summary: We develop a unified K-means algorithm that incorporates Mahalanobis distances, instead of the traditional Euclidean distances.
We demonstrate that our algorithm consistently outperforms both standalone imputation followed by K-means.
These results hold across both the IRIS dataset and randomly generated data with elliptical clusters.
- Score: 0.0
- License:
- Abstract: Effectively applying the K-means algorithm to data with missing values remains an important research area due to its impact on applications that rely on K-means clustering. Recent studies have shown that integrating imputation directly into the K-means algorithm yields superior results compared to handling imputation separately. In this work, we extend this approach by developing a unified K-means algorithm that incorporates Mahalanobis distances, instead of the traditional Euclidean distances, which previous research has shown to perform better for clusters with elliptical shapes. We conduct extensive experiments on synthetic datasets containing up to ten elliptical clusters, as well as the IRIS dataset. Using the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI), we demonstrate that our algorithm consistently outperforms both standalone imputation followed by K-means (using either Mahalanobis or Euclidean distance) and recent K-means algorithms that integrate imputation and clustering for handling incomplete data. These results hold across both the IRIS dataset and randomly generated data with elliptical clusters.
Related papers
- Inference with K-means [0.0]
k-means is an iterative clustering algorithm that randomly assigns k centroids, then assigns data points to the nearest centroid, and updates centroids based on the mean of assigned points.
Research investigates the prediction of the last component of data points obtained from a distribution of clustered data using the online balanced k-means approach.
arXiv Detail & Related papers (2024-10-04T06:51:58Z) - Self-Supervised Graph Embedding Clustering [70.36328717683297]
K-means one-step dimensionality reduction clustering method has made some progress in addressing the curse of dimensionality in clustering tasks.
We propose a unified framework that integrates manifold learning with K-means, resulting in the self-supervised graph embedding framework.
arXiv Detail & Related papers (2024-09-24T08:59:51Z) - Rethinking k-means from manifold learning perspective [122.38667613245151]
We present a new clustering algorithm which directly detects clusters of data without mean estimation.
Specifically, we construct distance matrix between data points by Butterworth filter.
To well exploit the complementary information embedded in different views, we leverage the tensor Schatten p-norm regularization.
arXiv Detail & Related papers (2023-05-12T03:01:41Z) - Sketch-and-solve approaches to k-means clustering by semidefinite
programming [14.930208990741132]
We introduce a sketch-and-solve approach to speed up the Peng-Wei semidefinite relaxation of k-means clustering.
If the data is appropriately separated we identify the k-means optimal clustering.
Otherwise, our approach provides a high-confidence lower bound on the optimal k-means value.
arXiv Detail & Related papers (2022-11-28T19:51:30Z) - A new distance measurement and its application in K-Means Algorithm [7.168628921229442]
K-Means clustering algorithm based on Euclidean distance only pays attention to the linear distance between samples.
We propose a new distance measurement, namely, view-distance, and apply it to the K-Means algorithm.
The experimental results show that, on most datasets, the K-Means algorithm based on view-distance has a certain degree of improvement in classification accuracy and clustering effect.
arXiv Detail & Related papers (2022-06-10T16:26:22Z) - How to Use K-means for Big Data Clustering? [2.1165011830664677]
K-means is the simplest and most widely used algorithm under the Euclidean Minimum Sum-of-Squares Clustering (MSSC) model.
We propose a new parallel scheme of using K-means and K-means++ algorithms for big data clustering.
arXiv Detail & Related papers (2022-04-14T08:18:01Z) - Riemannian classification of EEG signals with missing values [67.90148548467762]
This paper proposes two strategies to handle missing data for the classification of electroencephalograms.
The first approach estimates the covariance from imputed data with the $k$-nearest neighbors algorithm; the second relies on the observed data by leveraging the observed-data likelihood within an expectation-maximization algorithm.
As results show, the proposed strategies perform better than the classification based on observed data and allow to keep a high accuracy even when the missing data ratio increases.
arXiv Detail & Related papers (2021-10-19T14:24:50Z) - Robust Trimmed k-means [70.88503833248159]
We propose Robust Trimmed k-means (RTKM) that simultaneously identifies outliers and clusters points.
We show RTKM performs competitively with other methods on single membership data with outliers and multi-membership data without outliers.
arXiv Detail & Related papers (2021-08-16T15:49:40Z) - Differentially Private Clustering: Tight Approximation Ratios [57.89473217052714]
We give efficient differentially private algorithms for basic clustering problems.
Our results imply an improved algorithm for the Sample and Aggregate privacy framework.
One of the tools used in our 1-Cluster algorithm can be employed to get a faster quantum algorithm for ClosestPair in a moderate number of dimensions.
arXiv Detail & Related papers (2020-08-18T16:22:06Z) - Decorrelated Clustering with Data Selection Bias [55.91842043124102]
We propose a novel Decorrelation regularized K-Means algorithm (DCKM) for clustering with data selection bias.
Our DCKM algorithm achieves significant performance gains, indicating the necessity of removing unexpected feature correlations induced by selection bias.
arXiv Detail & Related papers (2020-06-29T08:55:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.