On the Efficiency of K-Means Clustering: Evaluation, Optimization, and
Algorithm Selection
- URL: http://arxiv.org/abs/2010.06654v2
- Date: Tue, 27 Oct 2020 02:15:52 GMT
- Title: On the Efficiency of K-Means Clustering: Evaluation, Optimization, and
Algorithm Selection
- Authors: Sheng Wang, Yuan Sun, Zhifeng Bao
- Abstract summary: This paper presents a thorough evaluation of the existing methods that accelerate Lloyd's algorithm for fast k-means clustering.
Within UniK, we thoroughly evaluate the pros and cons of existing methods using multiple performance metrics on a number of datasets.
We derive an optimized algorithm over UniK, which effectively hybridizes multiple existing methods for more aggressive pruning.
- Score: 20.900296096958446
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a thorough evaluation of the existing methods that
accelerate Lloyd's algorithm for fast k-means clustering. To do so, we analyze
the pruning mechanisms of existing methods, and summarize their common pipeline
into a unified evaluation framework UniK. UniK embraces a class of well-known
methods and enables a fine-grained performance breakdown. Within UniK, we
thoroughly evaluate the pros and cons of existing methods using multiple
performance metrics on a number of datasets. Furthermore, we derive an
optimized algorithm over UniK, which effectively hybridizes multiple existing
methods for more aggressive pruning. To take this further, we investigate
whether the most efficient method for a given clustering task can be
automatically selected by machine learning, to benefit practitioners and
researchers.
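To ground the pruning idea the abstract refers to: below is a minimal sketch of the bound-based family of accelerators the paper evaluates, loosely following Hamerly's single-lower-bound variant. It is illustrative background under standard assumptions, not the UniK framework or the paper's hybrid algorithm.

```python
import numpy as np

def hamerly_style_kmeans(X, k, iters=50, seed=0):
    """Lloyd's algorithm with one upper and one lower bound per point.

    A point whose upper bound on the distance to its assigned center is
    below its lower bound on the distance to every other center provably
    keeps its assignment, so the full distance scan is skipped for it.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].copy()

    # Initial full assignment and exact bounds.
    d = np.linalg.norm(X[:, None] - centers[None], axis=2)
    assign = d.argmin(1)
    upper = d[np.arange(len(X)), assign]        # u(i): dist to own center
    d[np.arange(len(X)), assign] = np.inf
    lower = d.min(1)                            # l(i): dist to 2nd closest

    for _ in range(iters):
        for i in np.where(upper > lower)[0]:    # bound test failed: check
            upper[i] = np.linalg.norm(X[i] - centers[assign[i]])
            if upper[i] <= lower[i]:
                continue                        # pruned after tightening u
            di = np.linalg.norm(X[i] - centers, axis=1)   # full scan
            assign[i] = di.argmin()
            upper[i] = di[assign[i]]
            di[assign[i]] = np.inf
            lower[i] = di.min()

        # Move centers, then repair bounds by how far each center moved.
        new = np.array([X[assign == j].mean(0) if np.any(assign == j)
                        else centers[j] for j in range(k)])
        shift = np.linalg.norm(new - centers, axis=1)
        centers = new
        upper += shift[assign]                  # own center may have moved away
        lower -= shift.max()                    # other centers may have approached
        if shift.max() == 0:
            break
    return assign, centers
```

The cheap test `upper > lower` is what lets most points skip the O(k) distance scan in each iteration; the optimized algorithm the paper derives hybridizes several such pruning mechanisms for more aggressive pruning.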
Related papers
- Fuzzy K-Means Clustering without Cluster Centroids [21.256564324236333]
Fuzzy K-Means clustering is a critical technique in unsupervised data analysis.
This paper proposes a novel Fuzzy K-Means clustering algorithm that entirely eliminates the reliance on cluster centroids.
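For contrast, here is a minimal sketch of the classical centroid-based Fuzzy K-Means (fuzzy c-means) update that this line of work sets out to eliminate; the paper's centroid-free formulation is not reproduced here.

```python
import numpy as np

def fuzzy_kmeans(X, k, m=2.0, iters=100, seed=0, eps=1e-9):
    """Classic fuzzy c-means: alternate soft memberships and centroids."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), k))
    U /= U.sum(1, keepdims=True)                 # rows: soft memberships
    for _ in range(iters):
        W = U ** m                               # fuzzified memberships
        C = (W.T @ X) / (W.sum(0)[:, None] + eps)   # weighted centroids
        D = np.linalg.norm(X[:, None] - C[None], axis=2) + eps
        # Membership update: u_ij = 1 / sum_l (d_ij / d_il)^(2/(m-1))
        U = 1.0 / ((D[:, :, None] / D[:, None, :]) ** (2 / (m - 1))).sum(2)
    return U, C
```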
arXiv Detail & Related papers (2024-04-07T12:25:03Z)
- A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data.
We develop a novel factor-3 approximation algorithm to compute subsets based on the weighted sum of the k-center and uncertainty sampling objective functions.
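The summary does not spell out the factor-3 algorithm itself; purely as an illustration of the named ingredients, here is the classical greedy farthest-point heuristic for k-center (a 2-approximation on its own) with a hypothetical mix-in of a per-point uncertainty score. The `alpha` weight and the `uncertainty` array are assumptions for illustration, not the paper's objective.

```python
import numpy as np

def weighted_k_center(X, k, uncertainty, alpha=0.5, seed=0):
    """Greedy farthest-point selection, scored by a blend of coverage
    distance and a per-point uncertainty value (assumed to be on
    comparable scales for this sketch)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(X)))]
    # dist[i]: distance from point i to its nearest chosen center so far
    dist = np.linalg.norm(X - X[chosen[0]], axis=1)
    for _ in range(k - 1):
        score = alpha * dist + (1 - alpha) * uncertainty
        nxt = int(score.argmax())               # farthest / most uncertain
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return chosen
```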
arXiv Detail & Related papers (2023-12-17T04:41:07Z)
- Unified Functional Hashing in Automatic Machine Learning [58.77232199682271]
We show that large efficiency gains can be obtained by employing a fast unified functional hash.
Our hash is "functional" in that it identifies equivalent candidates even if they were represented or coded differently.
We show dramatic improvements on multiple AutoML domains, including neural architecture search and algorithm discovery.
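One plausible reading of a 'functional' hash is to fingerprint a candidate by its outputs on a fixed batch of probe inputs, so that differently-coded but equivalent candidates collide. A minimal sketch under that assumption (the paper's actual construction may differ):

```python
import hashlib
import numpy as np

def functional_hash(fn, probes, decimals=6):
    """Hash a candidate by what it computes, not how it is written;
    rounding guards against floating-point noise."""
    outs = np.round(np.array([fn(x) for x in probes]), decimals)
    return hashlib.sha256(outs.tobytes()).hexdigest()

# Equivalent programs collide, enabling cache-based deduplication.
probes = np.linspace(-1.0, 1.0, 32)
f = lambda x: (x + 1) ** 2
g = lambda x: x * x + 2 * x + 1          # same function, different code
assert functional_hash(f, probes) == functional_hash(g, probes)
```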
arXiv Detail & Related papers (2023-02-10T18:50:37Z)
- Rethinking Clustering-Based Pseudo-Labeling for Unsupervised Meta-Learning [146.11600461034746]
CACTUs, a method for unsupervised meta-learning, is a clustering-based approach with pseudo-labeling.
This approach is model-agnostic and can be combined with supervised algorithms to learn from unlabeled data.
We prove that the core reason for its limitations is the lack of a clustering-friendly property in the embedding space.
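As background, a minimal sketch of the generic cluster-then-pseudo-label recipe that CACTUs-style methods use, with scikit-learn k-means standing in for the embedding-space clustering; the embedding model and the meta-learner itself are omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

def pseudo_label_task(embeddings, k, n_way=5, k_shot=1, seed=0):
    """Cluster unlabeled embeddings, treat cluster ids as pseudo-labels,
    then sample one n-way k-shot task from the pseudo-classes."""
    labels = KMeans(n_clusters=k, n_init=10,
                    random_state=seed).fit_predict(embeddings)
    rng = np.random.default_rng(seed)
    classes = rng.choice(k, n_way, replace=False)
    idx = [rng.choice(np.where(labels == c)[0], k_shot,
                      replace=np.sum(labels == c) < k_shot)
           for c in classes]
    return np.concatenate(idx), np.repeat(np.arange(n_way), k_shot)
```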
arXiv Detail & Related papers (2022-09-27T19:04:36Z)
- A One-shot Framework for Distributed Clustered Learning in Heterogeneous Environments [54.172993875654015]
The paper proposes a family of communication-efficient methods for distributed learning in heterogeneous environments.
A one-shot approach, based on local computations at the users and a clustering-based aggregation step at the server, is shown to provide strong learning guarantees.
For strongly convex problems, it is shown that, as long as the number of data points per user is above a threshold, the proposed approach achieves order-optimal mean-squared error rates in terms of the sample size.
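A rough sketch of the one-shot pattern as summarized: each user communicates a single local estimate, and the server clusters those estimates and averages within clusters. The local-mean estimator and the cluster count are placeholder assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def one_shot_clustered_learning(user_data, n_clusters, seed=0):
    """Single communication round: users send local means; the server
    clusters them and averages within each cluster, yielding one model
    per (estimated) user group."""
    local = np.array([np.asarray(d).mean(axis=0) for d in user_data])
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(local)
    models = np.array([local[km.labels_ == j].mean(axis=0)
                       for j in range(n_clusters)])
    return models, km.labels_
```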
arXiv Detail & Related papers (2022-09-22T09:04:10Z)
- A sampling-based approach for efficient clustering in large datasets [0.8952229340927184]
We propose a simple and efficient clustering method for high-dimensional data with a large number of clusters.
Our method is substantially more efficient than k-means, as it does not require an all-to-all comparison of data points and clusters.
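A minimal sketch of the generic sample-then-assign idea: fit centers on a uniform subsample and label the full dataset in a single pass, so the expensive iterations never touch all points. The paper's actual sampling scheme is not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

def sampled_kmeans(X, k, sample_size=10_000, seed=0):
    """Cluster a uniform subsample, then assign every point to its
    nearest learned center in one pass over the data."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), min(sample_size, len(X)), replace=False)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[idx])
    return km.predict(X), km.cluster_centers_
```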
arXiv Detail & Related papers (2021-12-29T19:15:20Z)
- Robust Trimmed k-means [70.88503833248159]
We propose Robust Trimmed k-means (RTKM), which simultaneously identifies outliers and clusters points.
We show RTKM performs competitively with other methods on single membership data with outliers and multi-membership data without outliers.
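For context, a minimal sketch of the classical trimming idea RTKM builds on: each iteration excludes the worst-fitting fraction of points from the centroid update. RTKM's multi-membership formulation is not shown.

```python
import numpy as np

def trimmed_kmeans(X, k, trim=0.1, iters=50, seed=0):
    """Trimmed k-means: the `trim` fraction of points farthest from their
    centers is treated as outliers and ignored when updating centroids."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].copy()
    n_keep = int(len(X) * (1 - trim))
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        assign, mind = d.argmin(1), d.min(1)
        keep = np.argsort(mind)[:n_keep]        # drop the farthest points
        for j in range(k):
            pts = X[keep][assign[keep] == j]
            if len(pts):
                centers[j] = pts.mean(0)
    inlier = np.zeros(len(X), dtype=bool)
    inlier[keep] = True
    return assign, centers, inlier
```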
arXiv Detail & Related papers (2021-08-16T15:49:40Z)
- A bi-level encoding scheme for the clustered shortest-path tree problem in multifactorial optimization [1.471992435706872]
The Clustered Shortest-Path Tree Problem (CluSPT) plays an important role in various types of real-life optimization problems.
Recently, several Multifactorial Evolutionary Algorithms (MFEAs) have been introduced to deal with the CluSPT.
This paper describes an MFEA-based approach to solve the CluSPT.
arXiv Detail & Related papers (2021-02-12T13:36:07Z)
- A self-adaptive and robust fission clustering algorithm via heat diffusion and maximal turning angle [4.246818236277977]
A novel and fast clustering algorithm, the fission clustering algorithm, has been proposed in recent years.
We propose a robust fission clustering (RFC) algorithm and a self-adaptive noise identification method.
arXiv Detail & Related papers (2021-02-07T13:16:47Z)
- Differentially Private Clustering: Tight Approximation Ratios [57.89473217052714]
We give efficient differentially private algorithms for basic clustering problems.
Our results imply an improved algorithm for the Sample and Aggregate privacy framework.
One of the tools used in our 1-Cluster algorithm can be employed to get a faster quantum algorithm for ClosestPair in a moderate number of dimensions.
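The summary does not give the construction; as textbook background, one standard way to make a single centroid update epsilon-differentially private is the Laplace mechanism on per-cluster sums and counts, assuming bounded points. A simplified sketch (the paper's tight-approximation algorithms are considerably more sophisticated):

```python
import numpy as np

def dp_centroid_update(X, assign, k, epsilon, radius=1.0, seed=0):
    """One epsilon-DP centroid update. Assumes ||x||_1 <= radius, so a
    single changed point shifts the cluster sums by at most 2*radius in
    L1 and the counts by at most 2; epsilon is split between the two."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    sums = np.array([X[assign == j].sum(0) for j in range(k)])
    counts = np.array([(assign == j).sum() for j in range(k)], dtype=float)
    sums += rng.laplace(scale=2 * radius / (epsilon / 2), size=(k, d))
    counts += rng.laplace(scale=2 / (epsilon / 2), size=k)
    # Post-processing (clamping) does not consume any privacy budget.
    return sums / np.maximum(counts, 1.0)[:, None]
```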
arXiv Detail & Related papers (2020-08-18T16:22:06Z)
- A semi-supervised sparse K-Means algorithm [3.04585143845864]
An unsupervised sparse clustering method can be employed to detect the subgroup of features necessary for clustering.
A semi-supervised method can use the labelled data to create constraints and enhance the clustering solution.
We show that the algorithm maintains the high performance of other semi-supervised algorithms and, in addition, preserves the ability to distinguish informative from uninformative features.
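A minimal sketch of the unsupervised sparse-k-means ingredient (Witten-Tibshirani-style feature weights, simplified here with a fixed threshold in place of the L1-constrained binary search); the semi-supervised constraints built from labelled data are not shown.

```python
import numpy as np
from sklearn.cluster import KMeans

def sparse_kmeans(X, k, iters=10, seed=0):
    """Alternate (1) k-means on feature-weighted data and (2) feature
    weights from soft-thresholded between-cluster sum of squares (BCSS),
    so uninformative features receive weight near zero."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    w = np.ones(d) / np.sqrt(d)
    for _ in range(iters):
        labels = KMeans(n_clusters=k, n_init=10,
                        random_state=seed).fit_predict(X * np.sqrt(w))
        tss = ((X - X.mean(0)) ** 2).sum(0)      # total SS per feature
        wss = np.zeros(d)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                wss += ((pts - pts.mean(0)) ** 2).sum(0)
        bcss = np.maximum(tss - wss, 0.0)
        u = np.maximum(bcss - 0.2 * bcss.max(), 0.0)  # illustrative threshold
        w = u / (np.linalg.norm(u) + 1e-12)
    return labels, w
```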
arXiv Detail & Related papers (2020-03-16T02:05:23Z)