Fast and Eager k-Medoids Clustering: O(k) Runtime Improvement of the
PAM, CLARA, and CLARANS Algorithms
- URL: http://arxiv.org/abs/2008.05171v2
- Date: Tue, 1 Jun 2021 08:16:45 GMT
- Title: Fast and Eager k-Medoids Clustering: O(k) Runtime Improvement of the
PAM, CLARA, and CLARANS Algorithms
- Authors: Erich Schubert and Peter J. Rousseeuw
- Abstract summary: Partitioning Around Medoids (PAM) is an algorithm for clustering non-Euclidean data.
We propose modifications to PAM that achieve an O(k)-fold speedup in the second ("SWAP") phase of the algorithm.
In experiments on real data with k=100 and k=200, we observed speedups of 458x and 1191x, respectively, compared to the original PAM SWAP algorithm.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Clustering non-Euclidean data is difficult, and one of the most used
algorithms besides hierarchical clustering is the popular algorithm
Partitioning Around Medoids (PAM), also simply referred to as k-medoids
clustering. In Euclidean geometry the mean (as used in k-means) is a good
estimator for the cluster center, but no such estimator exists for arbitrary
dissimilarities. PAM uses the medoid instead, the object with the smallest
dissimilarity to all others in the cluster. This notion of centrality can be
used with any (dis-)similarity, and thus is of high relevance to many domains
and applications. A key issue with PAM is its high run time cost. We propose
modifications to the PAM algorithm that achieve an O(k)-fold speedup in the
second ("SWAP") phase of the algorithm, but will still find the same results as
the original PAM algorithm. If we relax the choice of swaps performed (while
retaining comparable quality), we can further accelerate the algorithm by
eagerly performing additional swaps in each iteration. With the substantially
faster SWAP, we can now explore faster initialization strategies, because (i)
the classic ("BUILD") initialization now becomes the bottleneck, and (ii) our
swap is fast enough to compensate for worse starting conditions. We also show
how the CLARA and CLARANS algorithms benefit from the proposed modifications.
While we do not study the parallelization of our approach in this work, it can
easily be combined with earlier approaches to use PAM and CLARA on big data
(some of which use PAM as a subroutine, hence can immediately benefit from
these improvements), where the performance with high k becomes increasingly
important. In experiments on real data with k=100 and k=200, we observed
speedups of 458x and 1191x, respectively, compared to the original PAM SWAP
algorithm, making PAM applicable to larger data sets, and in particular to
higher k.
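
To make the medoid notion and the SWAP phase concrete, below is a minimal,
unoptimized Python sketch of classic PAM-style k-medoids on a precomputed
dissimilarity matrix. It illustrates the baseline algorithm the paper
accelerates, not the proposed FastPAM variants; all function and variable
names here are ours.

```python
import numpy as np

def medoid(D, members):
    """Medoid of a cluster: the member with the smallest total
    dissimilarity to all other members."""
    sub = D[np.ix_(members, members)]
    return members[int(np.argmin(sub.sum(axis=1)))]

def pam_swap(D, medoids, max_iter=100):
    """Naive PAM SWAP phase: greedily perform the best single
    (medoid, non-medoid) exchange until no exchange lowers the total
    deviation. One pass costs O(k (n-k) n) distance computations; the
    paper removes the O(k) factor by sharing work across all medoids
    via each point's nearest and second-nearest medoid, and its "eager"
    variant performs several improving swaps per pass instead of only
    the single best one."""
    medoids = list(medoids)
    n = D.shape[0]
    for _ in range(max_iter):
        cost = D[:, medoids].min(axis=1).sum()  # total deviation
        best_delta, best_i, best_x = 0.0, None, None
        for i in range(len(medoids)):
            for x in range(n):
                if x in medoids:
                    continue
                cand = medoids.copy()
                cand[i] = x
                delta = D[:, cand].min(axis=1).sum() - cost
                if delta < best_delta:
                    best_delta, best_i, best_x = delta, i, x
        if best_i is None:  # no improving swap left: converged
            break
        medoids[best_i] = best_x
    return medoids
```

CLARA, mentioned above, runs this procedure on random subsamples and keeps
the medoids that score best on the full data set, which is why a faster SWAP
benefits it directly.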
Related papers
- An Efficient Algorithm for Clustered Multi-Task Compressive Sensing [60.70532293880842]
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks.
The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions.
We propose a new algorithm that substantially accelerates model inference by avoiding the need to explicitly compute these covariance matrices.
arXiv Detail & Related papers (2023-09-30T15:57:14Z)
- Rethinking k-means from manifold learning perspective [122.38667613245151]
We present a new clustering algorithm that directly detects clusters of data without mean estimation.
Specifically, we construct the distance matrix between data points with a Butterworth filter.
To exploit the complementary information embedded in different views, we leverage tensor Schatten p-norm regularization.
arXiv Detail & Related papers (2023-05-12T03:01:41Z)
- GBMST: An Efficient Minimum Spanning Tree Clustering Based on Granular-Ball Computing [78.92205914422925]
We propose a clustering algorithm that combines multi-granularity granular-balls and a minimum spanning tree (MST).
We construct coarse-grained granular-balls, and then use the granular-balls and the MST to implement a clustering method based on "large-scale priority".
Experimental results on several data sets demonstrate the power of the algorithm.
arXiv Detail & Related papers (2023-03-02T09:04:35Z)
- Regularization and Optimization in Model-Based Clustering [4.096453902709292]
k-means algorithm variants essentially fit a mixture of identical spherical Gaussians to data that vastly deviates from such a distribution.
We develop more effective optimization algorithms for general GMMs, and we combine these algorithms with regularization strategies that avoid overfitting.
These results shed new light on the current status quo between GMM and k-means methods and suggest the more frequent use of general GMMs for data exploration.
arXiv Detail & Related papers (2023-02-05T18:22:29Z)
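
The entry above observes that k-means variants effectively fit a mixture of
identical spherical Gaussians. A small scikit-learn comparison can make that
relationship tangible; this sketch is our illustration, not code from the
cited paper (note that scikit-learn's "spherical" option still allows a
different variance per component, so it only approximates the k-means
assumption):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Three well-separated spherical blobs in 2-d.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in (0.0, 5.0, 10.0)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
gmm = GaussianMixture(n_components=3, covariance_type="spherical",
                      random_state=0).fit(X)

# On data like this the two labelings typically agree up to label
# permutation; general GMMs start to differ once clusters are
# elongated or have unequal spread.
print(km.labels_[:10])
print(gmm.predict(X)[:10])
```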
- A Review and Evaluation of Elastic Distance Functions for Time Series Clustering [0.0]
We describe nine commonly used elastic distance measures and compare their performance with k-means and k-medoids clustering.
The most popular technique, dynamic time warping (DTW), performs worse than Euclidean distance with k-means and, even when tuned, is no better.
Our conclusion is to recommend MSM with k-medoids as the benchmark algorithm for clustering time series with elastic distance measures.
arXiv Detail & Related papers (2022-05-30T15:32:55Z)
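
Since the entry above concerns elastic distance measures, a textbook
dynamic-programming DTW distance may help make "elastic" concrete. This is
the standard O(nm) formulation, not the implementation evaluated in that
paper:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic time warping between two 1-d series, using
    squared pointwise differences and an unconstrained warping path."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return float(np.sqrt(cost[n, m]))

# Example: a lag between otherwise similar series is "warped away".
t = np.linspace(0, 2 * np.pi, 50)
print(dtw_distance(np.sin(t), np.sin(t + 0.5)))
```

A distance like this plugs directly into k-medoids, which only requires a
pairwise dissimilarity matrix rather than coordinates.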
- K-Splits: Improved K-Means Clustering Algorithm to Automatically Detect the Number of Clusters [0.12313056815753944]
This paper introduces k-splits, an improved hierarchical algorithm based on k-means to cluster data without prior knowledge of the number of clusters.
Accuracy and speed are two main advantages of the proposed method.
arXiv Detail & Related papers (2021-10-09T23:02:57Z)
- Determinantal consensus clustering [77.34726150561087]
We propose the use of determinantal point processes (DPP) for the random restarts of clustering algorithms.
DPPs favor diversity of the center points within subsets.
We show through simulations that, contrary to DPP, uniform random sampling of centers fails both to ensure diversity and to obtain a good coverage of all data facets.
arXiv Detail & Related papers (2021-02-07T23:48:24Z)
- (k, l)-Medians Clustering of Trajectories Using Continuous Dynamic Time Warping [57.316437798033974]
In this work we consider the problem of center-based clustering of trajectories.
We propose the use of a continuous version of DTW as the distance measure, which we call continuous dynamic time warping (CDTW).
We show a practical way to compute a center from a set of trajectories and subsequently iteratively improve it.
arXiv Detail & Related papers (2020-12-01T13:17:27Z)
- Differentially Private Clustering: Tight Approximation Ratios [57.89473217052714]
We give efficient differentially private algorithms for basic clustering problems.
Our results imply an improved algorithm for the Sample and Aggregate privacy framework.
One of the tools used in our 1-Cluster algorithm can be employed to get a faster quantum algorithm for ClosestPair in a moderate number of dimensions.
arXiv Detail & Related papers (2020-08-18T16:22:06Z)
- An Efficient Smoothing Proximal Gradient Algorithm for Convex Clustering [2.5182813818441945]
The recently introduced convex clustering approach formulates clustering as a convex optimization problem.
State-of-the-art convex clustering algorithms require large amounts of computation and memory.
In this paper, we develop a very efficient smoothing proximal gradient algorithm (Sproga) for convex clustering.
arXiv Detail & Related papers (2020-06-22T20:02:59Z)
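
For context on the convex clustering entry above: in the standard
formulation (in the style of Chi and Lange), every point x_i gets a center
surrogate u_i, and pairwise differences between surrogates are penalized so
that centers fuse as the regularization grows. Below is a minimal sketch of
that objective, in our notation and not necessarily the exact variant the
cited paper optimizes:

```python
import numpy as np

def convex_clustering_objective(X, U, gamma, W):
    """0.5 * sum_i ||x_i - u_i||^2 + gamma * sum_{i<j} w_ij ||u_i - u_j||.
    X, U: (n, d) arrays of points and their center surrogates;
    W: (n, n) nonnegative weights; gamma: fusion strength. As gamma
    grows, rows of the optimal U fuse, revealing cluster structure."""
    fit = 0.5 * np.sum((X - U) ** 2)
    n = X.shape[0]
    fusion = sum(W[i, j] * np.linalg.norm(U[i] - U[j])
                 for i in range(n) for j in range(i + 1, n))
    return fit + gamma * fusion
```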
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.