Fast and Eager k-Medoids Clustering: O(k) Runtime Improvement of the
PAM, CLARA, and CLARANS Algorithms
- URL: http://arxiv.org/abs/2008.05171v2
- Date: Tue, 1 Jun 2021 08:16:45 GMT
- Title: Fast and Eager k-Medoids Clustering: O(k) Runtime Improvement of the
PAM, CLARA, and CLARANS Algorithms
- Authors: Erich Schubert and Peter J. Rousseeuw
- Abstract summary: Partitioning Around Medoids (PAM) is an algorithm for clustering non-Euclidean data.
We propose modifications to PAM that achieve an O(k)-fold speedup in the second ("SWAP") phase of the algorithm.
In experiments on real data with k=100 and k=200, we observed speedups of 458x and 1191x, respectively, compared to the original PAM SWAP algorithm.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Clustering non-Euclidean data is difficult, and one of the most used
algorithms besides hierarchical clustering is the popular algorithm
Partitioning Around Medoids (PAM), also simply referred to as k-medoids
clustering. In Euclidean geometry the mean (as used in k-means) is a good
estimator for the cluster center, but no such estimator exists for arbitrary
dissimilarities. PAM uses the medoid instead, the object with the smallest
dissimilarity to all others in the cluster. This notion of centrality can be
used with any (dis-)similarity, and thus is of high relevance to many domains
and applications. A key issue with PAM is its high run time cost. We propose
modifications to the PAM algorithm that achieve an O(k)-fold speedup in the
second ("SWAP") phase of the algorithm, but will still find the same results as
the original PAM algorithm. If we relax the choice of swaps performed (while
retaining comparable quality), we can further accelerate the algorithm by
eagerly performing additional swaps in each iteration. With the substantially
faster SWAP, we can now explore faster initialization strategies, because (i)
the classic ("BUILD") initialization now becomes the bottleneck, and (ii) our
swap is fast enough to compensate for worse starting conditions. We also show
how the CLARA and CLARANS algorithms benefit from the proposed modifications.
While we do not study the parallelization of our approach in this work, it can
easily be combined with earlier approaches to use PAM and CLARA on big data
(some of which use PAM as a subroutine, hence can immediately benefit from
these improvements), where the performance with high k becomes increasingly
important. In experiments on real data with k=100 and k=200, we observed
speedups of 458x and 1191x, respectively, compared to the original PAM SWAP
algorithm, making PAM applicable to larger data sets, and in particular to
higher k.
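
To make the medoid notion and the SWAP phase concrete, below is a minimal,
unoptimized Python sketch of classic PAM-style k-medoids on a precomputed
dissimilarity matrix. It illustrates the baseline algorithm the paper
accelerates, not the proposed FastPAM variants; all function and variable
names here are ours.

```python
import numpy as np

def medoid(D, members):
    """Medoid of a cluster: the member with the smallest total
    dissimilarity to all other members."""
    sub = D[np.ix_(members, members)]
    return members[int(np.argmin(sub.sum(axis=1)))]

def pam_swap(D, medoids, max_iter=100):
    """Naive PAM SWAP phase: greedily perform the best single
    (medoid, non-medoid) exchange until no exchange lowers the total
    deviation. One pass costs O(k (n-k) n) distance computations; the
    paper removes the O(k) factor by sharing work across all medoids
    via each point's nearest and second-nearest medoid, and its "eager"
    variant performs several improving swaps per pass instead of only
    the single best one."""
    medoids = list(medoids)
    n = D.shape[0]
    for _ in range(max_iter):
        cost = D[:, medoids].min(axis=1).sum()  # total deviation
        best_delta, best_i, best_x = 0.0, None, None
        for i in range(len(medoids)):
            for x in range(n):
                if x in medoids:
                    continue
                cand = medoids.copy()
                cand[i] = x
                delta = D[:, cand].min(axis=1).sum() - cost
                if delta < best_delta:
                    best_delta, best_i, best_x = delta, i, x
        if best_i is None:  # no improving swap left: converged
            break
        medoids[best_i] = best_x
    return medoids
```

CLARA, mentioned above, runs this procedure on random subsamples and keeps
the medoids that score best on the full data set, which is why a faster SWAP
benefits it directly.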
Related papers
- An Efficient Algorithm for Clustered Multi-Task Compressive Sensing [60.70532293880842]
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks.
The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions.
We propose a new algorithm that substantially accelerates model inference by avoiding the need to explicitly compute these covariance matrices.
arXiv Detail & Related papers (2023-09-30T15:57:14Z)
- Rethinking k-means from manifold learning perspective [122.38667613245151]
We present a new clustering algorithm that directly detects clusters of data without mean estimation.
Specifically, we construct the distance matrix between data points with a Butterworth filter.
To exploit the complementary information embedded in different views, we leverage tensor Schatten p-norm regularization.
arXiv Detail & Related papers (2023-05-12T03:01:41Z)
- GBMST: An Efficient Minimum Spanning Tree Clustering Based on Granular-Ball Computing [78.92205914422925]
We propose a clustering algorithm that combines multi-granularity granular-balls and a minimum spanning tree (MST).
We construct coarse-grained granular-balls, and then use the granular-balls and the MST to implement a clustering method based on "large-scale priority".
Experimental results on several data sets demonstrate the power of the algorithm.
arXiv Detail & Related papers (2023-03-02T09:04:35Z)
- Regularization and Optimization in Model-Based Clustering [4.096453902709292]
k-means algorithm variants essentially fit a mixture of identical spherical Gaussians to data that vastly deviates from such a distribution.
We develop more effective optimization algorithms for general GMMs, and we combine these algorithms with regularization strategies that avoid overfitting.
These results shed new light on the current status quo between GMM and k-means methods and suggest the more frequent use of general GMMs for data exploration.
arXiv Detail & Related papers (2023-02-05T18:22:29Z)
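
The entry above observes that k-means variants effectively fit a mixture of
identical spherical Gaussians. A small scikit-learn comparison can make that
relationship tangible; this sketch is our illustration, not code from the
cited paper (note that scikit-learn's "spherical" option still allows a
different variance per component, so it only approximates the k-means
assumption):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Three well-separated spherical blobs in 2-d.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in (0.0, 5.0, 10.0)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
gmm = GaussianMixture(n_components=3, covariance_type="spherical",
                      random_state=0).fit(X)

# On data like this the two labelings typically agree up to label
# permutation; general GMMs start to differ once clusters are
# elongated or have unequal spread.
print(km.labels_[:10])
print(gmm.predict(X)[:10])
```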
- A Review and Evaluation of Elastic Distance Functions for Time Series Clustering [0.0]
We describe nine commonly used elastic distance measures and compare their performance with k-means and k-medoids clustering.
The most popular technique, dynamic time warping (DTW), performs worse than Euclidean distance with k-means and, even when tuned, is no better.
Our conclusion is to recommend MSM with k-medoids as the benchmark algorithm for clustering time series with elastic distance measures.
arXiv Detail & Related papers (2022-05-30T15:32:55Z)
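
Since the entry above concerns elastic distance measures, a textbook
dynamic-programming DTW distance may help make "elastic" concrete. This is
the standard O(nm) formulation, not the implementation evaluated in that
paper:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic time warping between two 1-d series, using
    squared pointwise differences and an unconstrained warping path."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return float(np.sqrt(cost[n, m]))

# Example: a lag between otherwise similar series is "warped away".
t = np.linspace(0, 2 * np.pi, 50)
print(dtw_distance(np.sin(t), np.sin(t + 0.5)))
```

A distance like this plugs directly into k-medoids, which only requires a
pairwise dissimilarity matrix rather than coordinates.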
- K-Splits: Improved K-Means Clustering Algorithm to Automatically Detect the Number of Clusters [0.12313056815753944]
This paper introduces k-splits, an improved hierarchical algorithm based on k-means to cluster data without prior knowledge of the number of clusters.
Accuracy and speed are two main advantages of the proposed method.
arXiv Detail & Related papers (2021-10-09T23:02:57Z)
- Determinantal consensus clustering [77.34726150561087]
We propose the use of determinantal point processes (DPP) for the random restarts of clustering algorithms.
DPPs favor diversity of the center points within subsets.
We show through simulations that, contrary to DPP, uniform random sampling of centers fails both to ensure diversity and to obtain a good coverage of all data facets.
arXiv Detail & Related papers (2021-02-07T23:48:24Z)
- (k, l)-Medians Clustering of Trajectories Using Continuous Dynamic Time Warping [57.316437798033974]
In this work we consider the problem of center-based clustering of trajectories.
We propose the use of a continuous version of DTW as the distance measure, which we call continuous dynamic time warping (CDTW).
We show a practical way to compute a center from a set of trajectories and subsequently iteratively improve it.
arXiv Detail & Related papers (2020-12-01T13:17:27Z)
- Differentially Private Clustering: Tight Approximation Ratios [57.89473217052714]
We give efficient differentially private algorithms for basic clustering problems.
Our results imply an improved algorithm for the Sample and Aggregate privacy framework.
One of the tools used in our 1-Cluster algorithm can be employed to get a faster quantum algorithm for ClosestPair in a moderate number of dimensions.
arXiv Detail & Related papers (2020-08-18T16:22:06Z)
- An Efficient Smoothing Proximal Gradient Algorithm for Convex Clustering [2.5182813818441945]
The recently introduced convex clustering approach formulates clustering as a convex optimization problem.
State-of-the-art convex clustering algorithms require large amounts of computation and memory.
In this paper, we develop a very efficient smoothing proximal gradient algorithm (Sproga) for convex clustering.
arXiv Detail & Related papers (2020-06-22T20:02:59Z)
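
For context on the convex clustering entry above: in the standard
formulation (in the style of Chi and Lange), every point x_i gets a center
surrogate u_i, and pairwise differences between surrogates are penalized so
that centers fuse as the regularization grows. Below is a minimal sketch of
that objective, in our notation and not necessarily the exact variant the
cited paper optimizes:

```python
import numpy as np

def convex_clustering_objective(X, U, gamma, W):
    """0.5 * sum_i ||x_i - u_i||^2 + gamma * sum_{i<j} w_ij ||u_i - u_j||.
    X, U: (n, d) arrays of points and their center surrogates;
    W: (n, n) nonnegative weights; gamma: fusion strength. As gamma
    grows, rows of the optimal U fuse, revealing cluster structure."""
    fit = 0.5 * np.sum((X - U) ** 2)
    n = X.shape[0]
    fusion = sum(W[i, j] * np.linalg.norm(U[i] - U[j])
                 for i in range(n) for j in range(i + 1, n))
    return fit + gamma * fusion
```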
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.