Exact Acceleration of K-Means++ and K-Means$\|$
- URL: http://arxiv.org/abs/2105.02936v1
- Date: Thu, 6 May 2021 20:22:55 GMT
- Title: Exact Acceleration of K-Means++ and K-Means$\|$
- Authors: Edward Raff
- Abstract summary: K-Means++ and K-Means$\|$ have become de facto tools for selecting the initial seeds of K-means.
We develop specialized triangle inequality pruning strategies and a dynamic priority queue to show the first acceleration of K-Means++ and K-Means$\|$.
- Score: 22.66983713481359
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: K-Means++ and its distributed variant K-Means$\|$ have become de facto tools
for selecting the initial seeds of K-means. While alternatives have been
developed, the effectiveness, ease of implementation, and theoretical grounding
of the K-means++ and $\|$ methods have made them difficult to "best" from a
holistic perspective. By considering the limited opportunities within seed
selection to perform pruning, we develop specialized triangle inequality
pruning strategies and a dynamic priority queue to show the first acceleration
of K-Means++ and K-Means$\|$ that is faster in run-time while being
algorithmically equivalent. For both algorithms we are able to reduce distance
computations by over $500\times$. For K-means++ this results in up to a
17$\times$ speedup in run-time and a $551\times$ speedup for K-means$\|$. We
achieve this with simple, but carefully chosen, modifications to known
techniques which makes it easy to integrate our approach into existing
implementations of these algorithms.
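The pruning idea described in the abstract can be illustrated with a minimal sketch of k-means++ seeding. This is not the paper's implementation (which also uses a dynamic priority queue); it only shows the core triangle inequality skip test, with all function and variable names being illustrative assumptions:

```python
import numpy as np

def kmeans_pp_pruned(X, k, rng=None):
    """K-means++ seeding with a triangle-inequality skip test (sketch).

    Each point x caches d(x, c_near), its distance to the nearest chosen
    center. When a new center c is drawn, the triangle inequality gives
    d(x, c) >= d(c_near, c) - d(x, c_near), so whenever
    d(c_near, c) >= 2 * d(x, c_near) the new center cannot be closer and
    the distance computation for x is skipped entirely.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]
    near = np.zeros(n, dtype=int)                # index of nearest chosen center
    dist = np.linalg.norm(X - centers[0], axis=1)
    skipped = 0
    for j in range(1, k):
        # Sample the next seed with probability proportional to D(x)^2.
        p = dist ** 2
        c = X[rng.choice(n, p=p / p.sum())]
        # Distance from the new center to every existing center, computed once.
        cc = np.array([np.linalg.norm(c - ctr) for ctr in centers])
        for i in range(n):
            if cc[near[i]] >= 2.0 * dist[i]:     # pruned: c is provably no closer
                skipped += 1
                continue
            d = np.linalg.norm(X[i] - c)
            if d < dist[i]:
                dist[i], near[i] = d, j
        centers.append(c)
    return np.array(centers), skipped

centers, skipped = kmeans_pp_pruned(
    np.random.default_rng(0).normal(size=(200, 2)), 5, rng=0)
```

Because the skip test only suppresses distance computations that provably cannot change a point's nearest center, the selected seeds are distributionally identical to unpruned k-means++, which is the sense of "algorithmically equivalent" claimed in the abstract.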
Related papers
- Almost-linear Time Approximation Algorithm to Euclidean $k$-median and $k$-means [4.271492285528115]
We focus on the Euclidean $k$-median and $k$-means problems, two of the standard ways to model the task of clustering.
In this paper, we almost answer this question by presenting an almost linear-time algorithm to compute a constant-factor approximation.
arXiv Detail & Related papers (2024-07-15T20:04:06Z) - A Scalable Algorithm for Individually Fair K-means Clustering [77.93955971520549]
We present a scalable algorithm for the individually fair ($p$, $k$)-clustering problem introduced by Jung et al. and Mahabadi et al.
A clustering is then called individually fair if it has centers within distance $\delta(x)$ of $x$ for each $x \in P$.
We show empirically that not only is our algorithm much faster than prior work, but it also produces lower-cost solutions.
arXiv Detail & Related papers (2024-02-09T19:01:48Z) - Private estimation algorithms for stochastic block models and mixture
models [63.07482515700984]
General tools for designing efficient private estimation algorithms.
First efficient $(\epsilon, \delta)$-differentially private algorithm for both weak recovery and exact recovery.
arXiv Detail & Related papers (2023-01-11T09:12:28Z) - Scalable Differentially Private Clustering via Hierarchically Separated
Trees [82.69664595378869]
We show that our method computes a solution with cost at most $O(d^{3/2}\log n)\cdot \mathrm{OPT} + O(k d^2 \log^2 n / \epsilon^2)$, where $\epsilon$ is the privacy guarantee.
Although the worst-case guarantee is worse than that of state of the art private clustering methods, the algorithm we propose is practical.
arXiv Detail & Related papers (2022-06-17T09:24:41Z) - K-Splits: Improved K-Means Clustering Algorithm to Automatically Detect
the Number of Clusters [0.12313056815753944]
This paper introduces k-splits, an improved hierarchical algorithm based on k-means to cluster data without prior knowledge of the number of clusters.
Accuracy and speed are two main advantages of the proposed method.
arXiv Detail & Related papers (2021-10-09T23:02:57Z) - Clustering Mixture Models in Almost-Linear Time via List-Decodable Mean
Estimation [58.24280149662003]
We study the problem of list-decodable mean estimation, where an adversary can corrupt a majority of the dataset.
We develop new algorithms for list-decodable mean estimation, achieving nearly-optimal statistical guarantees.
arXiv Detail & Related papers (2021-06-16T03:34:14Z) - List-Decodable Mean Estimation in Nearly-PCA Time [50.79691056481693]
We study the fundamental task of list-decodable mean estimation in high dimensions.
Our algorithm runs in time $\widetilde{O}(ndk)$ for all $k = O(\sqrt{d}) \cup \Omega(d)$, where $n$ is the size of the dataset.
A variant of our algorithm has runtime $\widetilde{O}(ndk)$ for all $k$, at the expense of an $O(\sqrt{\log k})$ factor in the recovery guarantee.
arXiv Detail & Related papers (2020-11-19T17:21:37Z) - Differentially Private Clustering: Tight Approximation Ratios [57.89473217052714]
We give efficient differentially private algorithms for basic clustering problems.
Our results imply an improved algorithm for the Sample and Aggregate privacy framework.
One of the tools used in our 1-Cluster algorithm can be employed to get a faster quantum algorithm for ClosestPair in a moderate number of dimensions.
arXiv Detail & Related papers (2020-08-18T16:22:06Z) - Breathing K-Means: Superior K-Means Solutions through Dynamic K-Values [0.0]
We introduce the breathing k-means algorithm, which on average significantly improves solutions obtained by the widely-known greedy k-means++ algorithm.
The improvements are achieved through a novel "breathing" technique that cyclically increases and decreases the number of centroids based on local error and utility measures.
arXiv Detail & Related papers (2020-06-28T17:49:37Z) - Improving The Performance Of The K-means Algorithm [2.28438857884398]
My thesis proposes two algorithms to speed up IKM while approximately preserving the quality of its clustering results.
The first algorithm, called Divisive K-means, improves the speed of IKM by speeding up its splitting process of clusters.
The second algorithm, called Parallel Two-Phase K-means (Par2PK-means), parallelizes IKM by employing the model of Two-Phase K-means.
arXiv Detail & Related papers (2020-05-10T15:09:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.