Fast and Accurate $k$-means++ via Rejection Sampling
- URL: http://arxiv.org/abs/2012.11891v1
- Date: Tue, 22 Dec 2020 09:14:41 GMT
- Title: Fast and Accurate $k$-means++ via Rejection Sampling
- Authors: Vincent Cohen-Addad and Silvio Lattanzi and Ashkan Norouzi-Fard and
Christian Sohler and Ola Svensson
- Abstract summary: $k$-means++ sometimes suffers from being slow on large datasets.
We present a near-linear time algorithm for $k$-means++ seeding.
- Score: 44.155614837392214
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: $k$-means++ \cite{arthur2007k} is a widely used clustering algorithm that is
easy to implement, has nice theoretical guarantees and strong empirical
performance. Despite its wide adoption, $k$-means++ sometimes suffers from
being slow on large datasets, so a natural question has been to obtain more
efficient algorithms with similar guarantees. In this paper, we present a
near-linear time algorithm for $k$-means++ seeding. Interestingly, our algorithm
obtains the same theoretical guarantees as $k$-means++ and significantly
improves earlier results on fast $k$-means++ seeding. Moreover, we show
empirically that our algorithm is significantly faster than $k$-means++ and
obtains solutions of equivalent quality.
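The seeding step the paper accelerates is the classic $D^2$ sampling of Arthur and Vassilvitskii: each new center is drawn with probability proportional to a point's squared distance to its nearest already-chosen center. Below is a minimal sketch of that baseline seeding, not the paper's near-linear-time variant (which, per the abstract, replaces the exact $D^2$ draw with a cheaper proposal plus a rejection step); function and parameter names are illustrative.

```python
import numpy as np

def kmeanspp_seeding(X, k, rng=None):
    """Classic k-means++ (D^2) seeding sketch.

    Each new center is sampled with probability proportional to the
    squared distance from a point to its nearest chosen center.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]                     # first center: uniform
    d2 = np.sum((X - centers[0]) ** 2, axis=1)         # squared dists to nearest center
    for _ in range(k - 1):
        idx = rng.choice(n, p=d2 / d2.sum())           # D^2 distribution
        centers.append(X[idx])
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
    return np.array(centers)
```

Each of the $k$ rounds scans all $n$ points, so the baseline costs $O(nkd)$ time; the paper's contribution is avoiding that full scan per round while keeping the same $O(\log k)$-approximation guarantee.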
Related papers
- Almost-linear Time Approximation Algorithm to Euclidean $k$-median and $k$-means [4.271492285528115]
We focus on the Euclidean $k$-median and $k$-means problems, two of the standard ways to model the task of clustering.
In this paper, we almost answer this question by presenting an almost-linear-time algorithm that computes a constant-factor approximation.
arXiv Detail & Related papers (2024-07-15T20:04:06Z)
- Perturb-and-Project: Differentially Private Similarities and Marginals [73.98880839337873]
We revisit the input perturbations framework for differential privacy, where noise is added to the input $A \in \mathcal{S}$.
We first design novel efficient algorithms to privately release pair-wise cosine similarities.
We derive a novel algorithm to compute $k$-way marginal queries over $n$ features.
arXiv Detail & Related papers (2024-06-07T12:07:16Z)
- A Scalable Algorithm for Individually Fair K-means Clustering [77.93955971520549]
We present a scalable algorithm for the individually fair ($p$, $k$)-clustering problem introduced by Jung et al. and Mahabadi et al.
A clustering is then called individually fair if it has centers within distance $\delta(x)$ of $x$ for each $x \in P$.
We show empirically that not only is our algorithm much faster than prior work, but it also produces lower-cost solutions.
arXiv Detail & Related papers (2024-02-09T19:01:48Z)
- Multi-Swap $k$-Means++ [30.967186562175893]
The $k$-means++ algorithm of Arthur and Vassilvitskii (SODA 2007) is often the practitioners' choice algorithm for optimizing the popular $k$-means clustering objective.
Lattanzi and Sohler (ICML 2019) proposed augmenting $k$-means++ with $O(k \log \log k)$ local search steps to yield a $c$-approximation to the $k$-means clustering problem.
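The local-search augmentation described above can be sketched as follows: $D^2$-sample a candidate point, try swapping it with each current center, and keep the best swap only if it lowers the $k$-means cost. A minimal single-swap sketch under those assumptions (function names are illustrative, not from the paper):

```python
import numpy as np

def kmeans_cost(X, centers):
    # Sum of squared distances from each point to its nearest center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).sum()

def local_search_step(X, centers, rng):
    """One illustrative local-search step: D^2-sample a candidate,
    evaluate swapping it against each current center, and keep the
    best swap if it strictly improves the k-means cost."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(axis=1)
    cand = X[rng.choice(len(X), p=d2 / d2.sum())]      # D^2-sampled candidate
    best_cost, best = kmeans_cost(X, centers), centers
    for j in range(len(centers)):
        trial = centers.copy()
        trial[j] = cand                                # swap candidate for center j
        c = kmeans_cost(X, trial)
        if c < best_cost:
            best_cost, best = c, trial
    return best
```

Repeating such steps $O(k \log \log k)$ times is what drives the approximation factor down from $O(\log k)$ to a constant in the Lattanzi-Sohler analysis.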
arXiv Detail & Related papers (2023-09-28T12:31:35Z)
- Do you know what q-means? [50.045011844765185]
Clustering is one of the most important tools for analysis of large datasets.
We present an improved version of the "$q$-means" algorithm for clustering.
We also present a "dequantized" algorithm for $\varepsilon$-$k$-means which runs in $O\big(\frac{k^2}{\varepsilon^2}(\sqrt{k}d + \log(Nd))\big)$ time.
arXiv Detail & Related papers (2023-08-18T17:52:12Z)
- Private estimation algorithms for stochastic block models and mixture models [63.07482515700984]
We provide general tools for designing efficient private estimation algorithms.
We give the first efficient $(\epsilon, \delta)$-differentially private algorithm for both weak recovery and exact recovery.
arXiv Detail & Related papers (2023-01-11T09:12:28Z)
- A Faster $k$-means++ Algorithm [11.428775569173638]
We present a new algorithm that can solve the $k$-means++ problem with nearly optimal running time.
We propose a new algorithm, \textsc{FastKmeans++}, that takes only $\widetilde{O}(nd + nk^2)$ time in total.
arXiv Detail & Related papers (2022-11-28T08:17:12Z)
- Global $k$-means$++$: an effective relaxation of the global $k$-means clustering algorithm [0.20305676256390928]
The $k$-means algorithm is a prevalent clustering method due to its simplicity, effectiveness, and speed.
We propose the \emph{global} $k$-means\texttt{++} clustering algorithm, which is an effective way of acquiring quality clustering solutions.
arXiv Detail & Related papers (2022-11-22T13:42:53Z)
- Scalable Differentially Private Clustering via Hierarchically Separated Trees [82.69664595378869]
We show that our method computes a solution with cost at most $O(d^{3/2}\log n)\cdot OPT + O(k d^2 \log^2 n / \epsilon^2)$, where $\epsilon$ is the privacy guarantee.
Although the worst-case guarantee is worse than that of state-of-the-art private clustering methods, the algorithm we propose is practical.
arXiv Detail & Related papers (2022-06-17T09:24:41Z)
- Improved Guarantees for k-means++ and k-means++ Parallel [18.26254785549146]
k-means++ and k-means++ parallel are popular algorithms for the classic k-means clustering problem.
We show improved approximation and bi-criteria approximation guarantees for k-means++ and k-means++ parallel.
We propose a new variant of k-means++ parallel algorithm (Exponential Race k-means++) that has the same approximation guarantees as k-means++.
arXiv Detail & Related papers (2020-10-27T17:46:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all generated summaries) and is not responsible for any consequences arising from its use.