Fast and Accurate $k$-means++ via Rejection Sampling
- URL: http://arxiv.org/abs/2012.11891v1
- Date: Tue, 22 Dec 2020 09:14:41 GMT
- Title: Fast and Accurate $k$-means++ via Rejection Sampling
- Authors: Vincent Cohen-Addad and Silvio Lattanzi and Ashkan Norouzi-Fard and
Christian Sohler and Ola Svensson
- Abstract summary: $k$-means++ sometimes suffers from being slow on large datasets.
We present a near-linear time algorithm for $k$-means++ seeding.
- Score: 44.155614837392214
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: $k$-means++ \cite{arthur2007k} is a widely used clustering algorithm that is
easy to implement, has nice theoretical guarantees and strong empirical
performance. Despite its wide adoption, $k$-means++ sometimes suffers from
being slow on large datasets, so a natural question has been to obtain more
efficient algorithms with similar guarantees. In this paper, we present a
near-linear time algorithm for $k$-means++ seeding. Interestingly, our algorithm
obtains the same theoretical guarantees as $k$-means++ and significantly
improves earlier results on fast $k$-means++ seeding. Moreover, we show
empirically that our algorithm is significantly faster than $k$-means++ and
obtains solutions of equivalent quality.
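The seeding step the paper accelerates is the classic $D^2$ sampling of Arthur and Vassilvitskii: each new center is drawn with probability proportional to a point's squared distance to its nearest already-chosen center. Below is a minimal sketch of that baseline seeding, not the paper's near-linear-time variant (which, per the abstract, replaces the exact $D^2$ draw with a cheaper proposal plus a rejection step); function and parameter names are illustrative.

```python
import numpy as np

def kmeanspp_seeding(X, k, rng=None):
    """Classic k-means++ (D^2) seeding sketch.

    Each new center is sampled with probability proportional to the
    squared distance from a point to its nearest chosen center.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]                     # first center: uniform
    d2 = np.sum((X - centers[0]) ** 2, axis=1)         # squared dists to nearest center
    for _ in range(k - 1):
        idx = rng.choice(n, p=d2 / d2.sum())           # D^2 distribution
        centers.append(X[idx])
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
    return np.array(centers)
```

Each of the $k$ rounds scans all $n$ points, so the baseline costs $O(nkd)$ time; the paper's contribution is avoiding that full scan per round while keeping the same $O(\log k)$-approximation guarantee.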
Related papers
- Almost-linear Time Approximation Algorithm to Euclidean $k$-median and $k$-means [4.271492285528115]
We focus on the Euclidean $k$-median and $k$-means problems, two of the standard ways to model the task of clustering.
In this paper, we almost answer this question by presenting an almost-linear-time algorithm that computes a constant-factor approximation.
arXiv Detail & Related papers (2024-07-15T20:04:06Z)
- Perturb-and-Project: Differentially Private Similarities and Marginals [73.98880839337873]
We revisit the input perturbations framework for differential privacy, where noise is added to the input $A \in \mathcal{S}$.
We first design novel efficient algorithms to privately release pair-wise cosine similarities.
We derive a novel algorithm to compute $k$-way marginal queries over $n$ features.
arXiv Detail & Related papers (2024-06-07T12:07:16Z)
- A Scalable Algorithm for Individually Fair K-means Clustering [77.93955971520549]
We present a scalable algorithm for the individually fair ($p$, $k$)-clustering problem introduced by Jung et al. and Mahabadi et al.
A clustering is then called individually fair if it has centers within distance $\delta(x)$ of $x$ for each $x \in P$.
We show empirically that not only is our algorithm much faster than prior work, but it also produces lower-cost solutions.
arXiv Detail & Related papers (2024-02-09T19:01:48Z)
- Multi-Swap $k$-Means++ [30.967186562175893]
The $k$-means++ algorithm of Arthur and Vassilvitskii (SODA 2007) is often the practitioners' choice algorithm for optimizing the popular $k$-means clustering objective.
Lattanzi and Sohler (ICML 2019) proposed augmenting $k$-means++ with $O(k \log \log k)$ local search steps to yield a $c$-approximation to the $k$-means clustering problem.
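The local-search augmentation described above can be sketched as follows: $D^2$-sample a candidate point, try swapping it with each current center, and keep the best swap only if it lowers the $k$-means cost. A minimal single-swap sketch under those assumptions (function names are illustrative, not from the paper):

```python
import numpy as np

def kmeans_cost(X, centers):
    # Sum of squared distances from each point to its nearest center.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).sum()

def local_search_step(X, centers, rng):
    """One illustrative local-search step: D^2-sample a candidate,
    evaluate swapping it against each current center, and keep the
    best swap if it strictly improves the k-means cost."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(axis=1)
    cand = X[rng.choice(len(X), p=d2 / d2.sum())]      # D^2-sampled candidate
    best_cost, best = kmeans_cost(X, centers), centers
    for j in range(len(centers)):
        trial = centers.copy()
        trial[j] = cand                                # swap candidate for center j
        c = kmeans_cost(X, trial)
        if c < best_cost:
            best_cost, best = c, trial
    return best
```

Repeating such steps $O(k \log \log k)$ times is what drives the approximation factor down from $O(\log k)$ to a constant in the Lattanzi-Sohler analysis.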
arXiv Detail & Related papers (2023-09-28T12:31:35Z)
- Do you know what q-means? [50.045011844765185]
Clustering is one of the most important tools for analysis of large datasets.
We present an improved version of the "$q$-means" algorithm for clustering.
We also present a "dequantized" algorithm for $\varepsilon$-$k$-means which runs in $O\big(\frac{k^2}{\varepsilon^2}(\sqrt{k}d + \log(Nd))\big)$ time.
arXiv Detail & Related papers (2023-08-18T17:52:12Z)
- Private estimation algorithms for stochastic block models and mixture models [63.07482515700984]
We provide general tools for designing efficient private estimation algorithms.
We give the first efficient $(\epsilon, \delta)$-differentially private algorithm for both weak recovery and exact recovery.
arXiv Detail & Related papers (2023-01-11T09:12:28Z)
- A Faster $k$-means++ Algorithm [11.428775569173638]
We present a new algorithm that can solve the $k$-means++ problem with nearly optimal running time.
We propose a new algorithm, \textsc{FastKmeans++}, that takes only $\widetilde{O}(nd + nk^2)$ time in total.
arXiv Detail & Related papers (2022-11-28T08:17:12Z)
- Global $k$-means$++$: an effective relaxation of the global $k$-means clustering algorithm [0.20305676256390928]
The $k$-means algorithm is a prevalent clustering method due to its simplicity, effectiveness, and speed.
We propose the \emph{global} $k$-means\texttt{++} clustering algorithm, which is an effective way of acquiring quality clustering solutions.
arXiv Detail & Related papers (2022-11-22T13:42:53Z)
- Scalable Differentially Private Clustering via Hierarchically Separated Trees [82.69664595378869]
We show that our method computes a solution with cost at most $O(d^{3/2}\log n)\cdot OPT + O(k d^2 \log^2 n / \epsilon^2)$, where $\epsilon$ is the privacy guarantee.
Although the worst-case guarantee is worse than that of state-of-the-art private clustering methods, the algorithm we propose is practical.
arXiv Detail & Related papers (2022-06-17T09:24:41Z)
- Improved Guarantees for k-means++ and k-means++ Parallel [18.26254785549146]
k-means++ and k-means++ parallel are popular algorithms for the classic k-means clustering problem.
We show improved approximation and bi-criteria approximation guarantees for k-means++ and k-means++ parallel.
We propose a new variant of k-means++ parallel algorithm (Exponential Race k-means++) that has the same approximation guarantees as k-means++.
arXiv Detail & Related papers (2020-10-27T17:46:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all generated summaries) and is not responsible for any consequences arising from its use.