A Nearly Tight Analysis of Greedy k-means++
- URL: http://arxiv.org/abs/2207.07949v1
- Date: Sat, 16 Jul 2022 13:49:07 GMT
- Title: A Nearly Tight Analysis of Greedy k-means++
- Authors: Christoph Grunau, Ahmet Alper Özüdoğru, Václav Rozhoň, Jakub Tětek
- Abstract summary: The $k$-means++ algorithm is known to return a $\Theta(\log k)$ approximate solution in expectation.
We present nearly matching lower and upper bounds for the greedy $k$-means++ algorithm.
- Score: 1.452875650827562
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The famous $k$-means++ algorithm of Arthur and Vassilvitskii [SODA 2007] is
the most popular way of solving the $k$-means problem in practice. The
algorithm is very simple: it samples the first center uniformly at random and
each of the following $k-1$ centers is then always sampled proportional to its
squared distance to the closest center so far. Afterward, Lloyd's iterative
algorithm is run. The $k$-means++ algorithm is known to return a $\Theta(\log
k)$ approximate solution in expectation.
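The seeding rule above is fully specified, so it can be written down directly. Below is a minimal NumPy sketch of $D^2$ sampling under the stated rule; function and variable names are illustrative, not from the paper, and Lloyd's iterations would be run from the returned centers.

```python
import numpy as np

def kmeanspp_seeding(points, k, rng=None):
    """D^2 sampling sketch: first center uniform at random, each further
    center sampled proportional to its squared distance to the closest
    center chosen so far."""
    rng = np.random.default_rng() if rng is None else rng
    n = points.shape[0]
    centers = [points[rng.integers(n)]]              # first center: uniform
    d2 = np.sum((points - centers[0]) ** 2, axis=1)  # dist^2 to nearest center
    for _ in range(k - 1):
        idx = rng.choice(n, p=d2 / d2.sum())         # the D^2 distribution
        centers.append(points[idx])
        # update each point's distance with the newly added center
        d2 = np.minimum(d2, np.sum((points - points[idx]) ** 2, axis=1))
    return np.asarray(centers)
```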
In their seminal work, Arthur and Vassilvitskii [SODA 2007] asked about the
guarantees for its following \emph{greedy} variant: in every step, we sample
$\ell$ candidate centers instead of one and then pick the one that minimizes
the new cost. This is also how $k$-means++ is implemented in e.g. the popular
Scikit-learn library [Pedregosa et al.; JMLR 2011].
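For comparison, here is a hedged sketch of the greedy variant just described: each step draws $\ell$ candidates from the same $D^2$ distribution and keeps the one minimizing the resulting cost. This mirrors the rule as stated above, not any particular library's implementation (Scikit-learn additionally picks a default number of candidate trials per step).

```python
import numpy as np

def greedy_kmeanspp_seeding(points, k, ell, rng=None):
    """Greedy variant sketch: per step, draw ell candidates from the D^2
    distribution and keep the one minimizing the new total cost."""
    rng = np.random.default_rng() if rng is None else rng
    n = points.shape[0]
    centers = [points[rng.integers(n)]]
    d2 = np.sum((points - centers[0]) ** 2, axis=1)
    for _ in range(k - 1):
        candidates = rng.choice(n, size=ell, p=d2 / d2.sum())
        best_d2, best_c = None, None
        for c in candidates:
            # cost after hypothetically adding candidate c
            new_d2 = np.minimum(d2, np.sum((points - points[c]) ** 2, axis=1))
            if best_d2 is None or new_d2.sum() < best_d2.sum():
                best_d2, best_c = new_d2, c      # lowest-cost candidate wins
        centers.append(points[best_c])
        d2 = best_d2
    return np.asarray(centers)
```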
We present nearly matching lower and upper bounds for the greedy $k$-means++:
We prove that it is an $O(\ell^3 \log^3 k)$-approximation algorithm. On the
other hand, we prove a lower bound of $\Omega(\ell^3 \log^3 k / \log^2(\ell\log
k))$. Previously, only an $\Omega(\ell \log k)$ lower bound was known
[Bhattacharya, Eube, R\"oglin, Schmidt; ESA 2020] and there was no known upper
bound.
Related papers
- A Scalable Algorithm for Individually Fair K-means Clustering [77.93955971520549]
We present a scalable algorithm for the individually fair ($p$, $k$)-clustering problem introduced by Jung et al. and Mahabadi et al.
A clustering is then called individually fair if it has centers within distance $\delta(x)$ of $x$ for each $x \in P$.
We show empirically that not only is our algorithm much faster than prior work, but it also produces lower-cost solutions.
arXiv Detail & Related papers (2024-02-09T19:01:48Z) - Do you know what q-means? [50.045011844765185]
Clustering is one of the most important tools for analysis of large datasets.
We present an improved version of the "$q$-means" algorithm for clustering.
We also present a "dequantized" algorithm for $\varepsilon$-$k$-means which runs in $O\big(\frac{k^{2}}{\varepsilon^{2}}(\sqrt{k}d + \log(Nd))\big)$ time.
arXiv Detail & Related papers (2023-08-18T17:52:12Z) - A Faster $k$-means++ Algorithm [11.428775569173638]
We present a new algorithm that can solve the $k$-means++ problem with nearly optimal running time.
We propose a new algorithm \textsc{FastKmeans++} that takes only $\widetilde{O}(nd + nk^{2})$ time in total.
arXiv Detail & Related papers (2022-11-28T08:17:12Z) - Logarithmic Regret from Sublinear Hints [76.87432703516942]
We show that an algorithm can obtain $O(\log T)$ regret with just $O(\sqrt{T})$ hints under a natural query model.
We also show that $o(\sqrt{T})$ hints cannot guarantee better than $\Omega(\sqrt{T})$ regret.
arXiv Detail & Related papers (2021-11-09T16:50:18Z) - Near-optimal Algorithms for Explainable k-Medians and k-Means [12.68470213641421]
We consider the problem of explainable $k$-medians and $k$-means introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian (ICML 2020).
Our goal is to find an \emph{optimal} decision tree that partitions data into $k$ clusters and minimizes the $k$-medians or $k$-means objective.
arXiv Detail & Related papers (2021-07-02T02:07:12Z) - Almost Tight Approximation Algorithms for Explainable Clustering [16.22135057266913]
We study a recent framework of explainable clustering first suggested by Dasgupta et al.
Specifically, we focus on the $k$-means and $k$-medians problems and provide nearly tight upper and lower bounds.
arXiv Detail & Related papers (2021-07-01T23:49:23Z) - Clustering Mixture Models in Almost-Linear Time via List-Decodable Mean
Estimation [58.24280149662003]
We study the problem of list-decodable mean estimation, where an adversary can corrupt a majority of the dataset.
We develop new algorithms for list-decodable mean estimation, achieving nearly-optimal statistical guarantees.
arXiv Detail & Related papers (2021-06-16T03:34:14Z) - Streaming Complexity of SVMs [110.63976030971106]
We study the space complexity of solving the bias-regularized SVM problem in the streaming model.
We show that for both problems, for dimensions $d = \frac{1}{\lambda\varepsilon}$, one can obtain streaming algorithms with space polynomially smaller than $\frac{1}{\lambda\varepsilon}$.
arXiv Detail & Related papers (2020-07-07T17:10:00Z) - Adapting $k$-means algorithms for outliers [1.9290392443571387]
We show how to adapt several simple sampling-based algorithms for the $k$-means problem to the setting with outliers.
Our algorithms output $(1+\varepsilon)z$ outliers while achieving an $O(1/\varepsilon)$-approximation to the objective function.
arXiv Detail & Related papers (2020-07-02T14:14:33Z) - Agnostic Q-learning with Function Approximation in Deterministic
Systems: Tight Bounds on Approximation Error and Sample Complexity [94.37110094442136]
We study the problem of agnostic $Q$-learning with function approximation in deterministic systems.
We show that if $\delta = O\left(\rho/\sqrt{\mathrm{dim}_E}\right)$, then one can find the optimal policy using $O\left(\mathrm{dim}_E\right)$ trajectories.
arXiv Detail & Related papers (2020-02-17T18:41:49Z)