Mini-batch $k$-means terminates within $O(d/\epsilon)$ iterations
- URL: http://arxiv.org/abs/2304.00419v1
- Date: Sun, 2 Apr 2023 00:58:29 GMT
- Title: Mini-batch $k$-means terminates within $O(d/\epsilon)$ iterations
- Authors: Gregory Schwartzman
- Abstract summary: We consider mini-batch $k$-means which terminates only when the improvement in the quality of the clustering on the sampled batch is below some threshold.
Although at first glance it appears that this algorithm might run forever, we show that it terminates within a bounded number of iterations.
We show the applicability of our results to the mini-batch $k$-means algorithm implemented in the scikit-learn (sklearn) Python library.
- Score: 0.07614628596146598
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We answer the question: "Does local progress (on batches) imply global
progress (on the entire dataset) for mini-batch $k$-means?". Specifically, we
consider mini-batch $k$-means which terminates only when the improvement in the
quality of the clustering on the sampled batch is below some threshold.
Although at first glance it appears that this algorithm might execute
forever, we answer the above question in the affirmative and show that if the
batch is of size $\tilde{\Omega}((d/\epsilon)^2)$, it must terminate within
$O(d/\epsilon)$ iterations with high probability, where $d$ is the dimension of
the input, and $\epsilon$ is a threshold parameter for termination. This is
true regardless of how the centers are initialized. When the algorithm is
initialized with the $k$-means++ initialization scheme, it achieves an
approximation ratio of $O(\log k)$ (the same as the full-batch version).
Finally, we show the applicability of our results to the mini-batch $k$-means
algorithm implemented in the scikit-learn (sklearn) Python library.
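The termination rule analyzed in the paper is simple to state in code. Below is a minimal NumPy sketch of mini-batch $k$-means that stops once a single iteration's improvement in the clustering cost on the sampled batch falls below a threshold $\epsilon$. The function names, the uniform random initialization, and the per-center learning-rate schedule are illustrative assumptions, not the paper's exact procedure or scikit-learn's implementation.

```python
import numpy as np

def batch_cost(batch, centers):
    """Mean squared distance from each batch point to its nearest center."""
    dists = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.min(axis=1).mean()

def minibatch_kmeans(X, k, batch_size, eps, seed=0):
    """Mini-batch k-means that terminates when the improvement on the
    sampled batch drops below eps (the termination rule described above)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Uniform random initialization for simplicity; the paper's O(log k)
    # approximation guarantee instead assumes k-means++ initialization.
    centers = X[rng.choice(n, size=k, replace=False)].astype(float)
    counts = np.zeros(k)  # how many points each center has absorbed

    while True:
        batch = X[rng.choice(n, size=batch_size, replace=False)]
        cost_before = batch_cost(batch, centers)

        # Assign batch points to their nearest centers, then move each
        # center toward its batch mean with a 1/count learning rate.
        dists = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in np.unique(labels):
            pts = batch[labels == j]
            counts[j] += len(pts)
            centers[j] += (len(pts) / counts[j]) * (pts.mean(axis=0) - centers[j])

        # Terminate when local progress on this batch falls below eps.
        if cost_before - batch_cost(batch, centers) < eps:
            return centers

# Example usage (hypothetical data):
# centers = minibatch_kmeans(np.random.rand(10000, 5), k=10,
#                            batch_size=2048, eps=1e-4)
```

By the paper's result, a batch of size $\tilde{\Omega}((d/\epsilon)^2)$ makes a loop of this form halt within $O(d/\epsilon)$ iterations with high probability, regardless of initialization. For comparison, scikit-learn's `MiniBatchKMeans` exposes related stopping controls through its `tol` and `max_no_improvement` parameters.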
Related papers
- Mini-Batch Kernel $k$-means [4.604003661048267]
A single iteration of our algorithm takes $\widetilde{O}(kb^2)$ time, significantly faster than the $O(n^2)$ time required by full-batch kernel $k$-means.
Experiments demonstrate that our algorithm consistently achieves a 10-100x speedup with minimal loss in quality.
arXiv Detail & Related papers (2024-10-08T10:59:14Z) - Multiple-policy Evaluation via Density Estimation [30.914344538340412]
We propose an algorithm named $\mathrm{CAESAR}$ for this problem.
Up to low-order and logarithmic terms, $\mathrm{CAESAR}$ achieves a sample complexity of $\tilde{O}\left(\frac{H^4}{\epsilon^2}\sum_{h=1}^{H}\max_{k\in[K]}\sum_{s,a}\frac{(d_h^{\pi_k}(s,a))^2}{\mu^*_h(s,a)}\right)$, where $d_h^{\pi_k}$ is the visitation distribution of policy $\pi_k$ at step $h$.
arXiv Detail & Related papers (2024-03-29T23:55:25Z) - A Scalable Algorithm for Individually Fair K-means Clustering [77.93955971520549]
We present a scalable algorithm for the individually fair ($p$, $k$)-clustering problem introduced by Jung et al. and Mahabadi et al.
A clustering is then called individually fair if it has centers within distance $\delta(x)$ of $x$ for each $x \in P$.
We show empirically that not only is our algorithm much faster than prior work, but it also produces lower-cost solutions.
arXiv Detail & Related papers (2024-02-09T19:01:48Z) - Sketching Algorithms for Sparse Dictionary Learning: PTAS and Turnstile
Streaming [48.18845814885398]
We develop new techniques to extend the applicability of sketching-based approaches to sparse dictionary learning and the Euclidean $k$-means clustering problems.
On the fast algorithms front, we obtain a new approach for designing PTASs for the $k$-means clustering problem.
On the streaming algorithms front, we obtain new upper bounds and lower bounds for dictionary learning and $k$-means clustering.
arXiv Detail & Related papers (2023-10-29T16:46:26Z) - Simple, Scalable and Effective Clustering via One-Dimensional
Projections [10.807367640692021]
Clustering is a fundamental problem in unsupervised machine learning with many applications in data analysis.
We introduce a simple randomized clustering algorithm that provably runs in expected time $O(\mathrm{nnz}(X) + n\log n)$ for arbitrary $k$.
We prove that our algorithm achieves approximation ratio $\widetilde{O}(k^4)$ on any input dataset for the $k$-means objective.
arXiv Detail & Related papers (2023-10-25T16:37:45Z) - Data Structures for Density Estimation [66.36971978162461]
Given a sublinear (in $n$) number of samples from $p$, our main result is the first data structure that identifies $v_i$ in time sublinear in $k$.
We also give an improved version of the algorithm of Acharya et al. that reports $v_i$ in time linear in $k$.
arXiv Detail & Related papers (2023-06-20T06:13:56Z) - A Nearly Tight Analysis of Greedy k-means++ [1.452875650827562]
The $k$-means++ algorithm is known to return a $\Theta(\log k)$-approximate solution in expectation.
We present nearly matching lower and upper bounds for the greedy $k$-means++ algorithm.
arXiv Detail & Related papers (2022-07-16T13:49:07Z) - Clustering Mixture Models in Almost-Linear Time via List-Decodable Mean
Estimation [58.24280149662003]
We study the problem of list-decodable mean estimation, where an adversary can corrupt a majority of the dataset.
We develop new algorithms for list-decodable mean estimation, achieving nearly-optimal statistical guarantees.
arXiv Detail & Related papers (2021-06-16T03:34:14Z) - Improved Sample Complexity for Incremental Autonomous Exploration in
MDPs [132.88757893161699]
We learn the set of $\epsilon$-optimal goal-conditioned policies attaining all states that are incrementally reachable within $L$ steps.
DisCo is the first algorithm that can return an $\epsilon/c_{\min}$-optimal policy for any cost-sensitive shortest-path problem.
arXiv Detail & Related papers (2020-12-29T14:06:09Z) - List-Decodable Mean Estimation in Nearly-PCA Time [50.79691056481693]
We study the fundamental task of list-decodable mean estimation in high dimensions.
Our algorithm runs in time $\widetilde{O}(ndk)$ for all $k = O(\sqrt{d}) \cup \Omega(d)$, where $n$ is the size of the dataset.
A variant of our algorithm has runtime $\widetilde{O}(ndk)$ for all $k$, at the expense of an $O(\sqrt{\log k})$ factor in the recovery guarantee.
arXiv Detail & Related papers (2020-11-19T17:21:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.