No-Substitution $k$-means Clustering with Low Center Complexity and
Memory
- URL: http://arxiv.org/abs/2102.09101v1
- Date: Thu, 18 Feb 2021 01:20:03 GMT
- Title: No-Substitution $k$-means Clustering with Low Center Complexity and
Memory
- Authors: Robi Bhattacharjee and Jacob Imola
- Abstract summary: We develop an algorithm with center complexity bounded by $Lower_{36, k}(X)$ that is a true $O(1)$-approximation, with an approximation factor independent of $k$ and $n$; it is the first known algorithm with these guarantees.
- Score: 5.837881923712394
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Clustering is a fundamental task in machine learning. Given a dataset $X =
\{x_1, \ldots x_n\}$, the goal of $k$-means clustering is to pick $k$ "centers"
from $X$ in a way that minimizes the sum of squared distances from each point
to its nearest center. We consider $k$-means clustering in the online, no
substitution setting, where one must decide whether to take $x_t$ as a center
immediately upon streaming it and cannot remove centers once taken.
The online, no substitution setting is challenging for clustering--one can
show that there exist datasets $X$ for which any $O(1)$-approximation $k$-means
algorithm must have center complexity $\Omega(n)$, meaning that it takes
$\Omega(n)$ centers in expectation. Bhattacharjee and Moshkovitz (2020) refined
this bound by defining a complexity measure called $Lower_{\alpha, k}(X)$, and
proving that any $\alpha$-approximation algorithm must have center complexity
$\Omega(Lower_{\alpha, k}(X))$. They then complemented their lower bound by
giving an $O(k^3)$-approximation algorithm with center complexity
$\tilde{O}(k^2Lower_{k^3, k}(X))$, thus showing that their parameter is a tight
measure of required center complexity. However, a major drawback of their
algorithm is its memory requirement, which is $O(n)$. This makes the algorithm
impractical for very large datasets.
In this work, we strictly improve upon their algorithm on all three fronts;
we develop a $36$-approximation algorithm with center complexity
$\tilde{O}(kLower_{36, k}(X))$ that uses only $O(k)$ additional memory. In
addition to having nearly optimal memory, this algorithm is the first known
algorithm with center complexity bounded by $Lower_{36, k}(X)$ that is a true
$O(1)$-approximation with its approximation factor being independent of $k$ or
$n$.
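To make the objective and the streaming constraint concrete, the following minimal sketch (in Python, with hypothetical function names) computes the $k$-means cost and simulates a no-substitution stream: each arriving point is either taken as a center immediately or passed over forever. The distance-threshold rule used here is only an illustrative assumption, not the paper's algorithm, whose more careful selection criterion achieves center complexity $\tilde{O}(kLower_{36, k}(X))$ with $O(k)$ additional memory.

```python
# Illustrative sketch of the k-means objective and the no-substitution setting.
# The threshold rule below is a hypothetical placeholder, not the paper's method.
import random
from typing import List, Sequence

Point = Sequence[float]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_cost(points: List[Point], centers: List[Point]) -> float:
    """k-means objective: sum of squared distances to the nearest center."""
    return sum(min(sq_dist(x, c) for c in centers) for x in points)


def no_substitution_stream(stream: List[Point], threshold: float) -> List[Point]:
    """Toy no-substitution rule: when x_t arrives, take it as a center iff it is
    farther than `threshold` from every center taken so far; the decision is
    immediate and irrevocable."""
    centers: List[Point] = []
    for x in stream:
        if not centers or min(sq_dist(x, c) for c in centers) > threshold:
            centers.append(x)  # once taken, a center is never removed
    return centers


if __name__ == "__main__":
    random.seed(0)
    # Two well-separated clusters of 1-D points, streamed in random order.
    data = [(random.gauss(0.0, 0.1),) for _ in range(50)] + \
           [(random.gauss(5.0, 0.1),) for _ in range(50)]
    random.shuffle(data)
    centers = no_substitution_stream(data, threshold=1.0)
    print(f"centers taken: {len(centers)}, cost: {kmeans_cost(data, centers):.3f}")
```

On well-separated data the toy rule takes few centers, but on adversarial datasets any $O(1)$-approximation rule can be forced to take $\Omega(n)$ centers; that required growth is exactly what $Lower_{\alpha, k}(X)$ quantifies.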
Related papers
- A Scalable Algorithm for Individually Fair K-means Clustering [77.93955971520549]
We present a scalable algorithm for the individually fair ($p$, $k$)-clustering problem introduced by Jung et al. and Mahabadi et al.
A clustering is then called individually fair if it has centers within distance $\delta(x)$ of $x$ for each $x \in P$.
We show empirically that not only is our algorithm much faster than prior work, but it also produces lower-cost solutions.
arXiv Detail & Related papers (2024-02-09T19:01:48Z) - Sketching Algorithms for Sparse Dictionary Learning: PTAS and Turnstile
Streaming [48.18845814885398]
We develop new techniques to extend the applicability of sketching-based approaches to sparse dictionary learning and the Euclidean $k$-means clustering problems.
On the fast algorithms front, we obtain a new approach for designing PTAS's for the $k$-means clustering problem.
On the streaming algorithms front, we obtain new upper bounds and lower bounds for dictionary learning and $k$-means clustering.
arXiv Detail & Related papers (2023-10-29T16:46:26Z) - Simple, Scalable and Effective Clustering via One-Dimensional
Projections [10.807367640692021]
Clustering is a fundamental problem in unsupervised machine learning with many applications in data analysis.
We introduce a simple randomized clustering algorithm that provably runs in expected time $O(\mathrm{nnz}(X) + n\log n)$ for arbitrary $k$.
We prove that our algorithm achieves approximation ratio $\widetilde{O}(k^4)$ on any input dataset for the $k$-means objective.
arXiv Detail & Related papers (2023-10-25T16:37:45Z) - Do you know what q-means? [50.045011844765185]
Clustering is one of the most important tools for analysis of large datasets.
We present an improved version of the "$q$-means" algorithm for clustering.
We also present a "dequantized" algorithm for $\varepsilon$-$k$-means which runs in time $O\big(\frac{k^2}{\varepsilon^2}(\sqrt{k}d + \log(Nd))\big)$.
arXiv Detail & Related papers (2023-08-18T17:52:12Z) - Near-Optimal Bounds for Learning Gaussian Halfspaces with Random
Classification Noise [50.64137465792738]
We show that any efficient SQ algorithm for the problem requires sample complexity at least $\Omega(d^{1/2}/(\max\{p, \epsilon\})^2)$.
Our lower bound suggests that this quadratic dependence on $1/\epsilon$ is inherent for efficient algorithms.
arXiv Detail & Related papers (2023-07-13T18:59:28Z) - Near-Optimal Quantum Coreset Construction Algorithms for Clustering [15.513270929560088]
We give quantum algorithms that find coresets for $k$-clustering in $\mathbb{R}^d$ with $\tilde{O}(\sqrt{nk}d^{3/2})$ query complexity.
Our coreset reduces the input size from $n$ to $\mathrm{poly}(k\epsilon^{-1}d)$, so that existing $\alpha$-approximation algorithms for clustering can run on top of it.
arXiv Detail & Related papers (2023-06-05T12:22:46Z) - Learning a Latent Simplex in Input-Sparsity Time [58.30321592603066]
We consider the problem of learning a latent $k$-vertex simplex $K \subset \mathbb{R}^d$, given access to $A \in \mathbb{R}^{d \times n}$.
We show that the dependence on $k$ in the running time is unnecessary given a natural assumption about the mass of the top $k$ singular values of $A$.
arXiv Detail & Related papers (2021-05-17T16:40:48Z) - No-substitution k-means Clustering with Adversarial Order [8.706058694927613]
We introduce a new complexity measure that quantifies the difficulty of clustering a dataset arriving in arbitrary order.
Our new algorithm takes only $\mathrm{poly}(k\log(n))$ centers and is a $\mathrm{poly}(k)$-approximation.
arXiv Detail & Related papers (2020-12-28T22:35:44Z) - Streaming Complexity of SVMs [110.63976030971106]
We study the space complexity of solving the bias-regularized SVM problem in the streaming model.
We show that, for both problems, in low dimensions, one can obtain streaming algorithms with space polynomially smaller than $\frac{1}{\lambda\epsilon}$.
arXiv Detail & Related papers (2020-07-07T17:10:00Z) - Coreset-based Strategies for Robust Center-type Problems [0.6875312133832077]
We devise coreset-based strategies for the two problems which yield efficient sequential, MapReduce, and Streaming algorithms.
For wide ranges of the parameters $k, z, \epsilon, D$, we obtain a sequential algorithm with running time linear in $|V|$, and MapReduce/Streaming algorithms with few rounds/passes and substantially sublinear local/working memory.
arXiv Detail & Related papers (2020-02-18T10:04:08Z)