No-substitution k-means Clustering with Adversarial Order
- URL: http://arxiv.org/abs/2012.14512v1
- Date: Mon, 28 Dec 2020 22:35:44 GMT
- Title: No-substitution k-means Clustering with Adversarial Order
- Authors: Robi Bhattacharjee and Michal Moshkovitz
- Abstract summary: We introduce a new complexity measure that quantifies the difficulty of clustering a dataset arriving in arbitrary order.
Our new algorithm takes only $\text{poly}(k\log(n))$ centers and is a $\text{poly}(k)$-approximation.
- Score: 8.706058694927613
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate $k$-means clustering in the online no-substitution setting
when the input arrives in \emph{arbitrary} order. In this setting, points
arrive one after another, and the algorithm is required to instantly decide
whether to take the current point as a center before observing the next point.
Decisions are irrevocable. The goal is to minimize both the number of centers
and the $k$-means cost. Previous works in this setting assume that the input's
order is random, or that the input's aspect ratio is bounded. It is known that
if the order is arbitrary and there is no assumption on the input, then any
algorithm must take all points as centers. Moreover, assuming a bounded aspect
ratio is too restrictive -- it does not include natural input generated from
mixture models.
We introduce a new complexity measure that quantifies the difficulty of
clustering a dataset arriving in arbitrary order. We design a new random
algorithm and prove that if applied on data with complexity $d$, the algorithm
takes $O(d\log(n) k\log(k))$ centers and is an $O(k^3)$-approximation. We also
prove that if the data is sampled from a ``natural'' distribution, such as a
mixture of $k$ Gaussians, then the new complexity measure is equal to
$O(k^2\log(n))$. This implies that for data generated from those distributions,
our new algorithm takes only $\text{poly}(k\log(n))$ centers and is a
$\text{poly}(k)$-approximation. In terms of negative results, we prove that the
number of centers needed to achieve an $\alpha$-approximation is at least
$\Omega\left(\frac{d}{k\log(n\alpha)}\right)$.
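As a reading aid, the following is a minimal Python sketch of the no-substitution setting the abstract describes. It only illustrates the streaming interface (instant, irrevocable take/skip decisions) and the standard $k$-means cost; the decision rule `should_take` is a hypothetical placeholder, not the paper's algorithm, which relies on the new complexity measure $d$.

```python
# Minimal sketch of the online no-substitution setting described above,
# NOT the paper's algorithm: points arrive one at a time, and before seeing
# the next point the algorithm must irrevocably decide whether to keep the
# current point as a center. `should_take` is a hypothetical placeholder.
import numpy as np

def no_substitution_clustering(stream, should_take):
    """One pass over `stream` (an iterable of 1-D numpy arrays).

    `should_take(point, centers)` decides instantly whether the current
    point becomes a center; decisions are never revisited.
    """
    centers = []
    for x in stream:
        if should_take(x, centers):  # instant, irrevocable decision
            centers.append(x)
    return centers

def kmeans_cost(points, centers):
    """Standard k-means cost: sum of squared distances to the nearest center."""
    C = np.asarray(centers)
    return sum(float(np.min(np.sum((C - x) ** 2, axis=1))) for x in points)

# Toy usage with a naive distance-threshold rule, shown only to make the
# interface concrete; it carries none of the paper's guarantees.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = [rng.normal(size=2) + 10.0 * (i % 3) for i in range(300)]
    far = lambda x, cs: (len(cs) == 0 or
                         np.min(np.sum((np.asarray(cs) - x) ** 2, axis=1)) > 4.0)
    centers = no_substitution_clustering(data, far)
    print(f"{len(centers)} centers, cost {kmeans_cost(data, centers):.2f}")
```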
Related papers
- A Scalable Algorithm for Individually Fair K-means Clustering [77.93955971520549]
We present a scalable algorithm for the individually fair ($p$, $k$)-clustering problem introduced by Jung et al. and Mahabadi et al.
A clustering is then called individually fair if it has centers within distance $\delta(x)$ of $x$ for each $x \in P$.
We show empirically that not only is our algorithm much faster than prior work, but it also produces lower-cost solutions.
arXiv Detail & Related papers (2024-02-09T19:01:48Z) - Do you know what q-means? [50.045011844765185]
Clustering is one of the most important tools for analysis of large datasets.
We present an improved version of the "$q$-means" algorithm for clustering.
We also present a "dequantized" algorithm for $\varepsilon$-$k$-means which runs in $O\big(\frac{k^2}{\varepsilon^2}(\sqrt{k}d + \log(Nd))\big)$ time.
arXiv Detail & Related papers (2023-08-18T17:52:12Z) - Near-Optimal Bounds for Learning Gaussian Halfspaces with Random
Classification Noise [50.64137465792738]
We show that any efficient SQ algorithm for the problem requires sample complexity at least $\Omega(d^{1/2}/(\max\{p, \epsilon\})^2)$.
Our lower bound suggests that this quadratic dependence on $1/epsilon$ is inherent for efficient algorithms.
arXiv Detail & Related papers (2023-07-13T18:59:28Z) - Improved Learning-augmented Algorithms for k-means and k-medians
Clustering [8.04779839951237]
We consider the problem of clustering in the learning-augmented setting, where we are given a data set in $d$-dimensional Euclidean space.
We propose a deterministic $k$-means algorithm that produces centers with improved bound on clustering cost.
Our algorithm works even when the predictions are not very accurate, i.e., our bound holds for $\alpha$ up to $1/2$, an improvement over $\alpha$ being at most $1/7$ in the previous work.
arXiv Detail & Related papers (2022-10-31T03:00:11Z) - List-Decodable Sparse Mean Estimation via Difference-of-Pairs Filtering [42.526664955704746]
We develop a novel, conceptually simpler technique for list-decodable sparse mean estimation.
In particular, for distributions with "certifiably bounded" $t$-th moments in $k$-sparse directions, our algorithm achieves error of $(1/\alpha)^{O(1/t)}$ with sample complexity $m = (k\log(n))^{O(t)}/\alpha$ and running time $\mathrm{poly}(mn^t)$.
For the special case of Gaussian inliers, our algorithm achieves the optimal error guarantee of $\Theta(\sqrt{\log(1/\alpha)})$.
arXiv Detail & Related papers (2022-06-10T17:38:18Z) - Robust Sparse Mean Estimation via Sum of Squares [42.526664955704746]
We study the problem of high-dimensional sparse mean estimation in the presence of an $\epsilon$-fraction of adversarial outliers.
Our algorithms follow the Sum-of-Squares based proofs-to-algorithms approach.
arXiv Detail & Related papers (2022-06-07T16:49:54Z) - Clustering Mixture Models in Almost-Linear Time via List-Decodable Mean
Estimation [58.24280149662003]
We study the problem of list-decodable mean estimation, where an adversary can corrupt a majority of the dataset.
We develop new algorithms for list-decodable mean estimation, achieving nearly-optimal statistical guarantees.
arXiv Detail & Related papers (2021-06-16T03:34:14Z) - No-Substitution $k$-means Clustering with Low Center Complexity and
Memory [5.837881923712394]
We develop an algorithm with center complexity bounded by $Lower_{36,k}(X)$ that is a true $O(1)$-approximation, with an approximation factor independent of $k$ and $n$.
It is the first known algorithm with these guarantees.
arXiv Detail & Related papers (2021-02-18T01:20:03Z) - List-Decodable Mean Estimation in Nearly-PCA Time [50.79691056481693]
We study the fundamental task of list-decodable mean estimation in high dimensions.
Our algorithm runs in time $\widetilde{O}(ndk)$ for all $k = O(\sqrt{d}) \cup \Omega(d)$, where $n$ is the size of the dataset.
A variant of our algorithm has runtime $\widetilde{O}(ndk)$ for all $k$, at the expense of an $O(\sqrt{\log k})$ factor in the recovery guarantee.
arXiv Detail & Related papers (2020-11-19T17:21:37Z) - Locally Private Hypothesis Selection [96.06118559817057]
We output a distribution from $\mathcal{Q}$ whose total variation distance to $p$ is comparable to the best such distribution.
We show that the constraint of local differential privacy incurs an exponential increase in cost.
Our algorithms result in exponential improvements on the round complexity of previous methods.
arXiv Detail & Related papers (2020-02-21T18:30:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.