Clustering Mixture Models in Almost-Linear Time via List-Decodable Mean
Estimation
- URL: http://arxiv.org/abs/2106.08537v1
- Date: Wed, 16 Jun 2021 03:34:14 GMT
- Title: Clustering Mixture Models in Almost-Linear Time via List-Decodable Mean
Estimation
- Authors: Ilias Diakonikolas, Daniel M. Kane, Daniel Kongsgaard, Jerry Li, Kevin
Tian
- Abstract summary: We study the problem of list-decodable mean estimation, where an adversary can corrupt a majority of the dataset.
We develop new algorithms for list-decodable mean estimation, achieving nearly-optimal statistical guarantees.
- Score: 58.24280149662003
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the problem of list-decodable mean estimation, where an adversary
can corrupt a majority of the dataset. Specifically, we are given a set $T$ of
$n$ points in $\mathbb{R}^d$ and a parameter $0< \alpha <\frac 1 2$ such that
an $\alpha$-fraction of the points in $T$ are i.i.d. samples from a
well-behaved distribution $\mathcal{D}$ and the remaining $(1-\alpha)$-fraction
of the points are arbitrary. The goal is to output a small list of vectors at
least one of which is close to the mean of $\mathcal{D}$. As our main
contribution, we develop new algorithms for list-decodable mean estimation,
achieving nearly-optimal statistical guarantees, with running time $n^{1 +
o(1)} d$. All prior algorithms for this problem had additional polynomial
factors in $\frac 1 \alpha$. As a corollary, we obtain the first almost-linear
time algorithms for clustering mixtures of $k$ separated well-behaved
distributions, nearly-matching the statistical guarantees of spectral methods.
Prior clustering algorithms inherently relied on an application of $k$-PCA,
thereby incurring runtimes of $\Omega(n d k)$. This marks the first runtime
improvement for this basic statistical problem in nearly two decades.
The starting point of our approach is a novel and simpler near-linear time
robust mean estimation algorithm in the $\alpha \to 1$ regime, based on a
one-shot matrix multiplicative weights-inspired potential decrease. We
crucially leverage this new algorithmic framework in the context of the
iterative multi-filtering technique of Diakonikolas et al. '18, '20, providing
a method to simultaneously cluster and downsample points using one-dimensional
projections, thus bypassing the $k$-PCA subroutines required by prior
algorithms.
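To make the one-dimensional-projection idea concrete, below is a minimal sketch on synthetic data: project the points onto random unit directions, find the narrowest window containing an $\alpha$-fraction of the projections, and average the points that land inside it to obtain a short list of candidate means. This is a toy illustration only, with all data and parameters invented here; the paper's actual algorithm combines iterative multi-filtering with the matrix multiplicative weights-inspired potential argument described above.

```python
# Toy sketch (NOT the paper's algorithm): list-decodable mean estimation
# on synthetic data via random one-dimensional projections. For each random
# unit direction, take the narrowest window of projections containing an
# alpha-fraction of the points and average the points in it, yielding a
# short list of candidate means.
import numpy as np

rng = np.random.default_rng(0)

def candidate_means(T, alpha, num_directions=50):
    """Return one candidate mean per random direction."""
    n, d = T.shape
    m = max(2, int(alpha * n))  # window size: an alpha-fraction of points
    candidates = []
    for _ in range(num_directions):
        v = rng.standard_normal(d)
        v /= np.linalg.norm(v)
        proj = T @ v
        order = np.argsort(proj)
        s = proj[order]
        widths = s[m - 1:] - s[: n - m + 1]  # width of every m-point window
        i = int(np.argmin(widths))           # densest window
        candidates.append(T[order[i:i + m]].mean(axis=0))
    return candidates

# Synthetic instance: an alpha-fraction of inliers from N(mu, I_d); the
# remaining points are spread uniformly and play the role of corruptions.
n, d, alpha = 2000, 50, 0.2
mu = np.full(d, 5.0)
n_in = int(alpha * n)
inliers = rng.standard_normal((n_in, d)) + mu
outliers = rng.uniform(-50.0, 50.0, size=(n - n_in, d))
T = np.vstack([inliers, outliers])

cands = candidate_means(T, alpha)
best = min(np.linalg.norm(c - mu) for c in cands)
print(f"best of {len(cands)} candidates is {best:.2f} from the true mean "
      f"(||mu|| = {np.linalg.norm(mu):.2f})")
```

Each candidate averages mostly inliers (the densest 1D window) plus a few contaminating points, so the best candidate in the list lands near $\mu$; the real algorithm must also prune and downsample across iterations to get its almost-linear runtime.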
Related papers
- A Scalable Algorithm for Individually Fair K-means Clustering [77.93955971520549]
We present a scalable algorithm for the individually fair ($p$, $k$)-clustering problem introduced by Jung et al. and Mahabadi et al.
A clustering is then called individually fair if it has a center within distance $\delta(x)$ of every point $x \in P$ (a toy check of this condition appears after this list).
We show empirically that not only is our algorithm much faster than prior work, but it also produces lower-cost solutions.
arXiv Detail & Related papers (2024-02-09T19:01:48Z)
- Simple, Scalable and Effective Clustering via One-Dimensional Projections [10.807367640692021]
Clustering is a fundamental problem in unsupervised machine learning with many applications in data analysis.
We introduce a simple randomized clustering algorithm that provably runs in expected time $O(\mathrm{nnz}(X) + n \log n)$ for arbitrary $k$; a toy sketch of the projection idea follows below.
We prove that our algorithm achieves approximation ratio $\widetilde{O}(k^4)$ on any input dataset for the $k$-means objective.
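As referenced above, here is a minimal sketch of clustering through a single one-dimensional projection, assuming a simple largest-gap cutting rule (an invented simplification, not the cited paper's method): the projection costs time proportional to $\mathrm{nnz}(X)$ for sparse $X$ and the sort costs $O(n \log n)$.

```python
# Toy sketch (an invented simplification, NOT the cited algorithm):
# cluster by projecting onto one random direction and cutting the sorted
# projections at the k-1 widest gaps. A single random direction can fail
# to separate clusters; real algorithms are more careful.
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

def one_dim_projection_clustering(X, k):
    """X: (n, d) scipy sparse matrix; returns integer labels in [0, k)."""
    n, d = X.shape
    if k < 2:
        return np.zeros(n, dtype=int)
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    proj = X @ v                                  # O(nnz(X))
    order = np.argsort(proj)                      # O(n log n)
    gaps = np.diff(proj[order])
    cuts = np.sort(np.argsort(gaps)[-(k - 1):])   # k-1 widest gaps
    labels = np.empty(n, dtype=int)
    for c, chunk in enumerate(np.split(order, cuts + 1)):
        labels[chunk] = c
    return labels

# Demo: two separated Gaussian blobs, converted to CSR for the demo.
pts = np.vstack([rng.standard_normal((100, 20)),
                 rng.standard_normal((100, 20)) + 8.0])
labels = one_dim_projection_clustering(sparse.csr_matrix(pts), k=2)
print(np.bincount(labels))  # ideally two groups of ~100 each
```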
arXiv Detail & Related papers (2023-10-25T16:37:45Z)
- Do you know what q-means? [50.045011844765185]
Clustering is one of the most important tools for analysis of large datasets.
We present an improved version of the "$q$-means" algorithm for clustering.
We also present a "dequantized" algorithm for $\varepsilon$-$k$-means which runs in $O\big(\frac{k^2}{\varepsilon^2}(\sqrt{k}d + \log(Nd))\big)$ time.
arXiv Detail & Related papers (2023-08-18T17:52:12Z)
- Replicable Clustering [57.19013971737493]
We propose algorithms for the statistical $k$-medians, statistical $k$-means, and statistical $k$-centers problems by utilizing approximation routines for their counterparts in a black-box manner.
We also provide experiments on synthetic distributions in 2D, using the $k$-means++ implementation from sklearn as a black box, which validate our theoretical results.
arXiv Detail & Related papers (2023-02-20T23:29:43Z)
- Private estimation algorithms for stochastic block models and mixture models [63.07482515700984]
We give general tools for designing efficient private estimation algorithms.
We obtain the first efficient $(\epsilon, \delta)$-differentially private algorithm for both weak recovery and exact recovery.
arXiv Detail & Related papers (2023-01-11T09:12:28Z)
- List-Decodable Sparse Mean Estimation via Difference-of-Pairs Filtering [42.526664955704746]
We develop a novel, conceptually simpler technique for list-decodable sparse mean estimation.
In particular, for distributions with "certifiably bounded" $t$-th moments in $k$-sparse directions, our algorithm achieves error $(1/\alpha)^{O(1/t)}$ with sample complexity $m = (k\log(n))^{O(t)}/\alpha$ and running time $\mathrm{poly}(mn^t)$.
For the special case of Gaussian inliers, our algorithm achieves the optimal error guarantee of $\Theta(\sqrt{\log(1/\alpha)})$.
arXiv Detail & Related papers (2022-06-10T17:38:18Z)
- List-Decodable Sparse Mean Estimation [7.594050968868919]
A surge of recent research interest has focused on the list-decodable setting where $\alpha \in (0, \frac{1}{2}]$, and the goal is to output a finite number of estimates among which at least one approximates the target mean.
In this paper, we consider the setting where the target mean is $k$-sparse; our first contribution is an algorithm with sample complexity $O\big(\mathrm{poly}(k, \log d)\big)$, i.e. poly-logarithmic in the ambient dimension.
arXiv Detail & Related papers (2022-05-28T05:13:45Z)
- List-Decodable Mean Estimation in Nearly-PCA Time [50.79691056481693]
We study the fundamental task of list-decodable mean estimation in high dimensions.
Our algorithm runs in time $\widetilde{O}(ndk)$ for all $k = O(\sqrt{d}) \cup \Omega(d)$, where $n$ is the size of the dataset.
A variant of our algorithm has runtime $\widetilde{O}(ndk)$ for all $k$, at the expense of an $O(\sqrt{\log k})$ factor in the recovery guarantee.
arXiv Detail & Related papers (2020-11-19T17:21:37Z)
- List-Decodable Mean Estimation via Iterative Multi-Filtering [44.805549762166926]
We are given a set $T$ of points in $\mathbb{R}^d$ with the promise that an unknown $\alpha$-fraction of the points in $T$ are drawn from a distribution $D$ with unknown mean and bounded covariance.
We output a small list of hypothesis vectors such that at least one of them is close to the mean of $D$.
In more detail, our algorithm is sample and computationally efficient, and achieves information-theoretically near-optimal error.
arXiv Detail & Related papers (2020-06-18T17:47:37Z)
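As promised in the individually fair clustering entry above, here is a minimal feasibility check. It assumes the radius convention from the Jung et al. line of work, where $\delta(x)$ is the distance from $x$ to its $\lceil n/k \rceil$-th nearest neighbor; the function names and data below are invented for this sketch.

```python
# Toy feasibility check for individual fairness: delta(x) is taken to be
# the distance from x to its ceil(n/k)-th nearest neighbor (an assumed
# convention, following Jung et al.), and a set of centers is individually
# fair if every point has a center within its own radius.
import math
import numpy as np

def fairness_radii(P, k):
    """delta(x) for each row x of P (O(n^2) pairwise distances)."""
    n = len(P)
    dists = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
    dists.sort(axis=1)                 # column 0 is x itself (distance 0)
    return dists[:, math.ceil(n / k)]  # ceil(n/k)-th nearest neighbor

def is_individually_fair(P, centers, k):
    """True iff every x in P has some center within distance delta(x)."""
    delta = fairness_radii(P, k)
    to_center = np.linalg.norm(P[:, None, :] - centers[None, :, :], axis=2)
    return bool(np.all(to_center.min(axis=1) <= delta))

rng = np.random.default_rng(1)
P = rng.standard_normal((200, 2))
centers = P[rng.choice(len(P), size=5, replace=False)]
print(is_individually_fair(P, centers, k=5))
```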
This list is automatically generated from the titles and abstracts of the papers on this site.