List-Decodable Mean Estimation in Nearly-PCA Time
- URL: http://arxiv.org/abs/2011.09973v1
- Date: Thu, 19 Nov 2020 17:21:37 GMT
- Title: List-Decodable Mean Estimation in Nearly-PCA Time
- Authors: Ilias Diakonikolas, Daniel M. Kane, Daniel Kongsgaard, Jerry Li, Kevin Tian
- Abstract summary: We study the fundamental task of list-decodable mean estimation in high dimensions.
Our algorithm runs in time $\widetilde{O}(ndk)$ for all $k = O(\sqrt{d}) \cup \Omega(d)$, where $n$ is the size of the dataset.
A variant of our algorithm has runtime $\widetilde{O}(ndk)$ for all $k$, at the expense of an $O(\sqrt{\log k})$ factor in the recovery guarantee.
- Score: 50.79691056481693
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditionally, robust statistics has focused on designing estimators tolerant
to a minority of contaminated data. Robust list-decodable learning focuses on
the more challenging regime where only a minority $\frac 1 k$ fraction of the
dataset is drawn from the distribution of interest, and no assumptions are made
on the remaining data. We study the fundamental task of list-decodable mean
estimation in high dimensions. Our main result is a new list-decodable mean
estimation algorithm for bounded covariance distributions with optimal sample
complexity and error rate, running in nearly-PCA time. Assuming the ground
truth distribution on $\mathbb{R}^d$ has bounded covariance, our algorithm
outputs a list of $O(k)$ candidate means, one of which is within distance
$O(\sqrt{k})$ from the truth. Our algorithm runs in time $\widetilde{O}(ndk)$
for all $k = O(\sqrt{d}) \cup \Omega(d)$, where $n$ is the size of the dataset.
We also show that a variant of our algorithm has runtime $\widetilde{O}(ndk)$
for all $k$, at the expense of an $O(\sqrt{\log k})$ factor in the recovery
guarantee. This runtime matches up to logarithmic factors the cost of
performing a single $k$-PCA on the data, which is a natural bottleneck of known
algorithms for (very) special cases of our problem, such as clustering
well-separated mixtures. Prior to our work, the fastest list-decodable mean
estimation algorithms had runtimes $\widetilde{O}(n^2 d k^2)$ and
$\widetilde{O}(nd k^{\ge 6})$.
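For concreteness, a single $k$-PCA of an $n \times d$ dataset computed by block power iteration touches the data only through $d \times k$ blocks, so each iteration costs $O(ndk)$; this is the baseline the runtime above matches up to logarithmic factors. A minimal numpy sketch of that baseline (an illustration only, not the paper's algorithm; the iteration count and seeding are arbitrary choices here):

```python
import numpy as np

def k_pca(X, k, iters=30, seed=0):
    """Top-k PCA of an (n, d) matrix via block power iteration.

    Each iteration multiplies X and X.T against a (d, k) block,
    so the per-iteration cost is O(ndk).
    """
    n, d = X.shape
    rng = np.random.default_rng(seed)
    Q = np.linalg.qr(rng.standard_normal((d, k)))[0]  # random orthonormal start
    Xc = X - X.mean(axis=0)                           # center the data
    for _ in range(iters):
        Z = Xc.T @ (Xc @ Q)      # apply the covariance implicitly: O(ndk)
        Q = np.linalg.qr(Z)[0]   # re-orthonormalize the block
    return Q                     # columns approximate the top-k principal directions

# example: 10,000 points in 500 dimensions, k = 20
V = k_pca(np.random.default_rng(1).standard_normal((10000, 500)), k=20)
```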
Our approach builds on a novel soft downweighting method, $\mathsf{SIFT}$,
which is arguably the simplest known polynomial-time mean estimation technique
in the list-decodable learning setting. To develop our fast algorithms, we
improve the computational cost of $\mathsf{SIFT}$ via a careful "win-win-win"
analysis of an approximate Ky Fan matrix multiplicative weights procedure we
develop, which we believe may be of independent interest.
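The full specification of $\mathsf{SIFT}$ is in the paper; as a rough illustration of the soft downweighting idea, here is a hedged caricature (an illustrative simplification with an arbitrary stopping threshold, not the authors' procedure): maintain nonnegative weights over the points, repeatedly find the top direction of the weighted covariance, and shrink the weights of points with large squared projection along it, recording candidate means along the way.

```python
import numpy as np

def soft_filter_candidates(X, iters=50, var_stop=10.0, seed=0):
    """Caricature of soft downweighting for list-decodable mean estimation.

    NOT the paper's SIFT algorithm -- an illustrative sketch only.
    While the weighted data has a high-variance direction, downweight
    points in proportion to their squared projection along it, and
    record the weighted mean at each step as a candidate.
    """
    n, d = X.shape
    rng = np.random.default_rng(seed)
    w = np.ones(n) / n
    candidates = []
    for _ in range(iters):
        mu = (w / w.sum()) @ X               # current weighted mean
        candidates.append(mu)
        Y = X - mu
        v = rng.standard_normal(d)           # top direction via power iteration
        for _ in range(20):
            v = Y.T @ (w * (Y @ v))
            v /= np.linalg.norm(v)
        proj2 = (Y @ v) ** 2                 # squared projections onto v
        if (w * proj2).sum() / w.sum() <= var_stop:
            break                            # variance small: weights look clean
        w = w * (1.0 - proj2 / proj2.max())  # soft downweighting step
    return candidates
```

In the list-decodable regime the inliers are a minority, so a single run is not enough; one would launch such a filter from many reweightings and prune the collected means down to a list of $O(k)$ candidates, which is the part the paper's analysis handles.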
Related papers
- Statistical-Computational Trade-offs for Density Estimation [60.81548752871115]
We show that for a broad class of data structures their bounds cannot be significantly improved.
This is a novel statistical-computational trade-off for density estimation.
arXiv Detail & Related papers (2024-10-30T15:03:33Z) - Simple, Scalable and Effective Clustering via One-Dimensional Projections [10.807367640692021]
Clustering is a fundamental problem in unsupervised machine learning with many applications in data analysis.
We introduce a simple randomized clustering algorithm that provably runs in expected time $O(\mathrm{nnz}(X) + n\log n)$ for arbitrary $k$.
We prove that our algorithm achieves approximation ratio $\widetilde{O}(k^4)$ on any input dataset for the $k$-means objective.
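As a sketch of the one-dimensional flavor (not the cited paper's provable algorithm, which uses more careful projections; the gap-cutting rule here is an illustrative stand-in): project the points onto one random direction, sort the projections, and cut at the $k-1$ widest gaps, which fits within an $O(\mathrm{nnz}(X) + n\log n)$ budget.

```python
import numpy as np

def project_and_split(X, k, seed=0):
    """Illustrative clustering via a single random 1-D projection.

    Not the cited paper's algorithm: project onto a random direction,
    sort the projections, and cut at the k-1 widest gaps.
    """
    n, d = X.shape
    if k < 2:
        return np.zeros(n, dtype=int)
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(d)
    u /= np.linalg.norm(u)
    p = X @ u                                    # 1-D projections: O(nnz(X))
    order = np.argsort(p)                        # O(n log n)
    gaps = np.diff(p[order])                     # n-1 consecutive gaps
    cuts = np.sort(np.argsort(gaps)[-(k - 1):])  # indices of the k-1 widest gaps
    labels = np.empty(n, dtype=int)
    for c, chunk in enumerate(np.split(order, cuts + 1)):
        labels[chunk] = c                        # one label per contiguous run
    return labels
```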
arXiv Detail & Related papers (2023-10-25T16:37:45Z) - Efficiently Learning One-Hidden-Layer ReLU Networks via Schur Polynomials [50.90125395570797]
We study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $\mathbb{R}^d$ with respect to the square loss.
Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/\epsilon)^{O(k)}$, where $\epsilon > 0$ is the target accuracy.
arXiv Detail & Related papers (2023-07-24T14:37:22Z) - Replicable Clustering [57.19013971737493]
We propose algorithms for the statistical $k$-medians, statistical $k$-means, and statistical $k$-centers problems by utilizing approximation routines for their counterparts in a black-box manner.
We also provide experiments on synthetic distributions in 2D using the $k$-means++ implementation from sklearn as a black-box that validate our theoretical results.
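The sklearn black box referred to above is the stock estimator; a minimal example of that usage (the synthetic 2D mixture below is a stand-in for the paper's distributions, and the replicability wrapper itself is not shown):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D mixture of three well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(200, 2))
               for c in [(-2, 0), (0, 2), (2, 0)]])

# k-means++ from sklearn, used as a black-box approximation routine.
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
centers, labels = km.cluster_centers_, km.labels_
```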
arXiv Detail & Related papers (2023-02-20T23:29:43Z) - List-Decodable Sparse Mean Estimation via Difference-of-Pairs Filtering [42.526664955704746]
We develop a novel, conceptually simpler technique for list-decodable sparse mean estimation.
In particular, for distributions with "certifiably bounded" $t$-th moments in $k$-sparse directions, our algorithm achieves error $(1/\alpha)^{O(1/t)}$ with sample complexity $m = (k\log(n))^{O(t)}/\alpha$ and running time $\mathrm{poly}(mn^t)$.
For the special case of Gaussian inliers, our algorithm achieves the optimal error guarantee of $\Theta(\sqrt{\log(1/\alpha)})$.
arXiv Detail & Related papers (2022-06-10T17:38:18Z) - List-Decodable Sparse Mean Estimation [7.594050968868919]
A surge of recent research interest has focused on the list-decodable setting where $\alpha \in (0, \frac{1}{2}]$, and the goal is to output a finite number of estimates among which at least one approximates the target mean.
In this paper, we consider the setting where the underlying target is a distribution with a $k$-sparse mean; the first contribution is an algorithm with sample complexity $O\big(\mathrm{poly}(k, \log d)\big)$, i.e., poly-logarithmic in the dimension $d$.
arXiv Detail & Related papers (2022-05-28T05:13:45Z) - TURF: A Two-factor, Universal, Robust, Fast Distribution Learning Algorithm [64.13217062232874]
One of its most powerful and successful modalities approximates every distribution to an $\ell_1$ distance essentially at most a constant times larger than that of its closest $t$-piece degree-$d$ polynomial.
We provide a method that estimates this number near-optimally, hence helps approach the best possible approximation.
arXiv Detail & Related papers (2022-02-15T03:49:28Z) - Clustering Mixture Models in Almost-Linear Time via List-Decodable Mean Estimation [58.24280149662003]
We study the problem of list-decodable mean estimation, where an adversary can corrupt a majority of the dataset.
We develop new algorithms for list-decodable mean estimation, achieving nearly-optimal statistical guarantees.
arXiv Detail & Related papers (2021-06-16T03:34:14Z) - List-Decodable Mean Estimation via Iterative Multi-Filtering [44.805549762166926]
We are given a set $T$ of points in $\mathbb{R}^d$ with the promise that an unknown $\alpha$-fraction of the points in $T$ are drawn from a distribution $D$ with unknown mean and bounded covariance.
We output a small list of hypothesis vectors such that at least one of them is close to the mean of $D$.
In more detail, our algorithm is sample and computationally efficient, and achieves information-theoretically near-optimal error.
arXiv Detail & Related papers (2020-06-18T17:47:37Z)