Outlier-Robust Clustering of Non-Spherical Mixtures
- URL: http://arxiv.org/abs/2005.02970v3
- Date: Mon, 14 Dec 2020 18:00:59 GMT
- Title: Outlier-Robust Clustering of Non-Spherical Mixtures
- Authors: Ainesh Bakshi and Pravesh Kothari
- Abstract summary: We give the first outlier-robust efficient algorithm for clustering a mixture of $k$ statistically separated $d$-dimensional Gaussians ($k$-GMMs).
Our results extend to clustering mixtures of arbitrary affine transforms of the uniform distribution on the $d$-dimensional unit sphere.
- Score: 5.863264019032882
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We give the first outlier-robust efficient algorithm for clustering a mixture
of $k$ statistically separated $d$-dimensional Gaussians ($k$-GMMs). Concretely,
our algorithm takes as input an $\epsilon$-corrupted sample from a $k$-GMM and, with high
probability, in $d^{\text{poly}(k/\eta)}$ time outputs an approximate clustering that
misclassifies at most a $k^{O(k)}(\epsilon+\eta)$ fraction of the points whenever
every pair of mixture components is separated by
$1-\exp(-\text{poly}(k/\eta)^k)$ in total variation (TV) distance. Such a
result was not previously known even for $k=2$. TV separation is the
statistically weakest possible notion of separation and captures important
special cases such as mixed linear regression and subspace clustering.
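For concreteness, the input model can be simulated directly. Below is a minimal, illustrative sketch (not the paper's algorithm) of drawing an $\epsilon$-corrupted sample from a 2-GMM and scoring a candidate clustering by its misclassification fraction; all names and parameter choices are ours.

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(0)
d, n, eps = 20, 2000, 0.05  # dimension, sample size, corruption rate

# A 2-GMM with distinct means and non-spherical covariances.
mu = [np.zeros(d), 4.0 * np.ones(d)]
cov = [np.eye(d), np.diag(1.0 + np.arange(d) / d)]
labels = rng.integers(0, 2, size=n)
X = np.stack([rng.multivariate_normal(mu[z], cov[z]) for z in labels])

# eps-corruption: an adversary may replace an eps fraction of points
# arbitrarily; here we simply plant a tight far-away blob of outliers.
m = int(eps * n)
X[:m] = 100.0 + rng.standard_normal((m, d))

def misclassification(pred, true, k=2):
    """Fraction of points misclassified, minimised over relabelings."""
    return min(np.mean(pred != np.asarray(p)[true])
               for p in permutations(range(k)))
```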
Our main conceptual contribution is to distill simple analytic properties,
namely (certifiable) hypercontractivity and bounded variance of degree-2
polynomials, and anti-concentration of linear projections, that are necessary
and sufficient for mixture models to be (efficiently) clusterable. As a
consequence, our results extend to clustering mixtures of arbitrary affine
transforms of the uniform distribution on the $d$-dimensional unit sphere. Even
the information-theoretic clusterability of separated distributions satisfying
these two analytic assumptions was not known prior to our work and is likely to
be of independent interest.
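Both properties are easy to sanity-check empirically for a given distribution. The following rough Monte Carlo check (our illustration; the paper requires certifiable, i.e. sum-of-squares-provable, versions of these bounds) verifies them for the standard Gaussian:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 10, 200_000
X = rng.standard_normal((n, d))  # the standard Gaussian satisfies both

# Hypercontractivity / bounded variance of degree-2 polynomials:
# for centered p(x) = x' A x, E[p^4] <= C * E[p^2]^2 for a constant C.
A = rng.standard_normal((d, d)); A = (A + A.T) / 2
p = np.einsum('ni,ij,nj->n', X, A, X)
p -= p.mean()
print("E[p^4] / E[p^2]^2 =", np.mean(p**4) / np.mean(p**2) ** 2)

# Anti-concentration of linear projections: for any unit vector v and
# small delta, Pr[|<v, X>| <= delta] = O(delta).
v = rng.standard_normal(d); v /= np.linalg.norm(v)
for delta in (0.1, 0.01):
    print(delta, np.mean(np.abs(X @ v) <= delta))
```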
Our algorithms build on the recent sequence of works relying on certifiable
anti-concentration, first introduced in the works of Karmalkar, Klivans, and
Kothari and of Raghavendra and Yau in 2019. Our techniques expand the
sum-of-squares toolkit to show robust certifiability of TV-separated Gaussian
clusters in data. This involves giving a low-degree sum-of-squares proof of
statements that relate parameter distance (i.e., distance between means and
covariances) to total variation distance by relying only on hypercontractivity
and anti-concentration.
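As a point of reference for these parameter-to-TV statements, in the special case of a shared covariance the relationship is classical and exact; a small sketch follows (the general mean-and-covariance case is what the paper certifies in sum-of-squares):

```python
import numpy as np
from scipy.stats import norm

# For N(mu1, Sigma) vs N(mu2, Sigma), TV = 2*Phi(delta/2) - 1, where
# delta is the Mahalanobis distance between the means: large parameter
# distance forces TV distance close to 1.
def tv_equal_cov(mu1, mu2, Sigma):
    diff = mu2 - mu1
    delta = np.sqrt(diff @ np.linalg.solve(Sigma, diff))
    return 2 * norm.cdf(delta / 2) - 1

Sigma = np.diag([1.0, 4.0])
for t in (0.5, 2.0, 8.0):
    print(t, round(tv_equal_cov(np.zeros(2), np.array([t, 0.0]), Sigma), 4))
```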
Related papers
- Dimension Reduction via Sum-of-Squares and Improved Clustering Algorithms for Non-Spherical Mixtures [5.668124846154999]
We develop a new approach for clustering non-spherical (i.e., arbitrary component covariances) Gaussian mixture models via a subroutine.
Our method gives a non-spherical analog of the classical dimension reduction, based on singular value decomposition.
Our algorithms naturally extend to tolerating a dimension-independent fraction of arbitrary outliers.
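For context, the classical dimension reduction this summary alludes to projects onto a top singular subspace. A minimal sketch of that spherical baseline follows (the paper's contribution is its non-spherical analog, which this does not implement):

```python
import numpy as np

def classical_dimension_reduction(X, k):
    # Classical reduction for *spherical* mixtures: project the
    # centered data onto its top-k right singular subspace, which
    # approximately contains the span of the component means.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return X @ Vt[:k].T  # n x k projected data
```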
arXiv Detail & Related papers (2024-11-19T11:58:51Z) - Learning general Gaussian mixtures with efficient score matching [16.06356123715737]
We study the problem of learning mixtures of $k$ Gaussians in $d$ dimensions.
We make no separation assumptions on the underlying mixture components.
We give an algorithm that draws $d^{\mathrm{poly}(k/\varepsilon)}$ samples from the target mixture, runs in time polynomial in the number of samples, and constructs a sampler.
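Score matching itself can be illustrated independently of the paper's guarantees. Here is a generic denoising score matching loss (Vincent-style), with `score_fn` a caller-supplied score model; this shows the general technique, not the paper's estimator:

```python
import numpy as np

def dsm_loss(score_fn, X, sigma, rng):
    # Denoising score matching: perturb data with Gaussian noise and
    # regress the model score onto the score of the noising kernel,
    # which equals -(x_noisy - x) / sigma^2.
    noise = rng.standard_normal(X.shape)
    X_noisy = X + sigma * noise
    target = -noise / sigma
    pred = score_fn(X_noisy)
    return np.mean(np.sum((pred - target) ** 2, axis=1))
```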
arXiv Detail & Related papers (2024-04-29T17:30:36Z) - Clustering Mixtures of Bounded Covariance Distributions Under Optimal Separation [44.25945344950543]
We study the clustering problem for mixtures of bounded covariance distributions.
We give the first poly-time algorithm for this clustering task.
Our algorithm is robust to an $\Omega(\alpha)$-fraction of adversarial outliers.
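A crude baseline for clustering in the presence of adversarial outliers, just to fix ideas (this is not the paper's algorithm and enjoys none of its guarantees):

```python
import numpy as np
from sklearn.cluster import KMeans

def trimmed_kmeans(X, k, trim=0.05, iters=5, seed=0):
    # Alternate k-means with discarding the farthest `trim` fraction of
    # points: a naive defense against a small fraction of outliers.
    keep = np.ones(len(X), dtype=bool)
    for _ in range(iters):
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[keep])
        dist = np.min(km.transform(X), axis=1)  # distance to nearest center
        keep = dist <= np.quantile(dist, 1 - trim)
    return km, keep
```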
arXiv Detail & Related papers (2023-12-19T01:01:53Z) - A Robust and Flexible EM Algorithm for Mixtures of Elliptical Distributions with Missing Data [71.9573352891936]
This paper tackles the problem of missing data imputation for noisy and non-Gaussian data.
A new EM algorithm for mixtures of elliptical distributions is investigated, designed to handle potentially missing data.
Experimental results on synthetic data demonstrate that the proposed algorithm is robust to outliers and can be used with non-Gaussian data.
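The key E-step ingredient for mixtures with missing data is conditional imputation. A minimal Gaussian version is sketched below (the paper's EM additionally handles general elliptical components, which this does not):

```python
import numpy as np

def conditional_impute(x, mu, Sigma):
    # Replace the missing coordinates (NaNs) of x by their conditional
    # expectation given the observed ones, under N(mu, Sigma):
    # mu_m + Sigma_mo Sigma_oo^{-1} (x_o - mu_o).
    miss = np.isnan(x)
    if not miss.any():
        return x.copy()
    obs = ~miss
    S_oo = Sigma[np.ix_(obs, obs)]
    S_mo = Sigma[np.ix_(miss, obs)]
    x_imp = x.copy()
    x_imp[miss] = mu[miss] + S_mo @ np.linalg.solve(S_oo, x[obs] - mu[obs])
    return x_imp
```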
arXiv Detail & Related papers (2022-01-28T10:01:37Z) - Beyond Parallel Pancakes: Quasi-Polynomial Time Guarantees for Non-Spherical Gaussian Mixtures [9.670578317106182]
We consider mixtures of $k \geq 2$ Gaussian components with unknown means and unknown covariance (identical for all components) that are well-separated.
We show that this kind of hardness can only appear if mixing weights are allowed to be exponentially small.
We develop an algorithm based on the sum-of-squares method with running time quasi-polynomial in the minimum mixing weight.
arXiv Detail & Related papers (2021-12-10T10:51:44Z) - Learning Gaussian Mixtures with Generalised Linear Models: Precise Asymptotics in High-dimensions [79.35722941720734]
Generalised linear models for multi-class classification problems are one of the fundamental building blocks of modern machine learning tasks.
We prove exact asymptotics characterising the estimator in high dimensions via empirical risk minimisation.
We discuss how our theory can be applied beyond the scope of synthetic data.
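The setting is easy to instantiate: labels define a two-component Gaussian mixture, and one studies an ERM estimator on it in the proportional regime $n/d = O(1)$. A toy instantiation (illustrative only; the theory's exact asymptotics are not computed here):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
d, n = 200, 400  # proportional high-dimensional regime, n/d = 2
mu = rng.standard_normal(d) / np.sqrt(d)
y = rng.integers(0, 2, size=n) * 2 - 1        # labels in {-1, +1}
X = np.outer(y, mu) + rng.standard_normal((n, d))  # Gaussian mixture

# Ridge-regularized logistic regression is one ERM estimator whose
# high-dimensional behaviour such asymptotic theories characterise.
clf = LogisticRegression(C=1.0, max_iter=1000).fit(X, y)
print("training error:", np.mean(clf.predict(X) != y))
```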
arXiv Detail & Related papers (2021-06-07T16:53:56Z) - Spatially relaxed inference on high-dimensional linear models [48.989769153211995]
We study the properties of ensembled clustered inference algorithms which combine spatially constrained clustering, statistical inference, and ensembling to aggregate several clustered inference solutions.
We show that ensembled clustered inference algorithms control the $\delta$-FWER under standard assumptions for $\delta$ equal to the largest cluster diameter.
arXiv Detail & Related papers (2021-06-04T16:37:19Z) - Robustly Learning any Clusterable Mixture of Gaussians [55.41573600814391]
We study the efficient learnability of high-dimensional Gaussian mixtures in the adversarial-robust setting.
We provide an algorithm that learns the components of an $\epsilon$-corrupted $k$-mixture within information-theoretically near-optimal error of $\tilde{O}(\epsilon)$.
Our main technical contribution is a new robust identifiability proof for clusters from a Gaussian mixture, which can be captured by the constant-degree Sum of Squares proof system.
arXiv Detail & Related papers (2020-05-13T16:44:12Z) - Algebraic and Analytic Approaches for Parameter Learning in Mixture Models [66.96778152993858]
We present two different approaches for parameter learning in several mixture models in one dimension.
For some of these distributions, our results represent the first guarantees for parameter estimation.
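To see what moment-based parameter learning looks like in its simplest one-dimensional instance (far simpler than the mixtures the paper covers), consider an equal-weight, unit-variance two-Gaussian mixture, whose means are identifiable from the first two moments:

```python
import numpy as np

def two_means_from_moments(X):
    # For (1/2) N(mu1, 1) + (1/2) N(mu2, 1) in one dimension:
    #   E[X]   = (mu1 + mu2) / 2
    #   Var[X] = 1 + (mu1 - mu2)^2 / 4,
    # so both means are recoverable from the first two moments.
    m, s2 = X.mean(), X.var()
    half_gap = np.sqrt(max(s2 - 1.0, 0.0))
    return m - half_gap, m + half_gap

rng = np.random.default_rng(3)
X = np.concatenate([rng.normal(-2, 1, 5000), rng.normal(2, 1, 5000)])
print(two_means_from_moments(X))  # approximately (-2.0, 2.0)
```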
arXiv Detail & Related papers (2020-01-19T05:10:56Z)