Clustering Mixtures of Bounded Covariance Distributions Under Optimal
Separation
- URL: http://arxiv.org/abs/2312.11769v1
- Date: Tue, 19 Dec 2023 01:01:53 GMT
- Title: Clustering Mixtures of Bounded Covariance Distributions Under Optimal
Separation
- Authors: Ilias Diakonikolas, Daniel M. Kane, Jasper C. H. Lee, Thanasis Pittas
- Abstract summary: We study the clustering problem for mixtures of bounded covariance distributions.
We give the first poly-time algorithm for this clustering task.
Our algorithm is robust to an $\Omega(\alpha)$-fraction of adversarial outliers.
- Score: 44.25945344950543
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the clustering problem for mixtures of bounded covariance
distributions, under a fine-grained separation assumption. Specifically, given
samples from a $k$-component mixture distribution $D = \sum_{i =1}^k w_i P_i$,
where each $w_i \ge \alpha$ for some known parameter $\alpha$, and each $P_i$
has unknown covariance $\Sigma_i \preceq \sigma^2_i \cdot I_d$ for some unknown
$\sigma_i$, the goal is to cluster the samples assuming a pairwise mean
separation in the order of $(\sigma_i+\sigma_j)/\sqrt{\alpha}$ between every
pair of components $P_i$ and $P_j$. Our contributions are as follows:
For the special case of nearly uniform mixtures, we give the first poly-time
algorithm for this clustering task. Prior work either required separation
scaling with the maximum cluster standard deviation (i.e., $\max_i \sigma_i$)
[DKK+22b] or required both additional structural assumptions and mean
separation scaling as a large-degree polynomial in $1/\alpha$ [BKK22].
For general-weight mixtures, we point out that accurate clustering is
information-theoretically impossible under our fine-grained mean separation
assumptions. We introduce the notion of a clustering refinement -- a list of
not-too-small subsets satisfying a similar separation property, which can be
merged into a clustering approximating the ground truth -- and show that it is
possible to efficiently compute an accurate clustering refinement of the
samples. Furthermore, under a variant of the "no large sub-cluster" condition
from prior work [BKK22], we show that our algorithm outputs an accurate
clustering, not just a refinement, even for general-weight mixtures. As a
corollary, we obtain efficient clustering algorithms for mixtures of
well-conditioned high-dimensional log-concave distributions.
Moreover, our algorithm is robust to an $\Omega(\alpha)$-fraction of
adversarial outliers.
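To make the separation assumption concrete, here is a minimal NumPy sketch (an illustration of the input model only, not the paper's algorithm) that draws samples from a two-component bounded-covariance mixture whose means satisfy the stated pairwise separation; the constant `C` and all parameter values are hypothetical choices.
```python
import numpy as np

# Sketch of the input model: a 2-component mixture whose means satisfy
#   ||mu_i - mu_j|| >= C * (sigma_i + sigma_j) / sqrt(alpha)
# for a sufficiently large constant C (hypothetical value below).
rng = np.random.default_rng(0)
d, alpha, C = 10, 0.3, 10.0           # dimension, min weight, assumed constant
sigmas = np.array([1.0, 3.0])         # per-component scale bounds sigma_i
weights = np.array([0.3, 0.7])        # each mixing weight w_i >= alpha

gap = C * (sigmas[0] + sigmas[1]) / np.sqrt(alpha)
mus = np.zeros((2, d))
mus[1, 0] = gap                       # place the two means exactly `gap` apart

n = 1000
labels = rng.choice(2, size=n, p=weights)
# Spherical Gaussians satisfy Sigma_i <= sigma_i^2 * I_d (bounded covariance).
samples = mus[labels] + sigmas[labels, None] * rng.standard_normal((n, d))

sep = np.linalg.norm(mus[0] - mus[1])
print(f"pairwise mean separation {sep:.1f}, threshold {gap:.1f}")
```
Note the fine-grained aspect: the threshold for each pair scales with that pair's own $\sigma_i + \sigma_j$, rather than with the global maximum $\max_i \sigma_i$ required by prior work.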
Related papers
- Dimension Reduction via Sum-of-Squares and Improved Clustering Algorithms for Non-Spherical Mixtures [5.668124846154999]
We develop a new approach for clustering non-spherical (i.e., arbitrary component covariances) Gaussian mixture models via a dimension-reduction subroutine.
Our method gives a non-spherical analog of the classical dimension reduction based on singular value decomposition.
Our algorithms naturally extend to tolerating a dimension-independent fraction of arbitrary outliers.
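For context, the following is a sketch of the classical SVD-based dimension reduction that the paper generalizes: project centered data onto its top-$(k-1)$ singular subspace, which approximately contains the span of the $k$ component means, then cluster there. This is the textbook spherical version, not the paper's non-spherical analog or its Sum-of-Squares subroutine.
```python
import numpy as np
from sklearn.cluster import KMeans

def svd_reduce_then_cluster(X, k):
    """Textbook SVD dimension reduction followed by k-means (illustration)."""
    Xc = X - X.mean(axis=0)
    # Top-(k-1) right singular vectors of the centered data matrix.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[: k - 1].T            # project to k-1 dimensions
    return KMeans(n_clusters=k, n_init=10).fit_predict(Z)
```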
arXiv Detail & Related papers (2024-11-19T11:58:51Z)
- Are Easy Data Easy (for K-Means) [0.0]
This paper investigates the capability of various variants of the $k$-means algorithm to correctly recover well-separated clusters.
A new algorithm is proposed: a variant of $k$-means++ that chooses each seed via repeated subsampling.
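The following is a hedged sketch of one plausible reading of this seeding rule, in which each new seed is $D^2$-sampled from a fresh random subsample rather than from the full dataset; the subsample size `m` is a hypothetical parameter and the paper's exact variant may differ.
```python
import numpy as np

def kmeanspp_subsampled(X, k, m=100, rng=None):
    """k-means++-style seeding where each seed is chosen from a subsample."""
    rng = rng or np.random.default_rng()
    seeds = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        cand = X[rng.choice(len(X), size=min(m, len(X)), replace=False)]
        # Squared distance from each candidate to its nearest existing seed.
        d2 = ((cand[:, None, :] - np.array(seeds)[None, :, :]) ** 2).sum(-1)
        d2 = d2.min(axis=1) + 1e-12   # avoid a zero-probability vector
        seeds.append(cand[rng.choice(len(cand), p=d2 / d2.sum())])
    return np.array(seeds)
```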
arXiv Detail & Related papers (2023-08-02T09:40:19Z)
- Replicable Clustering [57.19013971737493]
We propose algorithms for the statistical $k$-medians, statistical $k$-means, and statistical $k$-centers problems by utilizing approximation routines for their counterparts in a black-box manner.
We also provide experiments on synthetic 2D distributions, using the $k$-means++ implementation from sklearn as a black box, that validate our theoretical results.
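Below is a sketch of the kind of black-box experiment described: sklearn's k-means++ run on two independent samples from the same synthetic 2D mixture, comparing the returned centers. The mixture and all parameters are invented for illustration; this is not the replicable algorithm itself.
```python
import numpy as np
from sklearn.cluster import KMeans

def sample_mixture(rng, n=500):
    # Hypothetical 2D mixture of three unit-variance Gaussians.
    means = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
    z = rng.integers(3, size=n)
    return means[z] + rng.standard_normal((n, 2))

def centers(X):
    # sklearn's k-means++ used purely as a black box.
    return KMeans(n_clusters=3, init="k-means++", n_init=10,
                  random_state=0).fit(X).cluster_centers_

rng = np.random.default_rng(1)
c1, c2 = centers(sample_mixture(rng)), centers(sample_mixture(rng))
print(np.round(c1, 1), np.round(c2, 1), sep="\n")  # similar but not identical
```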
arXiv Detail & Related papers (2023-02-20T23:29:43Z)
- Differentially-Private Clustering of Easy Instances [67.04951703461657]
In differentially private clustering, the goal is to identify $k$ cluster centers without disclosing information on individual data points.
We provide implementable differentially private clustering algorithms that achieve utility when the data is "easy".
We propose a framework that allows us to apply non-private clustering algorithms to the easy instances and privately combine the results.
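As a rough illustration of the "privately combine" step only, here is a Laplace-mechanism sketch that releases noisy cluster means of data assumed to lie in $[0,1]^d$. It ignores the privacy cost of computing the cluster assignment itself, which the actual framework must also account for, and is not the paper's algorithm.
```python
import numpy as np

def noisy_cluster_means(X, labels, k, eps, rng):
    """Release per-cluster means via the Laplace mechanism (illustration).

    Assumes every coordinate of X lies in [0, 1]; the sensitivity bound
    below also treats the cluster sizes as fixed, which a real private
    algorithm cannot do.
    """
    d = X.shape[1]
    out = np.zeros((k, d))
    for j in range(k):
        pts = X[labels == j]
        if len(pts) == 0:
            out[j] = rng.uniform(size=d)   # empty cluster: random center
            continue
        # L1 sensitivity of the mean of n points in [0,1]^d is d / n.
        scale = d / (len(pts) * eps)
        out[j] = pts.mean(axis=0) + rng.laplace(0.0, scale, size=d)
    return out
```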
arXiv Detail & Related papers (2021-12-29T08:13:56Z)
- Clustering Mixtures with Almost Optimal Separation in Polynomial Time [21.509452343745252]
We consider the problem of clustering mixtures of mean-separated Gaussians in high dimensions.
It is folklore that separation $\Delta = \Theta(\sqrt{\log k})$ is both necessary and sufficient to recover a good clustering.
We give an algorithm which takes polynomially many samples and time, and which can successfully recover a good clustering.
arXiv Detail & Related papers (2021-12-01T18:34:09Z)
- Fuzzy Clustering with Similarity Queries [56.96625809888241]
The fuzzy or soft $k$-means objective is a popular generalization of the well-known $k$-means problem.
We show that by making a small number of similarity queries, the problem becomes easier to solve.
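For reference, here is the standard fuzzy (soft) $k$-means membership update that defines the objective being generalized; the query-based algorithm itself is not sketched here, and `m` is the usual fuzziness exponent.
```python
import numpy as np

def fuzzy_memberships(X, centers, m=2.0):
    """Soft k-means memberships: u[j, i] = degree point j belongs to center i."""
    # dist[j, i] = distance from point j to center i
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
    dist = np.maximum(dist, 1e-12)         # avoid division by zero
    ratio = (dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)         # each row sums to 1
```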
arXiv Detail & Related papers (2021-06-04T02:32:26Z)
- K-expectiles clustering [0.0]
We propose a novel partitioning clustering algorithm based on expectiles.
We suggest two schemes: fixed $\tau$ clustering, and adaptive $\tau$ clustering.
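A minimal sketch of computing a single $\tau$-expectile by fixed-point iteration, the statistic such a partitioning scheme assigns to each cluster; the clustering loop itself is not shown.
```python
import numpy as np

def expectile(y, tau, n_iter=100):
    """tau-expectile of a 1-D sample: minimizer of the asymmetric squared
    loss sum_i |tau - 1{y_i <= e}| * (y_i - e)^2."""
    e = y.mean()                       # tau = 0.5 recovers the mean
    for _ in range(n_iter):
        w = np.where(y > e, tau, 1.0 - tau)
        e = np.average(y, weights=w)   # weighted-mean fixed point
    return e

y = np.random.default_rng(0).standard_normal(10_000)
print(expectile(y, 0.5), expectile(y, 0.9))  # ~0.0 and a positive value
```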
arXiv Detail & Related papers (2021-03-16T21:14:56Z)
- Computationally efficient sparse clustering [67.95910835079825]
We provide a finite sample analysis of a new clustering algorithm based on PCA.
We show that it achieves the minimax optimal misclustering rate in the regime $\|\theta\| \to \infty$.
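Below is a hedged sketch of a generic sparse clustering pipeline in this spirit: screen coordinates, then cluster on a PCA-style projection. The screening rule and parameters are assumptions for illustration; the paper's actual algorithm and its minimax analysis are not reproduced here.
```python
import numpy as np
from sklearn.cluster import KMeans

def sparse_pca_cluster(X, k, s):
    """Keep the s highest-variance coordinates, project, then k-means."""
    Xc = X - X.mean(axis=0)
    keep = np.argsort(Xc.var(axis=0))[-s:]
    _, _, Vt = np.linalg.svd(Xc[:, keep], full_matrices=False)
    Z = Xc[:, keep] @ Vt[: k - 1].T    # top-(k-1) directions of kept coords
    return KMeans(n_clusters=k, n_init=10).fit_predict(Z)
```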
arXiv Detail & Related papers (2020-05-21T17:51:30Z)
- Robustly Learning any Clusterable Mixture of Gaussians [55.41573600814391]
We study the efficient learnability of high-dimensional Gaussian mixtures in the adversarial-robust setting.
We provide an algorithm that learns the components of an $\epsilon$-corrupted $k$-mixture within information-theoretically near-optimal error of $\tilde{O}(\epsilon)$.
Our main technical contribution is a new robust identifiability proof of clusters from a Gaussian mixture, which can be captured by the constant-degree Sum-of-Squares proof system.
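To fix ideas, here is a sketch of the $\epsilon$-corruption input model: an adversary replaces an $\epsilon$-fraction of the clean sample with arbitrary points (here, a far-away blob). It illustrates the setting only, not the learning algorithm or its Sum-of-Squares analysis.
```python
import numpy as np

def eps_corrupt(X, eps, rng):
    """Replace an eps-fraction of rows of X with adversarial outliers."""
    X = X.copy()
    idx = rng.choice(len(X), size=int(eps * len(X)), replace=False)
    # A simple adversary: plant a tight far-away cluster of outliers.
    X[idx] = 100.0 + rng.standard_normal((len(idx), X.shape[1]))
    return X
```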
arXiv Detail & Related papers (2020-05-13T16:44:12Z)
- Outlier-Robust Clustering of Non-Spherical Mixtures [5.863264019032882]
We give the first outlier-robust efficient algorithm for clustering a mixture of $k$ statistically separated $d$-dimensional Gaussians ($k$-GMMs).
Our results extend to clustering mixtures of arbitrary affine transforms of the uniform distribution on the $d$-dimensional unit sphere.
arXiv Detail & Related papers (2020-05-06T17:24:27Z)