Computationally efficient sparse clustering
- URL: http://arxiv.org/abs/2005.10817v3
- Date: Mon, 22 Mar 2021 17:38:04 GMT
- Title: Computationally efficient sparse clustering
- Authors: Matthias Löffler, Alexander S. Wein, Afonso S. Bandeira
- Abstract summary: We provide a finite sample analysis of a new clustering algorithm based on PCA.
We show that it achieves the minimax optimal misclustering rate in the regime $\|\theta\| \rightarrow \infty$.
- Score: 67.95910835079825
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study statistical and computational limits of clustering when the means of
the centres are sparse and their dimension is possibly much larger than the
sample size. Our theoretical analysis focuses on the model $X_i = z_i \theta +
\varepsilon_i, ~z_i \in \{-1,1\}, ~\varepsilon_i \thicksim \mathcal{N}(0,I)$,
which has two clusters with centres $\theta$ and $-\theta$. We provide a finite
sample analysis of a new sparse clustering algorithm based on sparse PCA and
show that it achieves the minimax optimal misclustering rate in the regime
$\|\theta\| \rightarrow \infty$.
Our results require the sparsity to grow slower than the square root of the
sample size. Using a recent framework for computational lower bounds -- the
low-degree likelihood ratio -- we give evidence that this condition is
necessary for any polynomial-time clustering algorithm to succeed below the BBP
threshold. This complements existing evidence based on reductions and
statistical query lower bounds. Compared to these existing results, we cover a
wider set of parameter regimes and give a more precise understanding of the
runtime required and the misclustering error achievable. Our results imply that
a large class of tests based on low-degree polynomials fail to solve even the
weak testing task.
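The two-cluster model above admits a simple spectral estimate. The following is a minimal, hypothetical sketch (not the paper's exact algorithm): informative coordinates are selected by diagonal thresholding, the leading eigenvector of the restricted sample covariance estimates the direction of $\theta$, and each point is clustered by the sign of its projection. The sparsity level `s` is assumed known here for simplicity.

```python
import numpy as np

def sparse_pca_cluster(X, s):
    """Cluster rows of X (n x p) into two sign-labelled groups via a
    diagonal-thresholding sparse-PCA heuristic (illustrative only)."""
    n, p = X.shape
    # Informative coordinates have second moment ~ 1 + theta_j^2, noise
    # coordinates ~ 1; keep the s coordinates with largest second moment.
    variances = (X ** 2).mean(axis=0)
    support = np.argsort(variances)[-s:]
    # Leading eigenvector of the sample covariance restricted to the
    # selected support estimates the direction of theta.
    cov = X[:, support].T @ X[:, support] / n
    eigvals, eigvecs = np.linalg.eigh(cov)  # ascending eigenvalues
    v = eigvecs[:, -1]
    # Cluster by the sign of the projection onto the estimated direction.
    return np.sign(X[:, support] @ v)

# Synthetic data from the model X_i = z_i * theta + eps_i.
rng = np.random.default_rng(0)
n, p, s = 400, 2000, 10
theta = np.zeros(p)
theta[:s] = 3.0                      # s-sparse centre
z = rng.choice([-1.0, 1.0], size=n)  # true cluster labels
X = np.outer(z, theta) + rng.standard_normal((n, p))

labels = sparse_pca_cluster(X, s)
# Misclustering rate, up to a global sign flip of the labels.
err = min(np.mean(labels != z), np.mean(labels != -z))
```

On well-separated synthetic data like this the misclustering rate is near zero; the paper's analysis quantifies how large $\|\theta\|$ and how small the sparsity must be for polynomial-time procedures of this flavour to succeed.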
Related papers
- Subspace clustering in high-dimensions: Phase transitions & Statistical-to-Computational gap [24.073221004661427]
A simple model to study subspace clustering is the high-dimensional $k$-Gaussian mixture model.
We provide an exact characterization of the statistically optimal reconstruction error in this model in the high-dimensional regime with extensive sparsity.
arXiv Detail & Related papers (2022-05-26T17:47:35Z)
- Minimax Optimal Quantization of Linear Models: Information-Theoretic Limits and Efficient Algorithms [59.724977092582535]
We consider the problem of quantizing a linear model learned from measurements.
We derive an information-theoretic lower bound for the minimax risk under this setting.
We show that our method and upper-bounds can be extended for two-layer ReLU neural networks.
arXiv Detail & Related papers (2022-02-23T02:39:04Z)
- Recovering Unbalanced Communities in the Stochastic Block Model With Application to Clustering with a Faulty Oracle [9.578056676899203]
The stochastic block model (SBM) is a fundamental model for studying graph clustering or community detection in networks.
We provide a simple SVD-based algorithm for recovering the communities in the SBM with communities of varying sizes.
arXiv Detail & Related papers (2022-02-17T08:51:19Z)
- Lattice-Based Methods Surpass Sum-of-Squares in Clustering [98.46302040220395]
Clustering is a fundamental primitive in unsupervised learning.
Recent work has established lower bounds against the class of low-degree methods.
We show that, perhaps surprisingly, this particular clustering model does not exhibit a statistical-to-computational gap.
arXiv Detail & Related papers (2021-12-07T18:50:17Z)
- Spatially relaxed inference on high-dimensional linear models [48.989769153211995]
We study the properties of ensembled clustered inference algorithms which combine spatially constrained clustering, statistical inference, and ensembling to aggregate several clustered inference solutions.
We show that ensembled clustered inference algorithms control the $\delta$-FWER under standard assumptions for $\delta$ equal to the largest cluster diameter.
arXiv Detail & Related papers (2021-06-04T16:37:19Z)
- Fuzzy Clustering with Similarity Queries [56.96625809888241]
The fuzzy or soft objective is a popular generalization of the well-known $k$-means problem.
We show that the problem becomes easier to solve once a small number of similarity queries are allowed.
arXiv Detail & Related papers (2021-06-04T02:32:26Z)
- Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model [50.38446482252857]
This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator).
We first consider $\gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $\mathcal{S}$ and action space $\mathcal{A}$.
We prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level.
arXiv Detail & Related papers (2020-05-26T17:53:18Z)
- Statistical power for cluster analysis [0.0]
Cluster algorithms are increasingly popular in biomedical research.
We estimate power and accuracy for common analyses through simulation.
We recommend that researchers only apply cluster analysis when large subgroup separation is expected.
arXiv Detail & Related papers (2020-03-01T02:43:15Z)
- Structures of Spurious Local Minima in $k$-means [20.155509538529568]
We investigate the structures of spurious local solutions to the $k$-means problem.
We prove that, under a separation condition, this is essentially the only type of spurious local minimum.
arXiv Detail & Related papers (2020-02-16T22:15:03Z)
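Since the last entry concerns local minima of $k$-means, a tiny 1-D experiment (an illustration of the general phenomenon, not the paper's construction) shows how Lloyd's algorithm can converge to a spurious local minimum when two initial centres land in the same true cluster:

```python
import numpy as np

def lloyd(X, centres, iters=50):
    """Plain Lloyd iterations for 1-D k-means; returns final centres and cost."""
    for _ in range(iters):
        # Assign each point to its nearest centre.
        assign = np.argmin(np.abs(X[:, None] - centres[None, :]), axis=1)
        # Move each centre to the mean of its assigned points (skip empty cells).
        for k in range(len(centres)):
            if np.any(assign == k):
                centres[k] = X[assign == k].mean()
    cost = np.min((X[:, None] - centres[None, :]) ** 2, axis=1).sum()
    return centres, cost

rng = np.random.default_rng(1)
# Three tight, well-separated clusters centred at 0, 10 and 20.
X = np.concatenate([rng.normal(m, 0.1, 50) for m in (0.0, 10.0, 20.0)])

# Good initialization: one centre per true cluster.
_, good_cost = lloyd(X, np.array([0.0, 10.0, 20.0]))
# Bad initialization: two centres inside the first cluster, one stranded
# between the other two; Lloyd's updates never escape this configuration.
_, bad_cost = lloyd(X, np.array([-0.2, 0.2, 15.0]))
```

With the bad start, two centres split the first cluster while the third is stuck at the midpoint of the remaining two clusters, so `bad_cost` exceeds `good_cost` by orders of magnitude even though both are fixed points of the iteration.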
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.