Statistical power for cluster analysis
- URL: http://arxiv.org/abs/2003.00381v3
- Date: Tue, 25 May 2021 15:21:57 GMT
- Title: Statistical power for cluster analysis
- Authors: E. S. Dalmaijer, C. L. Nord, and D. E. Astle
- Abstract summary: Cluster algorithms are increasingly popular in biomedical research.
We estimate power and accuracy for common analysis pipelines through simulation.
We recommend that researchers only apply cluster analysis when large subgroup separation is expected.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cluster algorithms are increasingly popular in biomedical research due to
their compelling ability to identify discrete subgroups in data, and their
increasing accessibility in mainstream software. While guidelines exist for
algorithm selection and outcome evaluation, there are no firmly established
ways of computing a priori statistical power for cluster analysis. Here, we
estimated power and accuracy for common analysis pipelines through simulation.
We varied subgroup size, number, separation (effect size), and covariance
structure. We then subjected generated datasets to dimensionality reduction
(none, multidimensional scaling, or UMAP) and cluster algorithms (k-means,
agglomerative hierarchical clustering with Ward or average linkage and
Euclidean or cosine distance, HDBSCAN). Finally, we compared the statistical
power of discrete (k-means), "fuzzy" (c-means), and finite mixture modelling
approaches (which include latent profile and latent class analysis). We found
that outcomes were driven by large effect sizes or the accumulation of many
smaller effects across features, and were unaffected by differences in
covariance structure. Sufficient statistical power was achieved with relatively
small samples (N=20 per subgroup), provided cluster separation is large
(Δ=4). Fuzzy clustering provided a more parsimonious and powerful
alternative for identifying separable multivariate normal distributions,
particularly those with slightly lower centroid separation (Δ=3).
Overall, we recommend that researchers 1) only apply cluster analysis when
large subgroup separation is expected, 2) aim for sample sizes of N=20 to N=30
per expected subgroup, 3) use multidimensional scaling to improve cluster
separation, and 4) use fuzzy clustering or finite mixture modelling approaches
that are more powerful and more parsimonious with partially overlapping
multivariate normal distributions.
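The abstract describes a simulation-based power analysis: draw multivariate normal subgroups whose centroids are separated by Δ, optionally apply dimensionality reduction, cluster, and score how often the true subgroups are recovered. The sketch below illustrates that idea in Python; it is not the authors' pipeline, and the parameter names, the adjusted Rand index criterion, and the 0.8 recovery threshold are illustrative assumptions.

```python
# Minimal sketch of a simulation-based power estimate for cluster analysis.
# Assumptions (not from the paper): recovery is scored with the adjusted Rand
# index, and a run "succeeds" when ARI >= 0.8.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import MDS
from sklearn.metrics import adjusted_rand_score


def simulate_dataset(n_per_group=20, n_groups=2, n_features=4, delta=4.0, rng=None):
    """Draw multivariate normal subgroups with unit variance whose centroids
    are separated by `delta` standard deviations along the first feature."""
    rng = rng if rng is not None else np.random.default_rng()
    X, y = [], []
    for g in range(n_groups):
        mean = np.zeros(n_features)
        mean[0] = g * delta  # centroid separation on one feature only
        X.append(rng.normal(loc=mean, scale=1.0, size=(n_per_group, n_features)))
        y.append(np.full(n_per_group, g))
    return np.vstack(X), np.concatenate(y)


def estimated_power(n_sims=200, ari_threshold=0.8, use_mds=True, **sim_kwargs):
    """Fraction of simulated datasets in which k-means recovers the true subgroups."""
    rng = np.random.default_rng(0)
    hits = 0
    for _ in range(n_sims):
        X, y = simulate_dataset(rng=rng, **sim_kwargs)
        if use_mds:  # optional dimensionality-reduction step before clustering
            X = MDS(n_components=2).fit_transform(X)
        labels = KMeans(n_clusters=len(np.unique(y)), n_init=10).fit_predict(X)
        hits += int(adjusted_rand_score(y, labels) >= ari_threshold)
    return hits / n_sims


if __name__ == "__main__":
    print("power at delta=4:", estimated_power(n_per_group=20, delta=4.0))
    print("power at delta=2:", estimated_power(n_per_group=20, delta=2.0))
```

Comparing the two runs illustrates the abstract's central point: with N=20 per subgroup, subgroups are recovered reliably only when centroid separation is large.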
Related papers
- Self-Supervised Graph Embedding Clustering [70.36328717683297]
The K-means one-step dimensionality reduction clustering method has made some progress in addressing the curse of dimensionality in clustering tasks.
We propose a unified framework that integrates manifold learning with K-means, resulting in the self-supervised graph embedding framework.
arXiv Detail & Related papers (2024-09-24T08:59:51Z) - A simulation study of cluster search algorithms in data set generated by Gaussian mixture models [0.0]
This study examines centroid- and model-based cluster search algorithms across various settings that Gaussian mixture models (GMMs) can generate.
The results show that some cluster-splitting criteria based on Euclidean distance make unreasonable decisions when clusters overlap.
arXiv Detail & Related papers (2024-07-27T07:47:25Z) - Causal K-Means Clustering [5.087519744951637]
Causal k-Means Clustering harnesses the widely-used k-means clustering algorithm to uncover the unknown subgroup structure.
We present a plug-in estimator which is simple and readily implementable using off-the-shelf algorithms.
Our proposed methods are especially useful for modern outcome-wide studies with multiple treatment levels.
arXiv Detail & Related papers (2024-05-05T23:59:51Z) - Linear time Evidence Accumulation Clustering with KMeans [0.0]
This work describes a trick which mimics the behavior of average linkage clustering.
We found a way of efficiently computing the density of a partitioning, reducing the cost from quadratic to linear complexity.
The k-means results are comparable to the best state of the art in terms of NMI while keeping the computational cost low.
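The summary above reports agreement with the state of the art in terms of NMI (normalized mutual information). As a point of reference, here is a minimal scikit-learn sketch of scoring a k-means partition with NMI; it is illustrative only and not code from that paper.

```python
# Illustrative only: score a k-means partition against ground truth with NMI.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import normalized_mutual_info_score

# Three well-separated blobs serve as toy ground truth.
X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# NMI is 1.0 for identical partitions and close to 0.0 for unrelated ones.
print(normalized_mutual_info_score(y_true, y_pred))
```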
arXiv Detail & Related papers (2023-11-15T14:12:59Z) - Superclustering by finding statistically significant separable groups of
optimal gaussian clusters [0.0]
The paper presents an algorithm for clustering a dataset by grouping Gaussian clusters that are optimal from the point of view of the BIC criterion.
An essential advantage of the algorithm is its ability to predict the correct supercluster for new data based on an already trained clusterer.
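The entry above selects Gaussian clusters that are optimal under the BIC criterion. A minimal, illustrative sketch of BIC-based selection of the number of Gaussian mixture components with scikit-learn follows; this is not the superclustering algorithm itself.

```python
# Illustrative only: choose the number of Gaussian mixture components by BIC.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=400, centers=3, random_state=1)

# Fit mixtures with 1..6 components and keep the one with the lowest BIC.
models = [GaussianMixture(n_components=k, random_state=1).fit(X) for k in range(1, 7)]
best = min(models, key=lambda m: m.bic(X))
print("selected components:", best.n_components)
```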
arXiv Detail & Related papers (2023-09-05T23:49:46Z) - Instance-Optimal Cluster Recovery in the Labeled Stochastic Block Model [79.46465138631592]
We devise an efficient algorithm that recovers clusters using the observed labels.
We present Instance-Adaptive Clustering (IAC), the first algorithm whose performance matches these lower bounds both in expectation and with high probability.
arXiv Detail & Related papers (2023-06-18T08:46:06Z) - Rethinking k-means from manifold learning perspective [122.38667613245151]
We present a new clustering algorithm which directly detects clusters of data without mean estimation.
Specifically, we construct a distance matrix between data points using a Butterworth filter.
To well exploit the complementary information embedded in different views, we leverage the tensor Schatten p-norm regularization.
arXiv Detail & Related papers (2023-05-12T03:01:41Z) - Subspace clustering in high-dimensions: Phase transitions \&
Statistical-to-Computational gap [24.073221004661427]
A simple model to study subspace clustering is the high-dimensional $k$-Gaussian mixture model.
We provide an exact characterization of the statistically optimal reconstruction error in this model in the high-dimensional regime with extensive sparsity.
arXiv Detail & Related papers (2022-05-26T17:47:35Z) - Differentially-Private Clustering of Easy Instances [67.04951703461657]
In differentially private clustering, the goal is to identify $k$ cluster centers without disclosing information on individual data points.
We provide implementable differentially private clustering algorithms that offer utility when the data is "easy".
We propose a framework that allows us to apply non-private clustering algorithms to the easy instances and privately combine the results.
arXiv Detail & Related papers (2021-12-29T08:13:56Z) - Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z) - Computationally efficient sparse clustering [67.95910835079825]
We provide a finite sample analysis of a new clustering algorithm based on PCA.
We show that it achieves the minimax optimal misclustering rate in the regime $\|\theta\| \to \infty$.
arXiv Detail & Related papers (2020-05-21T17:51:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.