Selective inference for k-means clustering
- URL: http://arxiv.org/abs/2203.15267v1
- Date: Tue, 29 Mar 2022 06:28:12 GMT
- Title: Selective inference for k-means clustering
- Authors: Yiqun T. Chen, Daniela M. Witten
- Abstract summary: We propose a finite-sample p-value that controls the selective Type I error for a test of the difference in means between a pair of clusters obtained using k-means clustering.
We apply our proposal in simulation, and on hand-written digits data and single-cell RNA-sequencing data.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider the problem of testing for a difference in means between clusters
of observations identified via k-means clustering. In this setting, classical
hypothesis tests lead to an inflated Type I error rate. To overcome this
problem, we take a selective inference approach. We propose a finite-sample
p-value that controls the selective Type I error for a test of the difference
in means between a pair of clusters obtained using k-means clustering, and show
that it can be efficiently computed. We apply our proposal in simulation, and
on hand-written digits data and single-cell RNA-sequencing data.
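The abstract's claim that classical tests have an inflated Type I error after clustering can be illustrated with a small simulation. The sketch below is not the selective p-value proposed in the paper. It only shows the problem that motivates it, under assumptions chosen here for convenience: 2-dimensional data drawn from a single Gaussian with known unit variance (so there are no true clusters), a basic Lloyd's-algorithm implementation of 2-means, and a naive chi-squared test on the difference in cluster means. The function names `two_means`, `naive_p_value`, and `naive_type_i_error` are hypothetical.

```python
import numpy as np


def two_means(x, rng, n_iter=50):
    """Basic Lloyd's algorithm with k=2; returns a 0/1 label per row of x."""
    # Initialise the two centers at two distinct, randomly chosen observations.
    centers = x[rng.choice(len(x), size=2, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to each center.
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        if labels.min() == labels.max():
            break  # one cluster is empty; stop rather than average nothing
        new_centers = np.stack([x[labels == k].mean(axis=0) for k in range(2)])
        if np.allclose(new_centers, centers):
            break  # converged
        centers = new_centers
    return labels


def naive_p_value(x, labels, sigma=1.0):
    """Naive test of equal cluster means that ignores how labels were chosen.

    Assumes 2-D data with known isotropic variance sigma**2. If the labels
    were fixed in advance, the scaled squared mean difference would be
    chi-squared with 2 degrees of freedom, whose survival function is
    exp(-t / 2).
    """
    a, b = x[labels == 0], x[labels == 1]
    if len(a) == 0 or len(b) == 0:
        return 1.0
    diff = a.mean(axis=0) - b.mean(axis=0)
    scale = sigma ** 2 * (1.0 / len(a) + 1.0 / len(b))
    stat = float(diff @ diff) / scale
    return float(np.exp(-stat / 2.0))


def naive_type_i_error(n_sims=200, n=50, alpha=0.05, seed=0):
    """Fraction of null data sets on which the naive test rejects."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        x = rng.normal(size=(n, 2))  # one Gaussian: no true clusters
        labels = two_means(x, rng)
        rejections += naive_p_value(x, labels) < alpha
    return rejections / n_sims


rate = naive_type_i_error()
print(f"naive rejection rate at alpha = 0.05: {rate:.2f}")
```

A valid test would reject in roughly 5% of these null data sets. The naive test rejects far more often, because k-means is designed to place the two cluster means far apart, and the test then reuses the same data to check that separation.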
Related papers
- On uniqueness of the set of k-means [0.5735035463793009]
We assess the consistency of the empirical k-means, adapted to the setting of non-uniqueness.
We derive a bootstrap test for uniqueness of the set of k-means.
The results are illustrated with examples of different types of non-uniqueness.
arXiv Detail & Related papers (2024-10-17T12:40:56Z)
- Self-Supervised Graph Embedding Clustering [70.36328717683297]
The K-means one-step dimensionality reduction clustering method has made some progress in addressing the curse of dimensionality in clustering tasks.
We propose a unified framework that integrates manifold learning with K-means, resulting in the self-supervised graph embedding framework.
arXiv Detail & Related papers (2024-09-24T08:59:51Z)
- Interpretable Clustering with the Distinguishability Criterion [0.4419843514606336]
We present a global criterion called the Distinguishability criterion to quantify the separability of identified clusters and validate inferred cluster configurations.
We propose a combined loss function-based computational framework that integrates the Distinguishability criterion with many commonly used clustering procedures.
We present these new algorithms as well as the results from comprehensive data analysis based on simulation studies and real data applications.
arXiv Detail & Related papers (2024-04-24T16:38:15Z)
- Rethinking k-means from manifold learning perspective [122.38667613245151]
We present a new clustering algorithm which directly detects clusters of data without mean estimation.
Specifically, we construct a distance matrix between data points using a Butterworth filter.
To fully exploit the complementary information embedded in different views, we leverage the tensor Schatten p-norm regularization.
arXiv Detail & Related papers (2023-05-12T03:01:41Z)
- Sketch-and-solve approaches to k-means clustering by semidefinite programming [14.930208990741132]
We introduce a sketch-and-solve approach to speed up the Peng-Wei semidefinite relaxation of k-means clustering.
If the data is appropriately separated, we identify the k-means optimal clustering.
Otherwise, our approach provides a high-confidence lower bound on the optimal k-means value.
arXiv Detail & Related papers (2022-11-28T19:51:30Z)
- A One-shot Framework for Distributed Clustered Learning in Heterogeneous Environments [54.172993875654015]
The paper proposes a family of communication efficient methods for distributed learning in heterogeneous environments.
A one-shot approach, based on local computations at the users and a clustering-based aggregation step at the server, is shown to provide strong learning guarantees.
For strongly convex problems it is shown that, as long as the number of data points per user is above a threshold, the proposed approach achieves order-optimal mean-squared error rates in terms of the sample size.
arXiv Detail & Related papers (2022-09-22T09:04:10Z)
- Local versions of sum-of-norms clustering [77.34726150561087]
We show that our method can separate arbitrarily close balls in the ball model.
We prove a quantitative bound on the error incurred in the clustering of disjoint connected sets.
arXiv Detail & Related papers (2021-09-20T14:45:29Z)
- Selective Inference for Hierarchical Clustering [2.3311605203774386]
We propose a selective inference approach to test for a difference in means between two clusters obtained from any clustering method.
Our procedure controls the selective Type I error rate by accounting for the fact that the null hypothesis was generated from the data.
arXiv Detail & Related papers (2020-12-05T03:03:19Z)
- Decorrelated Clustering with Data Selection Bias [55.91842043124102]
We propose a novel Decorrelation regularized K-Means algorithm (DCKM) for clustering with data selection bias.
Our DCKM algorithm achieves significant performance gains, indicating the necessity of removing unexpected feature correlations induced by selection bias.
arXiv Detail & Related papers (2020-06-29T08:55:50Z)
- Selective Inference for Latent Block Models [50.83356836818667]
This study provides a selective inference method for latent block models.
We construct a statistical test on a set of row and column cluster memberships of a latent block model.
The proposed exact and approximated tests work effectively, compared to the naive test that did not take the selective bias into account.
arXiv Detail & Related papers (2020-05-27T10:44:19Z)
- Blocked Clusterwise Regression [0.0]
We generalize previous approaches to discrete unobserved heterogeneity by allowing each unit to have multiple latent variables.
We contribute to the theory of clustering with an over-specified number of clusters and derive new convergence rates for this setting.
arXiv Detail & Related papers (2020-01-29T23:29:31Z)