Selective Inference for Hierarchical Clustering
- URL: http://arxiv.org/abs/2012.02936v1
- Date: Sat, 5 Dec 2020 03:03:19 GMT
- Title: Selective Inference for Hierarchical Clustering
- Authors: Lucy L. Gao, Jacob Bien and Daniela Witten
- Abstract summary: We propose a selective inference approach to test for a difference in means between two clusters obtained from any clustering method.
Our procedure controls the selective Type I error rate by accounting for the fact that the null hypothesis was generated from the data.
- Score: 2.3311605203774386
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Testing for a difference in means between two groups is fundamental to
answering research questions across virtually every scientific area. Classical
tests control the Type I error rate when the groups are defined a priori.
However, when the groups are instead defined via a clustering algorithm, then
applying a classical test for a difference in means between the groups yields
an extremely inflated Type I error rate. Notably, this problem persists even if
two separate and independent data sets are used to define the groups and to
test for a difference in their means. To address this problem, in this paper,
we propose a selective inference approach to test for a difference in means
between two clusters obtained from any clustering method. Our procedure
controls the selective Type I error rate by accounting for the fact that the
null hypothesis was generated from the data. We describe how to efficiently
compute exact p-values for clusters obtained using agglomerative hierarchical
clustering with many commonly used linkages. We apply our method to simulated
data and to single-cell RNA-seq data.
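The "double dipping" problem described in the abstract can be seen in a short simulation: under a global null (one Gaussian, no true clusters), clustering the data and then running a classical two-sample t-test between the resulting clusters rejects far more often than the nominal level. This is a minimal sketch of the naive procedure the paper corrects, not of the paper's selective test itself; the sample size, linkage, and number of clusters are illustrative choices.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n, reps, alpha = 40, 200, 0.05
rejections, tests = 0, 0
for _ in range(reps):
    X = rng.standard_normal((n, 1))           # global null: no true group structure
    Z = linkage(X, method="average")          # agglomerative hierarchical clustering
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut into two clusters
    g1, g2 = X[labels == 1, 0], X[labels == 2, 0]
    if min(len(g1), len(g2)) < 2:
        continue                              # t-test needs >= 2 points per group
    _, pval = ttest_ind(g1, g2)               # classical test, ignoring selection
    tests += 1
    rejections += pval < alpha
rate = rejections / tests
print(f"empirical Type I error: {rate:.2f}")  # far above the nominal 0.05
```

Because the two clusters are chosen precisely to look different, the classical test's null distribution no longer applies; the selective inference procedure in the paper restores validity by conditioning on the clustering event.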
Related papers
- ABCDE: Application-Based Cluster Diff Evals [49.1574468325115]
It aims to be practical: items can carry application-specific importance values, human judgements are used frugally when determining which clustering is better, and metrics can be reported for arbitrary slices of items.
The approach to measuring the delta in clustering quality is novel: instead of constructing an expensive ground truth up front and evaluating each clustering against it, ABCDE samples questions for judgement based on the actual diffs between the clusterings.
arXiv Detail & Related papers (2024-07-31T08:29:35Z) - A Computational Theory and Semi-Supervised Algorithm for Clustering [0.0]
A semi-supervised clustering algorithm is presented.
The kernel of the clustering method is Mohammad's anomaly detection algorithm.
Results are presented on synthetic and real-world data sets.
arXiv Detail & Related papers (2023-06-12T09:15:58Z) - Leveraging Structure for Improved Classification of Grouped Biased Data [8.121462458089143]
We consider semi-supervised binary classification for applications in which data points are naturally grouped.
We derive a semi-supervised algorithm that explicitly leverages the structure to learn an optimal, group-aware, probabilistic classifier.
arXiv Detail & Related papers (2022-12-07T15:18:21Z) - Confident Clustering via PCA Compression Ratio and Its Application to
Single-cell RNA-seq Analysis [4.511561231517167]
We develop a confident clustering method to diminish the influence of boundary datapoints.
We validate our algorithm on single-cell RNA-seq data.
Unlike traditional clustering methods in single-cell analysis, the confident clustering shows high stability under different choices of parameters.
arXiv Detail & Related papers (2022-05-19T20:46:49Z) - Selective inference for k-means clustering [0.0]
We propose a finite-sample p-value that controls the selective Type I error for a test of the difference in means between a pair of clusters obtained using k-means clustering.
We apply our proposal to simulated data, hand-written digits data, and single-cell RNA-sequencing data.
arXiv Detail & Related papers (2022-03-29T06:28:12Z) - Optimal Clustering with Bandit Feedback [57.672609011609886]
This paper considers the problem of online clustering with bandit feedback.
It includes a novel stopping rule for sequential testing that circumvents the need to solve any NP-hard weighted clustering problem as its subroutines.
We show through extensive simulations on synthetic and real-world datasets that BOC's performance matches the lower bound asymptotically, and significantly outperforms a non-adaptive baseline algorithm.
arXiv Detail & Related papers (2022-02-09T06:05:05Z) - Anomaly Clustering: Grouping Images into Coherent Clusters of Anomaly
Types [60.45942774425782]
We introduce anomaly clustering, whose goal is to group data into coherent clusters of anomaly types.
This is different from anomaly detection, whose goal is to divide anomalies from normal data.
We present a simple yet effective clustering framework using patch-based pretrained deep embeddings and off-the-shelf clustering methods.
arXiv Detail & Related papers (2021-12-21T23:11:33Z) - Selecting the number of clusters, clustering models, and algorithms. A
unifying approach based on the quadratic discriminant score [0.5330240017302619]
We propose a selection rule that allows choosing among many clustering solutions.
The proposed method has the distinctive advantage that it can compare partitions that cannot be compared with other state-of-the-art methods.
arXiv Detail & Related papers (2021-11-03T15:38:58Z) - Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z) - Selective Inference for Latent Block Models [50.83356836818667]
This study provides a selective inference method for latent block models.
We construct a statistical test on a set of row and column cluster memberships of a latent block model.
The proposed exact and approximate tests work effectively, compared to the naive test that does not account for the selective bias.
arXiv Detail & Related papers (2020-05-27T10:44:19Z) - Optimal Clustering from Noisy Binary Feedback [75.17453757892152]
We study the problem of clustering a set of items from binary user feedback.
We devise an algorithm with a minimal cluster recovery error rate.
For adaptive selection, we develop an algorithm inspired by the derivation of the information-theoretical error lower bounds.
arXiv Detail & Related papers (2019-10-14T09:18:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.