Selective Inference for Hierarchical Clustering
- URL: http://arxiv.org/abs/2012.02936v1
- Date: Sat, 5 Dec 2020 03:03:19 GMT
- Title: Selective Inference for Hierarchical Clustering
- Authors: Lucy L. Gao, Jacob Bien and Daniela Witten
- Abstract summary: We propose a selective inference approach to test for a difference in means between two clusters obtained from any clustering method.
Our procedure controls the selective Type I error rate by accounting for the fact that the null hypothesis was generated from the data.
- Score: 2.3311605203774386
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Testing for a difference in means between two groups is fundamental to
answering research questions across virtually every scientific area. Classical
tests control the Type I error rate when the groups are defined a priori.
However, when the groups are instead defined via a clustering algorithm, then
applying a classical test for a difference in means between the groups yields
an extremely inflated Type I error rate. Notably, this problem persists even if
two separate and independent data sets are used to define the groups and to
test for a difference in their means. To address this problem, in this paper,
we propose a selective inference approach to test for a difference in means
between two clusters obtained from any clustering method. Our procedure
controls the selective Type I error rate by accounting for the fact that the
null hypothesis was generated from the data. We describe how to efficiently
compute exact p-values for clusters obtained using agglomerative hierarchical
clustering with many commonly used linkages. We apply our method to simulated
data and to single-cell RNA-seq data.
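The "double dipping" problem described in the abstract can be seen in a short simulation: under a global null (one Gaussian, no true clusters), clustering the data and then running a classical two-sample t-test between the resulting clusters rejects far more often than the nominal level. This is a minimal sketch of the naive procedure the paper corrects, not of the paper's selective test itself; the sample size, linkage, and number of clusters are illustrative choices.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n, reps, alpha = 40, 200, 0.05
rejections, tests = 0, 0
for _ in range(reps):
    X = rng.standard_normal((n, 1))           # global null: no true group structure
    Z = linkage(X, method="average")          # agglomerative hierarchical clustering
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut into two clusters
    g1, g2 = X[labels == 1, 0], X[labels == 2, 0]
    if min(len(g1), len(g2)) < 2:
        continue                              # t-test needs >= 2 points per group
    _, pval = ttest_ind(g1, g2)               # classical test, ignoring selection
    tests += 1
    rejections += pval < alpha
rate = rejections / tests
print(f"empirical Type I error: {rate:.2f}")  # far above the nominal 0.05
```

Because the two clusters are chosen precisely to look different, the classical test's null distribution no longer applies; the selective inference procedure in the paper restores validity by conditioning on the clustering event.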
Related papers
- ABCDE: Application-Based Cluster Diff Evals [49.1574468325115]
It aims to be practical: items can carry application-specific importance values, human judgements are used frugally when determining which clustering is better, and metrics can be reported for arbitrary slices of items.
The approach to measuring the delta in clustering quality is novel: instead of constructing an expensive ground truth up front and evaluating each clustering against it, ABCDE samples questions for judgement based on the actual diffs between the clusterings.
arXiv Detail & Related papers (2024-07-31T08:29:35Z) - A Computational Theory and Semi-Supervised Algorithm for Clustering [0.0]
A semi-supervised clustering algorithm is presented.
The kernel of the clustering method is Mohammad's anomaly detection algorithm.
Results are presented on synthetic and real-world data sets.
arXiv Detail & Related papers (2023-06-12T09:15:58Z) - Leveraging Structure for Improved Classification of Grouped Biased Data [8.121462458089143]
We consider semi-supervised binary classification for applications in which data points are naturally grouped.
We derive a semi-supervised algorithm that explicitly leverages the structure to learn an optimal, group-aware, probabilistic classifier.
arXiv Detail & Related papers (2022-12-07T15:18:21Z) - Confident Clustering via PCA Compression Ratio and Its Application to
Single-cell RNA-seq Analysis [4.511561231517167]
We develop a confident clustering method to diminish the influence of boundary datapoints.
We validate our algorithm on single-cell RNA-seq data.
Unlike traditional clustering methods in single-cell analysis, the confident clustering shows high stability under different choices of parameters.
arXiv Detail & Related papers (2022-05-19T20:46:49Z) - Selective inference for k-means clustering [0.0]
We propose a finite-sample p-value that controls the selective Type I error for a test of the difference in means between a pair of clusters obtained using k-means clustering.
We apply our proposal to simulated data, hand-written digits data, and single-cell RNA-sequencing data.
arXiv Detail & Related papers (2022-03-29T06:28:12Z) - Optimal Clustering with Bandit Feedback [57.672609011609886]
This paper considers the problem of online clustering with bandit feedback.
It includes a novel stopping rule for sequential testing that circumvents the need to solve any NP-hard weighted clustering problem as its subroutines.
We show through extensive simulations on synthetic and real-world datasets that BOC's performance matches the lower bound asymptotically, and significantly outperforms a non-adaptive baseline algorithm.
arXiv Detail & Related papers (2022-02-09T06:05:05Z) - Anomaly Clustering: Grouping Images into Coherent Clusters of Anomaly
Types [60.45942774425782]
We introduce anomaly clustering, whose goal is to group data into coherent clusters of anomaly types.
This is different from anomaly detection, whose goal is to divide anomalies from normal data.
We present a simple yet effective clustering framework using patch-based pretrained deep embeddings and off-the-shelf clustering methods.
arXiv Detail & Related papers (2021-12-21T23:11:33Z) - Selecting the number of clusters, clustering models, and algorithms. A
unifying approach based on the quadratic discriminant score [0.5330240017302619]
We propose a selection rule that allows choosing among many clustering solutions.
The proposed method has the distinctive advantage that it can compare partitions that cannot be compared with other state-of-the-art methods.
arXiv Detail & Related papers (2021-11-03T15:38:58Z) - Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z) - Selective Inference for Latent Block Models [50.83356836818667]
This study provides a selective inference method for latent block models.
We construct a statistical test on a set of row and column cluster memberships of a latent block model.
The proposed exact and approximate tests work effectively, compared to the naive test that does not account for the selective bias.
arXiv Detail & Related papers (2020-05-27T10:44:19Z) - Optimal Clustering from Noisy Binary Feedback [75.17453757892152]
We study the problem of clustering a set of items from binary user feedback.
We devise an algorithm with a minimal cluster recovery error rate.
For adaptive selection, we develop an algorithm inspired by the derivation of the information-theoretical error lower bounds.
arXiv Detail & Related papers (2019-10-14T09:18:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.