Two-cluster test
- URL: http://arxiv.org/abs/2507.08382v2
- Date: Mon, 14 Jul 2025 06:58:33 GMT
- Title: Two-cluster test
- Authors: Xinying Liu, Lianyu Hu, Mudi Jiang, Simeng Zhang, Jun Lou, Zengyou He
- Abstract summary: We introduce the two-cluster test issue and argue that it is a fundamentally different significance testing problem from the conventional two-sample test. Experiments on both synthetic and real data sets show that the proposed test is able to significantly reduce the Type-I error rate. More importantly, the practical utility of the two-cluster test is further verified through its applications in tree-based interpretable clustering and significance-based hierarchical clustering.
- Score: 1.871954330708119
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cluster analysis is a fundamental research issue in statistics and machine learning. In many modern clustering methods, we need to determine whether two subsets of samples come from the same cluster. Since these subsets are usually generated by certain clustering procedures, deploying classic two-sample tests in this context would yield extremely small p-values, leading to an inflated Type-I error rate. To overcome this bias, we formally introduce the two-cluster test issue and argue that it is a fundamentally different significance testing problem from the conventional two-sample test. Meanwhile, we present a new method based on the boundary points between two subsets to derive an analytical p-value for the purpose of significance quantification. Experiments on both synthetic and real data sets show that the proposed test is able to significantly reduce the Type-I error rate, in comparison with several classic two-sample testing methods. More importantly, the practical utility of the two-cluster test is further verified through its applications in tree-based interpretable clustering and significance-based hierarchical clustering.
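The bias the abstract describes is easy to reproduce: if the two "samples" being compared were themselves produced by a clustering procedure on homogeneous data, a classic two-sample test rejects almost surely. Below is a minimal sketch (not the paper's boundary-point method) using a hand-rolled 1-d 2-means and a z-test; all variable names are illustrative.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)  # one homogeneous Gaussian cluster: the null is true

# Minimal 1-d 2-means (Lloyd's algorithm) -- the two "samples" are data-driven.
centres = np.array([x.min(), x.max()])
for _ in range(50):
    labels = np.abs(x[:, None] - centres).argmin(axis=1)
    centres = np.array([x[labels == k].mean() for k in (0, 1)])

a, b = x[labels == 0], x[labels == 1]

# Classic two-sample z-statistic between the two k-means clusters.
se = math.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
z = (a.mean() - b.mean()) / se
p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal approximation
print(f"z = {z:.1f}, p = {p:.3e}")   # a vanishingly small p-value
```

Because k-means chooses the split that maximizes the between-group separation, the resulting p-value is essentially zero even though both groups come from the same distribution, which is exactly why a dedicated two-cluster test is needed.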
Related papers
- Statistical Verification of Linear Classifiers [76.95660509846216]
We propose a homogeneity test closely related to the concept of linear separability between two samples.
We focus on establishing upper bounds for the test's p-value when applied to two-dimensional samples.
arXiv Detail & Related papers (2025-01-24T11:56:45Z)
- Machine Learning for Two-Sample Testing under Right-Censored Data: A Simulation Study [0.0]
This study evaluates the effectiveness of Machine Learning (ML) methods for two-sample testing with right-censored observations.
In total, this work covers 18 methods for two-sample testing under right-censored observations.
To address the two-sample problem with right-censored observations, one can use the proposed two-sample methods (scripts, datasets, and models are available on GitHub and Hugging Face).
arXiv Detail & Related papers (2024-09-12T16:38:20Z)
- GCC: Generative Calibration Clustering [55.44944397168619]
We propose a novel Generative Calibration Clustering (GCC) method to incorporate feature learning and augmentation into the clustering procedure.
First, we develop a discriminative feature alignment mechanism to discover intrinsic relationships across real and generated samples.
Second, we design a self-supervised metric learning scheme to generate more reliable cluster assignments.
arXiv Detail & Related papers (2024-04-14T01:51:11Z)
- Deep anytime-valid hypothesis testing [29.273915933729057]
We propose a general framework for constructing powerful, sequential hypothesis tests for nonparametric testing problems.
We develop a principled approach of leveraging the representation capability of machine learning models within the testing-by-betting framework.
Empirical results on synthetic and real-world datasets demonstrate that tests instantiated using our general framework are competitive against specialized baselines.
arXiv Detail & Related papers (2023-10-30T09:46:19Z)
- Bootstrapped Edge Count Tests for Nonparametric Two-Sample Inference Under Heterogeneity [5.8010446129208155]
We develop a new nonparametric testing procedure that accurately detects differences between the two samples.
A comprehensive simulation study and an application to detecting user behaviors in online games demonstrate the excellent non-asymptotic performance of the proposed test.
arXiv Detail & Related papers (2023-04-26T22:25:44Z)
- Parametric Classification for Generalized Category Discovery: A Baseline Study [70.73212959385387]
Generalized Category Discovery (GCD) aims to discover novel categories in unlabelled datasets using knowledge learned from labelled samples.
We investigate the failure of parametric classifiers, verify the effectiveness of previous design choices when high-quality supervision is available, and identify unreliable pseudo-labels as a key problem.
We propose a simple yet effective parametric classification method that benefits from entropy regularisation, achieves state-of-the-art performance on multiple GCD benchmarks and shows strong robustness to unknown class numbers.
arXiv Detail & Related papers (2022-11-21T18:47:11Z)
- Statistical and Computational Phase Transitions in Group Testing [73.55361918807883]
We study the group testing problem where the goal is to identify a set of k infected individuals carrying a rare disease.
We consider two different simple random procedures for assigning individuals to tests.
arXiv Detail & Related papers (2022-06-15T16:38:50Z)
- Selective inference for k-means clustering [0.0]
We propose a finite-sample p-value that controls the selective Type I error for a test of the difference in means between a pair of clusters obtained using k-means clustering.
We apply our proposal in simulation, and to hand-written digits data and single-cell RNA-sequencing data.
arXiv Detail & Related papers (2022-03-29T06:28:12Z)
- Selective Inference for Hierarchical Clustering [2.3311605203774386]
We propose a selective inference approach to test for a difference in means between two clusters obtained from any clustering method.
Our procedure controls the selective Type I error rate by accounting for the fact that the null hypothesis was generated from the data.
arXiv Detail & Related papers (2020-12-05T03:03:19Z)
- Scalable Hierarchical Agglomerative Clustering [65.66407726145619]
Existing scalable hierarchical clustering methods sacrifice quality for speed.
We present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points.
arXiv Detail & Related papers (2020-10-22T15:58:35Z)
- Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.