High-dimensional and universally consistent k-sample tests
- URL: http://arxiv.org/abs/1910.08883v4
- Date: Wed, 11 Oct 2023 17:14:41 GMT
- Title: High-dimensional and universally consistent k-sample tests
- Authors: Sambit Panda, Cencheng Shen, Ronan Perry, Jelle Zorn, Antoine Lutz,
Carey E. Priebe, Joshua T. Vogelstein
- Abstract summary: The k-sample testing problem involves determining whether $k$ groups of data points are each drawn from the same distribution.
Independence tests achieve universally consistent k-sample testing.
- Score: 18.327837489069907
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The k-sample testing problem involves determining whether $k$ groups of data
points are each drawn from the same distribution. The standard method for
k-sample testing in biomedicine is multivariate analysis of variance (MANOVA),
even though it depends on strong, and often unsuitable, parametric
assumptions. Moreover, independence testing and k-sample testing are closely
related, and several universally consistent high-dimensional independence tests,
such as distance correlation (Dcorr) and the Hilbert-Schmidt Independence Criterion
(Hsic), enjoy solid theoretical and empirical properties. In this paper, we
prove that independence tests achieve universally consistent k-sample testing
and that k-sample statistics such as Energy and Maximum Mean Discrepancy (MMD)
are precisely equivalent to Dcorr. An empirical evaluation of nonparametric
independence tests, covering several popular independence statistics and a
comprehensive set of simulations, showed that they generally perform better
than the popular MANOVA test, even in Gaussian-distributed scenarios.
Additionally, the testing approach was extended to perform multiway and
multilevel tests, which were demonstrated in a simulation study as well as on
real-world fMRI brain scans with a set of attributes.
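The reduction underlying this claim is straightforward to sketch: pool the $k$ groups into one data matrix, one-hot encode the group labels, and run an independence test such as Dcorr between the pooled data and the label matrix, using a permutation null for the p-value. The Python sketch below is illustrative only; the helper names (`dcorr`, `k_sample_dcorr`), the simulated Gaussian groups, and the use of the simpler biased distance-correlation estimator are assumptions for brevity, not the authors' implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist


def _double_center(D):
    # Double-center a pairwise distance matrix (biased Dcov convention).
    return D - D.mean(axis=0) - D.mean(axis=1, keepdims=True) + D.mean()


def dcorr(x, y):
    # Biased sample distance correlation between paired samples x and y.
    A = _double_center(cdist(x, x))
    B = _double_center(cdist(y, y))
    dcov2 = (A * B).mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return 0.0 if denom == 0 else dcov2 / denom


def k_sample_dcorr(groups, n_perms=1000, seed=0):
    # Reduce the k-sample problem to independence testing: pool the groups,
    # one-hot encode group membership, and test data-vs-label independence
    # with a permutation null.
    rng = np.random.default_rng(seed)
    x = np.vstack(groups)
    labels = np.repeat(np.arange(len(groups)), [len(g) for g in groups])
    y = np.eye(len(groups))[labels]  # one-hot label matrix
    observed = dcorr(x, y)
    null = np.array([dcorr(x, y[rng.permutation(len(y))])
                     for _ in range(n_perms)])
    pvalue = (1 + np.sum(null >= observed)) / (1 + n_perms)
    return observed, pvalue


# Toy example: three 3-dimensional Gaussian groups, one with a shifted mean.
rng = np.random.default_rng(1)
groups = [rng.normal(0.0, 1.0, size=(50, 3)),
          rng.normal(0.0, 1.0, size=(50, 3)),
          rng.normal(0.8, 1.0, size=(50, 3))]
stat, pvalue = k_sample_dcorr(groups)
print(f"Dcorr k-sample statistic: {stat:.3f}, permutation p-value: {pvalue:.4f}")
```

In practice, the bias-corrected Dcorr with a fast chi-square approximation of the null (as in the chi-square test of distance correlation listed among the related papers below) can replace the permutation loop, but the reduction itself is unchanged.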
Related papers
- A Sample Efficient Conditional Independence Test in the Presence of Discretization [54.047334792855345]
Applying Conditional Independence (CI) tests directly to discretized data can lead to incorrect conclusions. Recent advancements have sought to infer the correct CI relationship between the latent variables through binarizing the observed data. Motivated by this, this paper introduces a sample-efficient CI test that does not rely on the binarization process.
arXiv Detail & Related papers (2025-06-10T12:41:26Z) - Statistical Verification of Linear Classifiers [76.95660509846216]
We propose a homogeneity test closely related to the concept of linear separability between two samples.
We focus on establishing upper bounds for the test's p-value when applied to two-dimensional samples.
arXiv Detail & Related papers (2025-01-24T11:56:45Z) - On uniqueness of the set of k-means [0.5735035463793009]
We give an assessment of the consistency of the empirical k-means adapted to the setting of non-uniqueness.
We derive a bootstrap test for uniqueness of the set of k-means.
The results are illustrated with examples of different types of non-uniqueness.
arXiv Detail & Related papers (2024-10-17T12:40:56Z) - Detecting Adversarial Data by Probing Multiple Perturbations Using
Expected Perturbation Score [62.54911162109439]
Adversarial detection aims to determine whether a given sample is an adversarial one based on the discrepancy between natural and adversarial distributions.
We propose a new statistic called expected perturbation score (EPS), which is essentially the expected score of a sample after various perturbations.
We develop EPS-based maximum mean discrepancy (MMD) as a metric to measure the discrepancy between the test sample and natural samples.
arXiv Detail & Related papers (2023-05-25T13:14:58Z) - Using Perturbation to Improve Goodness-of-Fit Tests based on Kernelized
Stein Discrepancy [3.78967502155084]
Kernelized Stein discrepancy (KSD) is a score-based discrepancy widely used in goodness-of-fit tests.
We show theoretically and empirically that the KSD test can suffer from low power when the target and the alternative distributions have the same well-separated modes but differ in mixing proportions.
arXiv Detail & Related papers (2023-04-28T11:13:18Z) - Targeted Separation and Convergence with Kernel Discrepancies [61.973643031360254]
Kernel-based discrepancy measures are required to (i) separate a target P from other probability measures or (ii) control weak convergence to P.
In this article, we derive new sufficient and necessary conditions to ensure (i) and (ii).
For MMDs on separable metric spaces, we characterize those kernels that separate Bochner embeddable measures and introduce simple conditions for separating all measures with unbounded kernels.
arXiv Detail & Related papers (2022-09-26T16:41:16Z) - Predicting Out-of-Domain Generalization with Neighborhood Invariance [59.05399533508682]
We propose a measure of a classifier's output invariance in a local transformation neighborhood.
Our measure is simple to calculate, does not depend on the test point's true label, and can be applied even in out-of-domain (OOD) settings.
In experiments on benchmarks in image classification, sentiment analysis, and natural language inference, we demonstrate a strong and robust correlation between our measure and actual OOD generalization.
arXiv Detail & Related papers (2022-07-05T14:55:16Z) - Selective inference for k-means clustering [0.0]
We propose a finite-sample p-value that controls the selective Type I error for a test of the difference in means between a pair of clusters obtained using k-means clustering.
We apply our proposal in simulation, and on hand-written digits data and single-cell RNA-sequencing data.
arXiv Detail & Related papers (2022-03-29T06:28:12Z) - Nonparametric Conditional Local Independence Testing [69.31200003384122]
Conditional local independence is an independence relation among continuous-time processes.
No nonparametric test of conditional local independence has been available.
We propose such a nonparametric test based on double machine learning.
arXiv Detail & Related papers (2022-03-25T10:31:02Z) - Calibration of Neural Networks using Splines [51.42640515410253]
Measuring calibration error amounts to comparing two empirical distributions.
We introduce a binning-free calibration measure inspired by the classical Kolmogorov-Smirnov (KS) statistical test.
Our method consistently outperforms existing methods on KS error as well as other commonly used calibration measures.
arXiv Detail & Related papers (2020-06-23T07:18:05Z) - Stable Prediction via Leveraging Seed Variable [73.9770220107874]
Previous machine learning methods might exploit subtle spurious correlations in training data induced by non-causal variables for prediction.
We propose a conditional independence test-based algorithm to separate causal variables with a seed variable as a prior, and adopt them for stable prediction.
Our algorithm outperforms state-of-the-art methods for stable prediction.
arXiv Detail & Related papers (2020-06-09T06:56:31Z) - High-Dimensional Independence Testing via Maximum and Average Distance
Correlations [5.756296617325109]
We characterize consistency properties in high-dimensional settings with respect to the number of marginally dependent dimensions.
We examine the advantages of each test statistic, examine their respective null distributions, and present a fast chi-square-based testing procedure.
arXiv Detail & Related papers (2020-01-04T16:21:50Z) - The Chi-Square Test of Distance Correlation [7.748852202364896]
The chi-square test is non-parametric, extremely fast, and applicable to bias-corrected distance correlation using any strong negative type metric or characteristic kernel.
We show that the underlying chi-square distribution well approximates and dominates the limiting null distribution in the upper tail, and prove that the chi-square test is valid and consistent for testing independence.
arXiv Detail & Related papers (2019-12-27T15:16:40Z)