An Efficient Permutation-Based Kernel Two-Sample Test
- URL: http://arxiv.org/abs/2502.13570v1
- Date: Wed, 19 Feb 2025 09:22:48 GMT
- Title: An Efficient Permutation-Based Kernel Two-Sample Test
- Authors: Antoine Chatalic, Marco Letizia, Nicolas Schreuder, and Lorenzo Rosasco,
- Abstract summary: Two-sample hypothesis testing is a fundamental problem in statistics and machine learning.
In this work, we use a Nystr"om approximation of the maximum mean discrepancy (MMD) to design a computationally efficient and practical testing algorithm.
- Score: 12.331562761756679
- License:
- Abstract: Two-sample hypothesis testing-determining whether two sets of data are drawn from the same distribution-is a fundamental problem in statistics and machine learning with broad scientific applications. In the context of nonparametric testing, maximum mean discrepancy (MMD) has gained popularity as a test statistic due to its flexibility and strong theoretical foundations. However, its use in large-scale scenarios is plagued by high computational costs. In this work, we use a Nystr\"om approximation of the MMD to design a computationally efficient and practical testing algorithm while preserving statistical guarantees. Our main result is a finite-sample bound on the power of the proposed test for distributions that are sufficiently separated with respect to the MMD. The derived separation rate matches the known minimax optimal rate in this setting. We support our findings with a series of numerical experiments, emphasizing realistic scientific data.
Related papers
- Computational-Statistical Trade-off in Kernel Two-Sample Testing with Random Fourier Features [3.744589644319257]
The Maximum Mean Discrepancy (MMD) test has emerged as an effective tool for handling complex and high-dimensional data.
It has been unclear whether it is possible to attain the same power guarantee as the MMD test at sub-quadratic time cost.
We show that it is possible to attain the same minimax separation rates as the MMD test within sub-quadratic time.
arXiv Detail & Related papers (2024-07-12T04:08:01Z) - Precise Error Rates for Computationally Efficient Testing [75.63895690909241]
We revisit the question of simple-versus-simple hypothesis testing with an eye towards computational complexity.
An existing test based on linear spectral statistics achieves the best possible tradeoff curve between type I and type II error rates.
arXiv Detail & Related papers (2023-11-01T04:41:16Z) - MMD-FUSE: Learning and Combining Kernels for Two-Sample Testing Without
Data Splitting [28.59390881834003]
We propose novel statistics which maximise the power of a two-sample test based on the Maximum Mean Discrepancy (MMD)
We show how these kernels can be chosen in a data-dependent but permutation-independent way, in a well-calibrated test, avoiding data splitting.
We highlight the applicability of our MMD-FUSE test on both synthetic low-dimensional and real-world high-dimensional data, and compare its performance in terms of power against current state-of-the-art kernel tests.
arXiv Detail & Related papers (2023-06-14T23:13:03Z) - Boosting the Power of Kernel Two-Sample Tests [4.07125466598411]
A kernel two-sample test based on the maximum mean discrepancy (MMD) is one of the most popular methods for detecting differences between two distributions over general metric spaces.
We propose a method to boost the power of the kernel test by combining MMD estimates over multiple kernels using their Mahalanobis distance.
arXiv Detail & Related papers (2023-02-21T14:14:30Z) - Spectral Regularized Kernel Two-Sample Tests [7.915420897195129]
We show the popular MMD (maximum mean discrepancy) two-sample test to be not optimal in terms of the separation boundary measured in Hellinger distance.
We propose a modification to the MMD test based on spectral regularization and prove the proposed test to be minimax optimal with a smaller separation boundary than that achieved by the MMD test.
Our results hold for the permutation variant of the test where the test threshold is chosen elegantly through the permutation of the samples.
arXiv Detail & Related papers (2022-12-19T00:42:21Z) - Importance sampling for stochastic quantum simulations [68.8204255655161]
We introduce the qDrift protocol, which builds random product formulas by sampling from the Hamiltonian according to the coefficients.
We show that the simulation cost can be reduced while achieving the same accuracy, by considering the individual simulation cost during the sampling stage.
Results are confirmed by numerical simulations performed on a lattice nuclear effective field theory.
arXiv Detail & Related papers (2022-12-12T15:06:32Z) - Sequential Permutation Testing of Random Forest Variable Importance
Measures [68.8204255655161]
It is proposed here to use sequential permutation tests and sequential p-value estimation to reduce the high computational costs associated with conventional permutation tests.
The results of simulation studies confirm that the theoretical properties of the sequential tests apply.
The numerical stability of the methods is investigated in two additional application studies.
arXiv Detail & Related papers (2022-06-02T20:16:50Z) - Nonparametric Conditional Local Independence Testing [69.31200003384122]
Conditional local independence is an independence relation among continuous time processes.
No nonparametric test of conditional local independence has been available.
We propose such a nonparametric test based on double machine learning.
arXiv Detail & Related papers (2022-03-25T10:31:02Z) - Significance tests of feature relevance for a blackbox learner [6.72450543613463]
We derive two consistent tests for the feature relevance of a blackbox learner.
The first evaluates a loss difference with perturbation on an inference sample.
The second splits the inference sample into two but does not require data perturbation.
arXiv Detail & Related papers (2021-03-02T00:59:19Z) - Maximum Mean Discrepancy Test is Aware of Adversarial Attacks [122.51040127438324]
The maximum mean discrepancy (MMD) test could in principle detect any distributional discrepancy between two datasets.
It has been shown that the MMD test is unaware of adversarial attacks.
arXiv Detail & Related papers (2020-10-22T03:42:12Z) - Two-Sample Testing on Ranked Preference Data and the Role of Modeling
Assumptions [57.77347280992548]
In this paper, we design two-sample tests for pairwise comparison data and ranking data.
Our test requires essentially no assumptions on the distributions.
By applying our two-sample test on real-world pairwise comparison data, we conclude that ratings and rankings provided by people are indeed distributed differently.
arXiv Detail & Related papers (2020-06-21T20:51:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.