Accelerated Computation of a High Dimensional Kolmogorov-Smirnov
Distance
- URL: http://arxiv.org/abs/2106.13706v1
- Date: Fri, 25 Jun 2021 15:45:18 GMT
- Title: Accelerated Computation of a High Dimensional Kolmogorov-Smirnov
Distance
- Authors: Alex Hagen, Shane Jackson, James Kahn, Jan Strube, Isabel Haide, Karl
Pazdernik, Connor Hainje
- Abstract summary: We extend the powerful Kolmogorov-Smirnov two sample test to a high dimensional form.
We call our result the d-dimensional Kolmogorov-Smirnov test (ddKS).
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Statistical testing is widespread and critical for a variety of scientific
disciplines. The advent of machine learning and the increase in computing power
have increased interest in the analysis and statistical testing of
multidimensional data. We extend the powerful Kolmogorov-Smirnov two sample
test to a high dimensional form in a similar manner to Fasano (Fasano, 1987).
We call our result the d-dimensional Kolmogorov-Smirnov test (ddKS) and provide
three novel contributions therewith: we develop an analytical equation for the
significance of a given ddKS score, we provide an algorithm for computation of
ddKS on modern computing hardware that is of constant time complexity for small
sample sizes and dimensions, and we provide two approximate calculations of
ddKS: one that reduces the time complexity to linear at larger sample sizes,
and another that reduces the time complexity to linear with increasing
dimension. We perform power analysis of ddKS and its approximations on a corpus
of datasets and compare to other common high dimensional two sample tests and
distances: Hotelling's T^2 test and Kullback-Leibler divergence. Our ddKS test
performs well for all datasets, dimensions, and sizes tested, whereas the other
tests and distances fail to reject the null hypothesis on at least one dataset.
We therefore conclude that ddKS is a powerful multidimensional two sample test
for general use, and can be calculated in a fast and efficient manner using our
parallel or approximate methods. Open source implementations of all methods
described in this work are located at https://github.com/pnnl/ddks.
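The orthant-counting idea behind ddKS is simple enough to sketch directly. The snippet below is a naive NumPy reference implementation, quadratic in sample size and exponential in dimension, and is not the accelerated algorithm from pnnl/ddks; the function names are illustrative, and a permutation p-value stands in for the paper's analytical significance equation.

```python
import numpy as np

def ddks_distance(x, y):
    """Brute-force d-dimensional KS distance (Fasano-style orthant counting).

    Every point in either sample anchors a split of R^d into 2^d orthants;
    the statistic is the largest gap between the two samples' empirical
    orthant fractions. O((n + m)^2 * 2^d) time -- illustration only.
    """
    d = x.shape[1]
    powers = 1 << np.arange(d)  # bit weights encoding the orthant index
    stat = 0.0
    for anchor in np.vstack([x, y]):
        codes_x = ((x >= anchor).astype(int) * powers).sum(axis=1)
        codes_y = ((y >= anchor).astype(int) * powers).sum(axis=1)
        fx = np.bincount(codes_x, minlength=2 ** d) / len(x)
        fy = np.bincount(codes_y, minlength=2 ** d) / len(y)
        stat = max(stat, np.abs(fx - fy).max())
    return stat

def permutation_pvalue(x, y, n_perm=200, seed=0):
    """Permutation p-value: a stand-in for the paper's analytical
    significance formula, which is not reproduced here."""
    rng = np.random.default_rng(seed)
    observed = ddks_distance(x, y)
    pooled = np.vstack([x, y])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        px, py = pooled[perm[:len(x)]], pooled[perm[len(x):]]
        count += ddks_distance(px, py) >= observed
    return (count + 1) / (n_perm + 1)
```

The paper's parallel and approximate methods exist precisely because this brute-force double loop becomes expensive as sample size and dimension grow.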
Related papers
- Multi-Dimensional Visual Data Recovery: Scale-Aware Tensor Modeling and Accelerated Randomized Computation [51.65236537605077]
We propose a new type of network compression optimization technique, fully randomized tensor network compression (FCTN).
FCTN has significant advantages in correlation characterization and transpositional invariance, and has seen notable success in multi-dimensional data processing and analysis.
We derive efficient algorithms with guarantees to solve the formulated models.
arXiv Detail & Related papers (2026-02-13T14:56:37Z) - Efficient Covariance Estimation for Sparsified Functional Data [51.69796254617083]
The proposed Random-knots (Random-knots-Spatial) and B-spline (Bspline-Spatial) estimators of the covariance function are computationally efficient.
Asymptotic pointwise properties of the covariance estimators are obtained for sparsified individual trajectories under some regularity conditions.
arXiv Detail & Related papers (2025-11-23T00:50:33Z) - An Efficient Permutation-Based Kernel Two-Sample Test [13.229867216847534]
Two-sample hypothesis testing is a fundamental problem in statistics and machine learning.
In this work, we use a Nyström approximation of the maximum mean discrepancy (MMD) to design a computationally efficient and practical testing algorithm.
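As a rough illustration of the Nyström idea (not this paper's exact test), one can embed both samples with features built from a small set of landmark points and compare the embedded means; `nystrom_mmd` and its parameters are illustrative names.

```python
import numpy as np

def gaussian_gram(a, b, sigma=1.0):
    # Pairwise Gaussian kernel matrix between the rows of a and b.
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def nystrom_mmd(x, y, n_landmarks=50, sigma=1.0, seed=0):
    """MMD estimated in a Nystrom feature space: map both samples through
    features built from randomly chosen landmark points, then take the
    distance between the embedded means."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([x, y])
    m = min(n_landmarks, len(pooled))
    landmarks = pooled[rng.choice(len(pooled), m, replace=False)]
    K_mm = gaussian_gram(landmarks, landmarks, sigma)
    # Regularized inverse square root of the landmark Gram matrix.
    w, V = np.linalg.eigh(K_mm + 1e-8 * np.eye(m))
    inv_sqrt = V @ np.diag(1.0 / np.sqrt(np.clip(w, 1e-12, None))) @ V.T
    phi = lambda z: gaussian_gram(z, landmarks, sigma) @ inv_sqrt
    diff = phi(x).mean(axis=0) - phi(y).mean(axis=0)
    return np.sqrt(diff @ diff)
```

A permutation test over `nystrom_mmd` then gives calibrated p-values at far lower cost than the full quadratic-time MMD.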
arXiv Detail & Related papers (2025-02-19T09:22:48Z) - Parallel Simulation for Log-concave Sampling and Score-based Diffusion Models [55.07411490538404]
We propose a novel parallel sampling method that improves adaptive complexity dependence on dimension $d$.
Our approach builds on parallel simulation techniques from scientific computing.
arXiv Detail & Related papers (2024-12-10T11:50:46Z) - Learning Representations for Independence Testing [13.842061060076004]
We show how to construct powerful tests with finite-sample validity using variational estimators of mutual information.
We then establish a close connection between these variational mutual-information-based tests and tests based on the Hilbert-Schmidt Independence Criterion (HSIC).
Finally, we show how to select a representation that maximizes the power of the test rather than the statistic itself.
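For context, the HSIC referenced above has a compact textbook estimator; the sketch below computes the standard biased estimate with Gaussian kernels and is not tied to this paper's representation-learning procedure.

```python
import numpy as np

def hsic_biased(x, y, sigma_x=1.0, sigma_y=1.0):
    """Standard biased HSIC estimate with Gaussian kernels:
    HSIC = trace(K H L H) / (n - 1)^2, where H = I - (1/n) 11^T.
    x and y are paired samples of equal length n."""
    n = len(x)
    def gram(z, sigma):
        sq = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * sigma ** 2))
    K, L = gram(x, sigma_x), gram(y, sigma_y)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```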
arXiv Detail & Related papers (2024-09-10T22:18:07Z) - Learning Multi-Index Models with Neural Networks via Mean-Field Langevin Dynamics [21.55547541297847]
We study the problem of learning multi-index models in high-dimensions using a two-layer neural network trained with the mean-field Langevin algorithm.
Under mild distributional assumptions, we characterize the effective dimension $d_{\mathrm{eff}}$ that controls both sample and computational complexity.
arXiv Detail & Related papers (2024-08-14T02:13:35Z) - Efficient Quantum One-Class Support Vector Machines for Anomaly Detection Using Randomized Measurements and Variable Subsampling [5.23043157509344]
Quantum one-class support vector machines leverage the advantage of quantum kernel methods for semi-supervised anomaly detection.
Quantum randomized-measurement kernels and variable subsampling were proposed as two independent methods to address this problem.
The current work focuses instead on combining these two methods, along with rotated feature bagging, to achieve linear time complexity both to data size and to number of features.
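Variable subsampling itself is straightforward to sketch classically: fit each detector on a random subsample whose size is itself random, then average the anomaly scores. The sketch below uses scikit-learn's classical RBF one-class SVM as a stand-in for the quantum randomized-measurement kernels; the function name and size range are illustrative conventions, not this paper's code.

```python
import numpy as np
from sklearn.svm import OneClassSVM

def variable_subsampling_scores(X_train, X_test, n_models=10, seed=0):
    """Variable subsampling: fit each one-class SVM on a random subsample
    whose size is itself drawn at random, then average anomaly scores.
    Assumes len(X_train) >= 50; the 50-1000 size range is a common
    convention and illustrative only."""
    rng = np.random.default_rng(seed)
    scores = np.zeros(len(X_test))
    for _ in range(n_models):
        size = int(rng.integers(50, min(1000, len(X_train)) + 1))
        idx = rng.choice(len(X_train), size=size, replace=False)
        model = OneClassSVM(kernel="rbf", nu=0.1).fit(X_train[idx])
        # decision_function is larger for inliers; negate for anomaly score.
        scores -= model.decision_function(X_test)
    return scores / n_models
```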
arXiv Detail & Related papers (2024-07-30T11:55:52Z) - Computational-Statistical Trade-off in Kernel Two-Sample Testing with Random Fourier Features [3.744589644319257]
The Maximum Mean Discrepancy (MMD) test has emerged as an effective tool for handling complex and high-dimensional data.
It has been unclear whether it is possible to attain the same power guarantee as the MMD test at sub-quadratic time cost.
We show that it is possible to attain the same minimax separation rates as the MMD test within sub-quadratic time.
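The random-Fourier-feature construction at issue can be sketched in a few lines: approximate the Gaussian kernel with D random cosine features and compare feature means, which costs time linear in sample size for fixed D. This is a generic RFF-MMD sketch, not the paper's optimized trade-off; names are illustrative.

```python
import numpy as np

def rff_mmd_sq(x, y, n_features=100, sigma=1.0, seed=0):
    """Squared MMD estimated with random Fourier features for the
    Gaussian kernel: z(v) = sqrt(2/D) * cos(v @ W + b), with
    W_ij ~ N(0, 1/sigma^2) and b ~ U[0, 2*pi). Linear in sample
    size for a fixed number of features D."""
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, n_features))
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    z = lambda v: np.sqrt(2.0 / n_features) * np.cos(v @ W + b)
    diff = z(x).mean(axis=0) - z(y).mean(axis=0)
    return diff @ diff
```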
arXiv Detail & Related papers (2024-07-12T04:08:01Z) - Discovering physical laws with parallel combinatorial tree search [57.05912962368898]
Symbolic regression plays a crucial role in scientific research thanks to its capability of discovering concise and interpretable mathematical expressions from data.
Existing algorithms have faced a critical bottleneck of accuracy and efficiency over a decade.
We introduce a parallel tree search (PCTS) model to efficiently distill generic mathematical expressions from limited data.
arXiv Detail & Related papers (2024-07-05T10:41:15Z) - Computational-Statistical Gaps in Gaussian Single-Index Models [77.1473134227844]
Single-Index Models are high-dimensional regression problems with planted structure.
We show that computationally efficient algorithms, both within the Statistical Query (SQ) and the Low-Degree Polynomial (LDP) framework, necessarily require $\Omega(d^{k^\star/2})$ samples.
arXiv Detail & Related papers (2024-03-08T18:50:19Z) - Boosting the Power of Kernel Two-Sample Tests [4.07125466598411]
A kernel two-sample test based on the maximum mean discrepancy (MMD) is one of the most popular methods for detecting differences between two distributions over general metric spaces.
We propose a method to boost the power of the kernel test by combining MMD estimates over multiple kernels using their Mahalanobis distance.
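A rough sketch of the aggregation idea: compute a vector of MMD estimates across several bandwidths and combine them through a Mahalanobis distance. Here the null mean and covariance of the MMD vector are estimated by permutation as a stand-in for the paper's estimator; all names are illustrative.

```python
import numpy as np

def mmd_sq(x, y, sigma):
    # Biased (V-statistic) squared-MMD estimate with a Gaussian kernel.
    def gram(a, b):
        sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2 * sigma ** 2))
    return gram(x, x).mean() + gram(y, y).mean() - 2 * gram(x, y).mean()

def mahalanobis_mmd(x, y, sigmas=(0.5, 1.0, 2.0), n_perm=100, seed=0):
    """Combine MMD estimates across bandwidths via a Mahalanobis
    distance, with the null distribution of the MMD vector
    estimated by permutation."""
    rng = np.random.default_rng(seed)
    stat = np.array([mmd_sq(x, y, s) for s in sigmas])
    pooled = np.vstack([x, y])
    null = np.empty((n_perm, len(sigmas)))
    for i in range(n_perm):
        p = rng.permutation(len(pooled))
        px, py = pooled[p[:len(x)]], pooled[p[len(x):]]
        null[i] = [mmd_sq(px, py, s) for s in sigmas]
    S = np.cov(null, rowvar=False) + 1e-10 * np.eye(len(sigmas))
    centered = stat - null.mean(axis=0)
    return float(centered @ np.linalg.solve(S, centered))
```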
arXiv Detail & Related papers (2023-02-21T14:14:30Z) - Linearized Wasserstein dimensionality reduction with approximation
guarantees [65.16758672591365]
LOT Wassmap is a computationally feasible algorithm to uncover low-dimensional structures in the Wasserstein space.
We show that LOT Wassmap attains correct embeddings and that the quality improves with increased sample size.
We also show how LOT Wassmap significantly reduces the computational cost when compared to algorithms that depend on pairwise distance computations.
arXiv Detail & Related papers (2023-02-14T22:12:16Z) - Efficient Aggregated Kernel Tests using Incomplete $U$-statistics [22.251118308736327]
Three proposed tests aggregate over several kernel bandwidths to detect departures from the null on various scales.
We show that our proposed linear-time aggregated tests obtain higher power than current state-of-the-art linear-time kernel tests.
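The incomplete-U-statistic trick is easy to illustrate for MMD: average the core statistic h over a random subset of index pairs rather than all O(n^2) of them, so the cost is set by the number of sampled pairs. A generic sketch under illustrative names, not the paper's aggregated tests.

```python
import numpy as np

def k_gauss(a, b, sigma=1.0):
    # Elementwise Gaussian kernel between matched rows of a and b.
    return np.exp(-((a - b) ** 2).sum(-1) / (2 * sigma ** 2))

def incomplete_mmd(x, y, n_pairs=2000, sigma=1.0, seed=0):
    """Incomplete-U-statistic MMD: average the core statistic
    h(i, j) = k(x_i, x_j) + k(y_i, y_j) - k(x_i, y_j) - k(x_j, y_i)
    over a random subset of index pairs instead of all O(n^2)."""
    rng = np.random.default_rng(seed)
    n = min(len(x), len(y))
    i = rng.integers(0, n, size=n_pairs)
    j = rng.integers(0, n, size=n_pairs)
    keep = i != j                  # U-statistics exclude the diagonal
    i, j = i[keep], j[keep]
    h = (k_gauss(x[i], x[j], sigma) + k_gauss(y[i], y[j], sigma)
         - k_gauss(x[i], y[j], sigma) - k_gauss(x[j], y[i], sigma))
    return h.mean()
```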
arXiv Detail & Related papers (2022-06-18T12:30:06Z) - SreaMRAK a Streaming Multi-Resolution Adaptive Kernel Algorithm [60.61943386819384]
Existing implementations of KRR require that all the data is stored in the main memory.
We propose StreaMRAK - a streaming version of KRR.
We present a showcase study on two synthetic problems and the prediction of the trajectory of a double pendulum.
arXiv Detail & Related papers (2021-08-23T21:03:09Z) - Analysis of Truncated Orthogonal Iteration for Sparse Eigenvector
Problems [78.95866278697777]
We propose two variants of the Truncated Orthogonal Iteration to compute multiple leading eigenvectors with sparsity constraints simultaneously.
We then apply our algorithms to solve the sparse principal component analysis problem for a wide range of test datasets.
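The truncation idea can be sketched generically: run orthogonal (subspace) iteration and hard-threshold each column to its s largest-magnitude entries after every multiply step. This is an illustrative sketch, not either of the paper's two variants.

```python
import numpy as np

def truncated_orthogonal_iteration(A, k, s, n_iter=100, seed=0):
    """Orthogonal (subspace) iteration with hard thresholding. A is
    assumed symmetric (e.g. a covariance matrix); the thresholding
    drives each of the k basis columns toward s-sparse support."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    Q = np.linalg.qr(rng.normal(size=(n, k)))[0]
    for _ in range(n_iter):
        Z = A @ Q
        for c in range(k):
            # Zero all but the s largest-magnitude entries of column c.
            small = np.argsort(np.abs(Z[:, c]))[:-s]
            Z[small, c] = 0.0
        Q = np.linalg.qr(Z)[0]  # re-orthogonalize the truncated basis
    return Q
```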
arXiv Detail & Related papers (2021-03-24T23:11:32Z) - A Local Similarity-Preserving Framework for Nonlinear Dimensionality
Reduction with Neural Networks [56.068488417457935]
We propose a novel local nonlinear approach named Vec2vec for general purpose dimensionality reduction.
To train the neural network, we build the neighborhood similarity graph of a matrix and define the context of data points.
Experiments of data classification and clustering on eight real datasets show that Vec2vec is better than several classical dimensionality reduction methods in the statistical hypothesis test.
arXiv Detail & Related papers (2021-03-10T23:10:47Z) - Stochastic Approximation for Online Tensorial Independent Component
Analysis [98.34292831923335]
Independent component analysis (ICA) has been a popular dimension reduction tool in statistical machine learning and signal processing.
In this paper, we present a by-product online tensorial algorithm that yields an estimate of each independent component.
arXiv Detail & Related papers (2020-12-28T18:52:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.