Generalized Multivariate Signs for Nonparametric Hypothesis Testing in
High Dimensions
- URL: http://arxiv.org/abs/2107.01103v1
- Date: Fri, 2 Jul 2021 14:31:44 GMT
- Authors: Subhabrata Majumdar, Snigdhansu Chatterjee
- Abstract summary: We show that tests using generalized signs display higher power than existing tests, while maintaining nominal type-I error rates.
We provide example applications on the MNIST and Minnesota Twin Studies genomic data.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: High-dimensional data, where the dimension of the feature space is much
larger than sample size, arise in a number of statistical applications. In this
context, we construct the generalized multivariate sign transformation, defined
as a vector divided by its norm. For different choices of the norm function,
the resulting transformed vector adapts to certain geometrical features of the
data distribution. Building on this idea, we obtain one-sample and
two-sample testing procedures for mean vectors of high-dimensional data using
these generalized sign vectors. These tests are based on U-statistics using
kernel inner products, do not require prohibitive assumptions, and are amenable
to a fast randomization-based implementation. Through experiments in a number
of data settings, we show that tests using generalized signs display higher
power than existing tests, while maintaining nominal type-I error rates.
Finally, we provide example applications on the MNIST and Minnesota Twin
Studies genomic data.
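As a rough illustration of the idea in the abstract, the generalized sign transform and a kernel-inner-product U-statistic for the two-sample mean problem can be sketched as follows. All function names, the default Euclidean norm, and the unbiased within/cross decomposition are illustrative choices, not the authors' implementation:

```python
import numpy as np

def generalized_sign(x, norm=np.linalg.norm):
    """Generalized multivariate sign: x divided by its norm.

    Different choices of `norm` adapt the transformed vector to
    different geometric features of the data distribution.
    """
    n = norm(x)
    return x / n if n > 0 else np.zeros_like(x)

def two_sample_sign_stat(X, Y, norm=np.linalg.norm):
    """U-statistic-style two-sample statistic on generalized signs.

    Estimates the squared distance between the mean sign vectors of
    the two samples via inner products, using only off-diagonal
    (i != j) terms so the within-sample parts are unbiased.
    """
    Sx = np.array([generalized_sign(x, norm) for x in X])
    Sy = np.array([generalized_sign(y, norm) for y in Y])
    n, m = len(Sx), len(Sy)
    Gxx = Sx @ Sx.T
    Gyy = Sy @ Sy.T
    within_x = (Gxx.sum() - np.trace(Gxx)) / (n * (n - 1))
    within_y = (Gyy.sum() - np.trace(Gyy)) / (m * (m - 1))
    cross = (Sx @ Sy.T).mean()
    return within_x + within_y - 2 * cross
```

In practice the statistic would be calibrated by permuting the pooled sample, in the spirit of the fast randomization-based implementation the abstract mentions.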
Related papers
- Downstream-Pretext Domain Knowledge Traceback for Active Learning [138.02530777915362]
We propose a downstream-pretext domain knowledge traceback (DOKT) method that traces the data interactions of downstream knowledge and pre-training guidance.
DOKT consists of a traceback diversity indicator and a domain-based uncertainty estimator.
Experiments conducted on ten datasets show that our model outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-20T01:34:13Z) - Computational-Statistical Gaps in Gaussian Single-Index Models [77.1473134227844]
Single-Index Models are high-dimensional regression problems with planted structure.
We show that computationally efficient algorithms, both within the Statistical Query (SQ) and the Low-Degree Polynomial (LDP) frameworks, necessarily require $\Omega(d^{k^\star/2})$ samples.
arXiv Detail & Related papers (2024-03-08T18:50:19Z) - Testing multivariate normality by testing independence [0.0]
We propose a simple multivariate normality test based on the Kac-Bernstein characterization, which can be conducted by utilising existing statistical independence tests for sums and differences of data samples.
We also perform an empirical investigation, which reveals that for high-dimensional data the proposed approach may be more efficient than existing alternatives.
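The Kac-Bernstein characterization (X1 + X2 and X1 - X2 are independent iff the data are Gaussian) can be sketched in a few lines: split the sample into pairs, form sums and differences, and measure their dependence with any independence statistic. Here a biased-sample distance correlation stands in for the independence test; all names below are illustrative, not the paper's code:

```python
import numpy as np

def _pairwise_dist(x):
    """Euclidean distance matrix for an (n, d) array."""
    x = x.reshape(len(x), -1)
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def _double_center(D):
    return D - D.mean(0) - D.mean(1)[:, None] + D.mean()

def distance_correlation(x, y):
    """Biased-sample distance correlation, in [0, 1]."""
    A = _double_center(_pairwise_dist(x))
    B = _double_center(_pairwise_dist(y))
    dcov2 = (A * B).mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return np.sqrt(max(dcov2, 0.0) / denom) if denom > 0 else 0.0

def kac_bernstein_stat(X):
    """Dependence between pairwise sums and differences.

    A small value is consistent with normality, since sums and
    differences of i.i.d. pairs are independent only for Gaussians.
    """
    X = X.reshape(len(X), -1)
    half = len(X) // 2
    X1, X2 = X[:half], X[half:2 * half]
    return distance_correlation(X1 + X2, X1 - X2)
```

A permutation of one of the two blocks would give a null distribution for the statistic.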
arXiv Detail & Related papers (2023-11-20T07:19:52Z) - On Extreme Value Asymptotics of Projected Sample Covariances in High
Dimensions with Applications in Finance and Convolutional Networks [0.0]
We show that Gumbel-type extreme value asymptotics hold within a linear time series framework.
As applications we discuss long-only minimal-variance portfolio optimization and sub-portfolio analysis with respect to idiosyncratic risks.
arXiv Detail & Related papers (2023-10-12T09:17:46Z) - A framework for paired-sample hypothesis testing for high-dimensional
data [7.400168551191579]
We put forward the idea that scoring functions can be produced by the decision rules defined by the bisecting hyperplanes of the line segments connecting each pair of instances.
First, we estimate the bisecting hyperplanes for each pair of instances and an aggregated rule derived through the Hodges-Lehmann estimator.
arXiv Detail & Related papers (2023-09-28T09:17:11Z) - Intrinsic dimension estimation for discrete metrics [65.5438227932088]
In this letter we introduce an algorithm to infer the intrinsic dimension (ID) of datasets embedded in discrete spaces.
We demonstrate its accuracy on benchmark datasets, and we apply it to analyze a metagenomic dataset for species fingerprinting.
This suggests that evolutionary pressure acts on a low-dimensional manifold despite the high dimensionality of sequence space.
arXiv Detail & Related papers (2022-07-20T06:38:36Z) - Predicting Out-of-Domain Generalization with Neighborhood Invariance [59.05399533508682]
We propose a measure of a classifier's output invariance in a local transformation neighborhood.
Our measure is simple to calculate, does not depend on the test point's true label, and can be applied even in out-of-domain (OOD) settings.
In experiments on benchmarks in image classification, sentiment analysis, and natural language inference, we demonstrate a strong and robust correlation between our measure and actual OOD generalization.
arXiv Detail & Related papers (2022-07-05T14:55:16Z) - Toward Learning Robust and Invariant Representations with Alignment
Regularization and Data Augmentation [76.85274970052762]
This paper is motivated by a proliferation of options of alignment regularizations.
We evaluate the performances of several popular design choices along the dimensions of robustness and invariance.
We also formally analyze the behavior of alignment regularization to complement our empirical study under assumptions we consider realistic.
arXiv Detail & Related papers (2022-06-04T04:29:19Z) - Estimating Graph Dimension with Cross-validated Eigenvalues [5.0013150536632995]
In applied statistics, estimating the number of latent dimensions or the number of clusters is a fundamental and recurring problem.
We provide a cross-validated eigenvalues approach to this problem.
We prove that our procedure consistently estimates $k$ in scenarios where all $k$ dimensions can be estimated.
arXiv Detail & Related papers (2021-08-06T23:52:30Z) - Double Generative Adversarial Networks for Conditional Independence
Testing [8.359770027722275]
High-dimensional conditional independence testing is a key building block in statistics and machine learning.
We propose an inferential procedure based on double generative adversarial networks (GANs).
arXiv Detail & Related papers (2020-06-03T16:14:15Z) - Asymptotic Analysis of an Ensemble of Randomly Projected Linear
Discriminants [94.46276668068327]
In [1], an ensemble of randomly projected linear discriminants is used to classify datasets.
We develop a consistent estimator of the misclassification probability as an alternative to the computationally-costly cross-validation estimator.
We also demonstrate the use of our estimator for tuning the projection dimension on both real and synthetic data.
arXiv Detail & Related papers (2020-04-17T12:47:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.