Testing Dependency of Unlabeled Databases
- URL: http://arxiv.org/abs/2311.05874v1
- Date: Fri, 10 Nov 2023 05:17:03 GMT
- Title: Testing Dependency of Unlabeled Databases
- Authors: Vered Paslev and Wasim Huleihel
- Abstract summary: Two random databases $mathsfXinmathcalXntimes d$ and $mathsfYinmathcalYntimes d$ are statistically dependent or not.
We characterize the thresholds at which optimal testing is information-theoretically impossible and possible.
- Score: 5.384630221560811
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we investigate the problem of deciding whether two random
databases $\mathsf{X}\in\mathcal{X}^{n\times d}$ and
$\mathsf{Y}\in\mathcal{Y}^{n\times d}$ are statistically dependent or not. This
is formulated as a hypothesis testing problem, where under the null hypothesis,
these two databases are statistically independent, while under the alternative,
there exists an unknown row permutation $\sigma$, such that $\mathsf{X}$ and
$\mathsf{Y}^\sigma$, a permuted version of $\mathsf{Y}$, are statistically
dependent with some known joint distribution, but have the same marginal
distributions as the null. We characterize the thresholds at which optimal
testing is information-theoretically impossible and possible, as a function of
$n$, $d$, and some spectral properties of the generative distributions of the
datasets. For example, we prove that if a certain function of the eigenvalues
of the likelihood function and $d$, is below a certain threshold, as
$d\to\infty$, then weak detection (performing slightly better than random
guessing) is statistically impossible, no matter what the value of $n$ is. This
mimics the performance of an efficient test that thresholds a centered version
of the log-likelihood function of the observed matrices. We also analyze the
case where $d$ is fixed, for which we derive strong (vanishing error) and weak
detection lower and upper bounds.
Related papers
- Transfer Operators from Batches of Unpaired Points via Entropic
Transport Kernels [3.099885205621181]
We derive a maximum-likelihood inference functional, propose a computationally tractable approximation and analyze their properties.
We prove a $Gamma$-convergence result showing that we can recover the true density from empirical approximations as the number $N$ of blocks goes to infinity.
arXiv Detail & Related papers (2024-02-13T12:52:41Z) - Detection of Correlated Random Vectors [7.320365821066746]
Two random vectors $mathsfXinmathbbRn$ and $mathsfYinmathbbRn$ are correlated or not.
We analyze the thresholds at which optimal testing is information-theoretically impossible and possible.
arXiv Detail & Related papers (2024-01-24T12:58:08Z) - Estimation and Inference in Distributional Reinforcement Learning [28.253677740976197]
We show that a dataset of size $widetilde Oleft(frac|mathcalS||mathcalA|epsilon2 (1-gamma)4right)$ suffices to ensure the Kolmogorov metric and total variation metric between $hatetapi$ and $etapi$ is below $epsilon$ with high probability.
Our findings give rise to a unified approach to statistical inference of a wide class of statistical functionals of $etapi$.
arXiv Detail & Related papers (2023-09-29T14:14:53Z) - Statistical Learning under Heterogeneous Distribution Shift [71.8393170225794]
Ground-truth predictor is additive $mathbbE[mathbfz mid mathbfx,mathbfy] = f_star(mathbfx) +g_star(mathbfy)$.
arXiv Detail & Related papers (2023-02-27T16:34:21Z) - Phase Transitions in the Detection of Correlated Databases [12.010807505655238]
We study the problem of detecting the correlation between two Gaussian databases $mathsfXinmathbbRntimes d$ and $mathsfYntimes d$, each composed of $n$ users with $d$ features.
This problem is relevant in the analysis of social media, computational biology, etc.
arXiv Detail & Related papers (2023-02-07T10:39:44Z) - Random matrices in service of ML footprint: ternary random features with
no performance loss [55.30329197651178]
We show that the eigenspectrum of $bf K$ is independent of the distribution of the i.i.d. entries of $bf w$.
We propose a novel random technique, called Ternary Random Feature (TRF)
The computation of the proposed random features requires no multiplication and a factor of $b$ less bits for storage compared to classical random features.
arXiv Detail & Related papers (2021-10-05T09:33:49Z) - Spectral properties of sample covariance matrices arising from random
matrices with independent non identically distributed columns [50.053491972003656]
It was previously shown that the functionals $texttr(AR(z))$, for $R(z) = (frac1nXXT- zI_p)-1$ and $Ain mathcal M_p$ deterministic, have a standard deviation of order $O(|A|_* / sqrt n)$.
Here, we show that $|mathbb E[R(z)] - tilde R(z)|_F
arXiv Detail & Related papers (2021-09-06T14:21:43Z) - The Sample Complexity of Robust Covariance Testing [56.98280399449707]
We are given i.i.d. samples from a distribution of the form $Z = (1-epsilon) X + epsilon B$, where $X$ is a zero-mean and unknown covariance Gaussian $mathcalN(0, Sigma)$.
In the absence of contamination, prior work gave a simple tester for this hypothesis testing task that uses $O(d)$ samples.
We prove a sample complexity lower bound of $Omega(d2)$ for $epsilon$ an arbitrarily small constant and $gamma
arXiv Detail & Related papers (2020-12-31T18:24:41Z) - Optimal Testing of Discrete Distributions with High Probability [49.19942805582874]
We study the problem of testing discrete distributions with a focus on the high probability regime.
We provide the first algorithms for closeness and independence testing that are sample-optimal, within constant factors.
arXiv Detail & Related papers (2020-09-14T16:09:17Z) - Agnostic Learning of a Single Neuron with Gradient Descent [92.7662890047311]
We consider the problem of learning the best-fitting single neuron as measured by the expected square loss.
For the ReLU activation, our population risk guarantee is $O(mathsfOPT1/2)+epsilon$.
For the ReLU activation, our population risk guarantee is $O(mathsfOPT1/2)+epsilon$.
arXiv Detail & Related papers (2020-05-29T07:20:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.