Testing Dependency of Unlabeled Databases
- URL: http://arxiv.org/abs/2311.05874v1
- Date: Fri, 10 Nov 2023 05:17:03 GMT
- Title: Testing Dependency of Unlabeled Databases
- Authors: Vered Paslev and Wasim Huleihel
- Abstract summary: Two random databases $mathsfXinmathcalXntimes d$ and $mathsfYinmathcalYntimes d$ are statistically dependent or not.
We characterize the thresholds at which optimal testing is information-theoretically impossible and possible.
- Score: 5.384630221560811
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we investigate the problem of deciding whether two random
databases $\mathsf{X}\in\mathcal{X}^{n\times d}$ and
$\mathsf{Y}\in\mathcal{Y}^{n\times d}$ are statistically dependent or not. This
is formulated as a hypothesis testing problem, where under the null hypothesis,
these two databases are statistically independent, while under the alternative,
there exists an unknown row permutation $\sigma$, such that $\mathsf{X}$ and
$\mathsf{Y}^\sigma$, a permuted version of $\mathsf{Y}$, are statistically
dependent with some known joint distribution, but have the same marginal
distributions as the null. We characterize the thresholds at which optimal
testing is information-theoretically impossible and possible, as a function of
$n$, $d$, and some spectral properties of the generative distributions of the
datasets. For example, we prove that if a certain function of the eigenvalues
of the likelihood function and $d$, is below a certain threshold, as
$d\to\infty$, then weak detection (performing slightly better than random
guessing) is statistically impossible, no matter what the value of $n$ is. This
mimics the performance of an efficient test that thresholds a centered version
of the log-likelihood function of the observed matrices. We also analyze the
case where $d$ is fixed, for which we derive strong (vanishing error) and weak
detection lower and upper bounds.
Related papers
- Statistical-Computational Trade-offs for Density Estimation [60.81548752871115]
We show that for a broad class of data structures their bounds cannot be significantly improved.
This is a novel emphstatistical-computational trade-off for density estimation.
arXiv Detail & Related papers (2024-10-30T15:03:33Z) - A computational transition for detecting correlated stochastic block models by low-degree polynomials [13.396246336911842]
Detection of correlation in a pair of random graphs is a fundamental statistical and computational problem that has been extensively studied in recent years.
We consider a pair of correlated block models $mathcalS(n,tfraclambdan;k,epsilon;s)$ that are subsampled from a common parent block model $mathcal S(n,tfraclambdan;k,epsilon;s)$
We focus on tests based on emphlow-degrees of the entries of the adjacency
arXiv Detail & Related papers (2024-09-02T06:14:05Z) - Transfer Operators from Batches of Unpaired Points via Entropic
Transport Kernels [3.099885205621181]
We derive a maximum-likelihood inference functional, propose a computationally tractable approximation and analyze their properties.
We prove a $Gamma$-convergence result showing that we can recover the true density from empirical approximations as the number $N$ of blocks goes to infinity.
arXiv Detail & Related papers (2024-02-13T12:52:41Z) - Detection of Correlated Random Vectors [7.320365821066746]
Two random vectors $mathsfXinmathbbRn$ and $mathsfYinmathbbRn$ are correlated or not.
We analyze the thresholds at which optimal testing is information-theoretically impossible and possible.
arXiv Detail & Related papers (2024-01-24T12:58:08Z) - Statistical Learning under Heterogeneous Distribution Shift [71.8393170225794]
Ground-truth predictor is additive $mathbbE[mathbfz mid mathbfx,mathbfy] = f_star(mathbfx) +g_star(mathbfy)$.
arXiv Detail & Related papers (2023-02-27T16:34:21Z) - Phase Transitions in the Detection of Correlated Databases [12.010807505655238]
We study the problem of detecting the correlation between two Gaussian databases $mathsfXinmathbbRntimes d$ and $mathsfYntimes d$, each composed of $n$ users with $d$ features.
This problem is relevant in the analysis of social media, computational biology, etc.
arXiv Detail & Related papers (2023-02-07T10:39:44Z) - Complexity of High-Dimensional Identity Testing with Coordinate Conditional Sampling [5.3098033683382155]
We study the identity testing problem for high-dimensional distributions.
We consider a significantly weaker conditional sampling oracle, which we call the $mathsfCoordinate Oracle$.
We prove that if an analytic property known as approximate tensorization of entropy holds for an $n$-dimensional visible distribution $mu$, then there is an efficient identity testing algorithm for any hidden distribution $pi$ using $tildeO(n/varepsilon)$ queries.
arXiv Detail & Related papers (2022-07-19T06:49:24Z) - Random matrices in service of ML footprint: ternary random features with
no performance loss [55.30329197651178]
We show that the eigenspectrum of $bf K$ is independent of the distribution of the i.i.d. entries of $bf w$.
We propose a novel random technique, called Ternary Random Feature (TRF)
The computation of the proposed random features requires no multiplication and a factor of $b$ less bits for storage compared to classical random features.
arXiv Detail & Related papers (2021-10-05T09:33:49Z) - Spectral properties of sample covariance matrices arising from random
matrices with independent non identically distributed columns [50.053491972003656]
It was previously shown that the functionals $texttr(AR(z))$, for $R(z) = (frac1nXXT- zI_p)-1$ and $Ain mathcal M_p$ deterministic, have a standard deviation of order $O(|A|_* / sqrt n)$.
Here, we show that $|mathbb E[R(z)] - tilde R(z)|_F
arXiv Detail & Related papers (2021-09-06T14:21:43Z) - The Sample Complexity of Robust Covariance Testing [56.98280399449707]
We are given i.i.d. samples from a distribution of the form $Z = (1-epsilon) X + epsilon B$, where $X$ is a zero-mean and unknown covariance Gaussian $mathcalN(0, Sigma)$.
In the absence of contamination, prior work gave a simple tester for this hypothesis testing task that uses $O(d)$ samples.
We prove a sample complexity lower bound of $Omega(d2)$ for $epsilon$ an arbitrarily small constant and $gamma
arXiv Detail & Related papers (2020-12-31T18:24:41Z) - Optimal Testing of Discrete Distributions with High Probability [49.19942805582874]
We study the problem of testing discrete distributions with a focus on the high probability regime.
We provide the first algorithms for closeness and independence testing that are sample-optimal, within constant factors.
arXiv Detail & Related papers (2020-09-14T16:09:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.