Phase Transitions in the Detection of Correlated Databases
- URL: http://arxiv.org/abs/2302.03380v2
- Date: Thu, 4 May 2023 06:19:51 GMT
- Title: Phase Transitions in the Detection of Correlated Databases
- Authors: Dor Elimelech and Wasim Huleihel
- Abstract summary: We study the problem of detecting the correlation between two Gaussian databases $mathsfXinmathbbRntimes d$ and $mathsfYntimes d$, each composed of $n$ users with $d$ features.
This problem is relevant in the analysis of social media, computational biology, etc.
- Score: 12.010807505655238
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the problem of detecting the correlation between two Gaussian
databases $\mathsf{X}\in\mathbb{R}^{n\times d}$ and $\mathsf{Y}^{n\times d}$,
each composed of $n$ users with $d$ features. This problem is relevant in the
analysis of social media, computational biology, etc. We formulate this as a
hypothesis testing problem: under the null hypothesis, these two databases are
statistically independent. Under the alternative, however, there exists an
unknown permutation $\sigma$ over the set of $n$ users (or, row permutation),
such that $\mathsf{X}$ is $\rho$-correlated with $\mathsf{Y}^\sigma$, a
permuted version of $\mathsf{Y}$. We determine sharp thresholds at which
optimal testing exhibits a phase transition, depending on the asymptotic regime
of $n$ and $d$. Specifically, we prove that if $\rho^2d\to0$, as $d\to\infty$,
then weak detection (performing slightly better than random guessing) is
statistically impossible, irrespectively of the value of $n$. This compliments
the performance of a simple test that thresholds the sum all entries of
$\mathsf{X}^T\mathsf{Y}$. Furthermore, when $d$ is fixed, we prove that strong
detection (vanishing error probability) is impossible for any
$\rho<\rho^\star$, where $\rho^\star$ is an explicit function of $d$, while
weak detection is again impossible as long as $\rho^2d\to0$. These results
close significant gaps in current recent related studies.
Related papers
- Guarantees for Nonlinear Representation Learning: Non-identical Covariates, Dependent Data, Fewer Samples [24.45016514352055]
We study the sample-complexity of learning $T+1$ functions $f_star(t) circ g_star$ from a function class $mathcal F times mathcal G$.
We show that as the number of tasks $T$ increases, both the sample requirement and risk bound converge to that of $r$-dimensional regression.
arXiv Detail & Related papers (2024-10-15T03:20:19Z) - Outlier Robust Multivariate Polynomial Regression [27.03423421704806]
We are given a set of random samples $(mathbfx_i,y_i) in [-1,1]n times mathbbR$ that are noisy versions of $(mathbfx_i,p(mathbfx_i)$.
The goal is to output a $hatp$, within an $ell_in$-distance of at most $O(sigma)$ from $p$.
arXiv Detail & Related papers (2024-03-14T15:04:45Z) - Testing Dependency of Unlabeled Databases [5.384630221560811]
Two random databases $mathsfXinmathcalXntimes d$ and $mathsfYinmathcalYntimes d$ are statistically dependent or not.
We characterize the thresholds at which optimal testing is information-theoretically impossible and possible.
arXiv Detail & Related papers (2023-11-10T05:17:03Z) - Statistical Learning under Heterogeneous Distribution Shift [71.8393170225794]
Ground-truth predictor is additive $mathbbE[mathbfz mid mathbfx,mathbfy] = f_star(mathbfx) +g_star(mathbfy)$.
arXiv Detail & Related papers (2023-02-27T16:34:21Z) - Nonasymptotic one-and two-sample tests in high dimension with unknown
covariance structure [0.0]
We consider the problem of testing if $mu is $eta-close to zero, i.e. $|mu| leq eta against $|mu| geq (eta + delta)$.
The aim of this paper is to obtain nonasymptotic upper and lower bounds on the minimal separation distancedelta$ such that we can control both the Type I and Type II errors at a given level.
arXiv Detail & Related papers (2021-09-01T06:22:53Z) - Threshold Phenomena in Learning Halfspaces with Massart Noise [56.01192577666607]
We study the problem of PAC learning halfspaces on $mathbbRd$ with Massart noise under Gaussian marginals.
Our results qualitatively characterize the complexity of learning halfspaces in the Massart model.
arXiv Detail & Related papers (2021-08-19T16:16:48Z) - Mediated Uncoupled Learning: Learning Functions without Direct
Input-output Correspondences [80.95776331769899]
We consider the task of predicting $Y$ from $X$ when we have no paired data of them.
A naive approach is to predict $U$ from $X$ using $S_X$ and then $Y$ from $U$ using $S_Y$.
We propose a new method that avoids predicting $U$ but directly learns $Y = f(X)$ by training $f(X)$ with $S_X$ to predict $h(U)$.
arXiv Detail & Related papers (2021-07-16T22:13:29Z) - Optimal Spectral Recovery of a Planted Vector in a Subspace [80.02218763267992]
We study efficient estimation and detection of a planted vector $v$ whose $ell_4$ norm differs from that of a Gaussian vector with the same $ell$ norm.
We show that in the regime $n rho gg sqrtN$, any spectral method from a large class (and more generally, any low-degree of the input) fails to detect the planted vector.
arXiv Detail & Related papers (2021-05-31T16:10:49Z) - The Sparse Hausdorff Moment Problem, with Application to Topic Models [5.151973524974052]
We give an algorithm for identifying a $k$-mixture using samples of $m=2k$ iid binary random variables.
It suffices to know the moments to additive accuracy $w_mincdotzetaO(k)$.
arXiv Detail & Related papers (2020-07-16T04:23:57Z) - Locally Private Hypothesis Selection [96.06118559817057]
We output a distribution from $mathcalQ$ whose total variation distance to $p$ is comparable to the best such distribution.
We show that the constraint of local differential privacy incurs an exponential increase in cost.
Our algorithms result in exponential improvements on the round complexity of previous methods.
arXiv Detail & Related papers (2020-02-21T18:30:48Z) - Agnostic Q-learning with Function Approximation in Deterministic
Systems: Tight Bounds on Approximation Error and Sample Complexity [94.37110094442136]
We study the problem of agnostic $Q$-learning with function approximation in deterministic systems.
We show that if $delta = Oleft(rho/sqrtdim_Eright)$, then one can find the optimal policy using $Oleft(dim_Eright)$.
arXiv Detail & Related papers (2020-02-17T18:41:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.