Testing with Non-identically Distributed Samples
- URL: http://arxiv.org/abs/2311.11194v1
- Date: Sun, 19 Nov 2023 01:25:50 GMT
- Title: Testing with Non-identically Distributed Samples
- Authors: Shivam Garg, Chirag Pabbaraju, Kirankumar Shiragur, Gregory Valiant
- Abstract summary: We examine the extent to which sublinear-sample property testing and estimation applies to settings where samples are independently but not identically distributed.
Even with $c=1$ samples from each distribution, $\Theta(k/\varepsilon^2)$ samples are necessary and sufficient to learn $\textbf{p}_{\mathrm{avg}}$ to within error $\varepsilon$ in TV distance.
- Score: 20.74768558932617
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We examine the extent to which sublinear-sample property testing and
estimation applies to settings where samples are independently but not
identically distributed. Specifically, we consider the following distributional
property testing framework: Suppose there is a set of distributions over a
discrete support of size $k$, $\textbf{p}_1, \textbf{p}_2,\ldots,\textbf{p}_T$,
and we obtain $c$ independent draws from each distribution. Suppose the goal is
to learn or test a property of the average distribution,
$\textbf{p}_{\mathrm{avg}}$. This setup models a number of important practical
settings where the individual distributions correspond to heterogeneous
entities -- either individuals, chronologically distinct time periods,
spatially separated data sources, etc. From a learning standpoint, even with
$c=1$ samples from each distribution, $\Theta(k/\varepsilon^2)$ samples are
necessary and sufficient to learn $\textbf{p}_{\mathrm{avg}}$ to within error
$\varepsilon$ in TV distance. To test uniformity or identity -- distinguishing
the case that $\textbf{p}_{\mathrm{avg}}$ is equal to some reference
distribution, versus the case that it has $\ell_1$ distance at least $\varepsilon$ from the
reference distribution -- we show that a linear number of samples in $k$ is
necessary given $c=1$ samples from each distribution. In contrast, for $c \ge
2$, we recover the usual sublinear sample testing of the i.i.d. setting: we
show that $O(\sqrt{k}/\varepsilon^2 + 1/\varepsilon^4)$ samples are sufficient,
matching the optimal sample complexity in the i.i.d. case in the regime where
$\varepsilon \ge k^{-1/4}$. Additionally, we show that in the $c=2$ case, there
is a constant $\rho > 0$ such that even in the linear regime with $\rho k$
samples, no tester that considers the multiset of samples (ignoring which
samples were drawn from the same $\textbf{p}_i$) can perform uniformity
testing.
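The contrast between $c=1$ and $c \ge 2$ can be made concrete with a collision-style statistic. The sketch below (Python; the function names, the acceptance threshold, and the estimator itself are illustrative assumptions, not the paper's algorithm) shows why pairing information matters: within-source collisions give unbiased estimates of $\|\textbf{p}_i\|_2^2$, which a tester that sees only the multiset of samples cannot form. The variance analysis needed to justify the stated sample complexity is omitted.

```python
import numpy as np

def learn_p_avg(samples, k):
    """Empirical estimate of p_avg from the pooled draws.

    Pooling all draws and taking the empirical distribution learns p_avg
    to TV error eps with Theta(k / eps^2) total samples, even when c = 1.
    """
    return np.bincount(samples.ravel(), minlength=k) / samples.size

def estimate_l2_norm_sq(samples):
    """Unbiased estimate of ||p_avg||_2^2 from c = 2 draws per source.

    samples: int array of shape (T, 2); samples[i] holds the two draws
    from p_i, encoded as symbols in {0, ..., k-1}.
    """
    T = samples.shape[0]
    # Within-source collisions: 1[x_{i,1} = x_{i,2}] is unbiased for ||p_i||_2^2.
    within = np.sum(samples[:, 0] == samples[:, 1])
    # Cross-source collisions: for i != j, 1[x_{i,a} = x_{j,b}] is unbiased
    # for <p_i, p_j>; average the four (a, b) pairings and drop the diagonal.
    cross = 0.0
    for a in range(2):
        for b in range(2):
            eq = samples[:, a][:, None] == samples[:, b][None, :]
            cross += eq.sum() - np.trace(eq)
    cross /= 4.0
    return (within + cross) / T**2

def test_uniformity(samples, k, eps):
    """Accept 'p_avg is uniform' when the estimated ||p_avg||_2^2 is near 1/k.

    If p_avg is uniform then ||p_avg||_2^2 = 1/k; if p_avg is eps-far from
    uniform in l_1, Cauchy-Schwarz gives ||p_avg||_2^2 >= (1 + eps^2)/k.
    The midpoint threshold below is a simplification.
    """
    return estimate_l2_norm_sq(samples) <= (1 + eps**2 / 2) / k
```

The sketch rests on the identity $\|\textbf{p}_{\mathrm{avg}}\|_2^2 = \frac{1}{T^2}\big(\sum_i \|\textbf{p}_i\|_2^2 + \sum_{i \neq j} \langle \textbf{p}_i, \textbf{p}_j \rangle\big)$: cross-source collisions estimate the second term for any $c$, but the first term requires two draws from the same source, which is exactly what $c \ge 2$ provides.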
Related papers
- Dimension-free Private Mean Estimation for Anisotropic Distributions [55.86374912608193]
Previous private estimators on distributions over $\mathbb{R}^d$ suffer from a curse of dimensionality.
We present an algorithm whose sample complexity has improved dependence on dimension.
arXiv Detail & Related papers (2024-11-01T17:59:53Z) - Statistical-Computational Trade-offs for Density Estimation [60.81548752871115]
We show that for a broad class of data structures their bounds cannot be significantly improved.
This is a novel statistical-computational trade-off for density estimation.
arXiv Detail & Related papers (2024-10-30T15:03:33Z) - Outlier Robust Multivariate Polynomial Regression [27.03423421704806]
We are given a set of random samples $(\mathbf{x}_i, y_i) \in [-1,1]^n \times \mathbb{R}$ that are noisy versions of $(\mathbf{x}_i, p(\mathbf{x}_i))$.
The goal is to output a $\hat{p}$, within an $\ell_\infty$-distance of at most $O(\sigma)$ from $p$.
arXiv Detail & Related papers (2024-03-14T15:04:45Z) - Stochastic Approximation Approaches to Group Distributionally Robust
Optimization [96.26317627118912]
We study group distributionally robust optimization (GDRO).
We use online learning techniques to reduce the number of samples required in each round from $m$ to $1$, while keeping the same sample complexity.
We also give a novel formulation of weighted GDRO, which allows us to derive distribution-dependent convergence rates.
arXiv Detail & Related papers (2023-02-18T09:24:15Z) - Near-Optimal Bounds for Testing Histogram Distributions [35.18069719489173]
We show that the histogram testing problem has sample complexity $\widetilde{\Theta}(\sqrt{nk}/\varepsilon + k/\varepsilon^2 + \sqrt{n}/\varepsilon^2)$.
arXiv Detail & Related papers (2022-07-14T01:24:01Z) - Independence Testing for Bounded Degree Bayesian Network [4.230271396864461]
We show that if $P$ has a sparse structure, then in fact only linearly many samples are required.
We also show that if $P$ is Markov with respect to a Bayesian network whose underlying DAG has in-degree bounded by $d$, then $\tilde{\Theta}(2^{d/2} \cdot n/\varepsilon^2)$ samples are necessary and sufficient.
arXiv Detail & Related papers (2022-04-19T06:16:14Z) - Tight Bounds on the Hardness of Learning Simple Nonparametric Mixtures [9.053430799456587]
We study the problem of learning nonparametric distributions in a finite mixture.
We establish tight bounds on the sample complexity for learning the component distributions in such models.
arXiv Detail & Related papers (2022-03-28T23:53:48Z) - The Sample Complexity of Robust Covariance Testing [56.98280399449707]
We are given i.i.d. samples from a distribution of the form $Z = (1-\epsilon) X + \epsilon B$, where $X$ is a zero-mean Gaussian with unknown covariance, $\mathcal{N}(0, \Sigma)$.
In the absence of contamination, prior work gave a simple tester for this hypothesis testing task that uses $O(d)$ samples.
We prove a sample complexity lower bound of $\Omega(d^2)$ for $\epsilon$ an arbitrarily small constant and $\gamma$
arXiv Detail & Related papers (2020-12-31T18:24:41Z) - Optimal Testing of Discrete Distributions with High Probability [49.19942805582874]
We study the problem of testing discrete distributions with a focus on the high probability regime.
We provide the first algorithms for closeness and independence testing that are sample-optimal, within constant factors.
arXiv Detail & Related papers (2020-09-14T16:09:17Z) - Locally Private Hypothesis Selection [96.06118559817057]
We output a distribution from $\mathcal{Q}$ whose total variation distance to $p$ is comparable to the best such distribution.
We show that the constraint of local differential privacy incurs an exponential increase in cost.
Our algorithms result in exponential improvements on the round complexity of previous methods.
arXiv Detail & Related papers (2020-02-21T18:30:48Z)