Near-Optimal Bounds for Testing Histogram Distributions
- URL: http://arxiv.org/abs/2207.06596v1
- Date: Thu, 14 Jul 2022 01:24:01 GMT
- Title: Near-Optimal Bounds for Testing Histogram Distributions
- Authors: Clément L. Canonne, Ilias Diakonikolas, Daniel M. Kane, and Sihan Liu
- Abstract summary: We show that the histogram testing problem has sample complexity $\widetilde{\Theta}(\sqrt{nk}/\varepsilon + k/\varepsilon^2 + \sqrt{n}/\varepsilon^2)$.
- Score: 35.18069719489173
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We investigate the problem of testing whether a discrete probability
distribution over an ordered domain is a histogram on a specified number of
bins. One of the most common tools for the succinct approximation of data,
$k$-histograms over $[n]$, are probability distributions that are piecewise
constant over a set of $k$ intervals. The histogram testing problem is the
following: Given samples from an unknown distribution $\mathbf{p}$ on $[n]$, we
want to distinguish between the cases that $\mathbf{p}$ is a $k$-histogram
versus $\varepsilon$-far from any $k$-histogram, in total variation distance.
Our main result is a sample near-optimal and computationally efficient
algorithm for this testing problem, and a nearly-matching (within logarithmic
factors) sample complexity lower bound. Specifically, we show that the
histogram testing problem has sample complexity $\widetilde \Theta (\sqrt{nk} /
\varepsilon + k / \varepsilon^2 + \sqrt{n} / \varepsilon^2)$.
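To make the objects in the abstract concrete, here is a minimal sketch (hypothetical helper names, not code from the paper) that builds a $k$-histogram over $[n]$ and evaluates the stated sample-complexity bound, ignoring logarithmic factors:

```python
import math

def k_histogram(n, breakpoints, masses):
    """Build a k-histogram over [n]: probability is constant on each interval.

    breakpoints: sorted endpoints 0 = b_0 < b_1 < ... < b_k = n
    masses: total probability assigned to each of the k intervals
    """
    p = [0.0] * n
    for (lo, hi), mass in zip(zip(breakpoints, breakpoints[1:]), masses):
        for i in range(lo, hi):
            p[i] = mass / (hi - lo)  # spread the interval's mass uniformly
    return p

def histogram_testing_samples(n, k, eps):
    """The paper's near-optimal bound, up to log factors:
    sqrt(nk)/eps + k/eps^2 + sqrt(n)/eps^2."""
    return math.sqrt(n * k) / eps + k / eps ** 2 + math.sqrt(n) / eps ** 2

# A 3-histogram on [12] with intervals [0,4), [4,6), [6,12):
p = k_histogram(12, [0, 4, 6, 12], [0.5, 0.2, 0.3])
```

The tester itself distinguishes such a piecewise-constant $\mathbf{p}$ from any distribution $\varepsilon$-far (in total variation) from every $k$-histogram; the sketch above only illustrates the model and the bound, not the algorithm.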
Related papers
- Dimension-free Private Mean Estimation for Anisotropic Distributions [55.86374912608193]
Previous private estimators on distributions over $\mathbb{R}^d$ suffer from a curse of dimensionality.
We present an algorithm whose sample complexity has improved dependence on dimension.
arXiv Detail & Related papers (2024-11-01T17:59:53Z) - Testing Identity of Distributions under Kolmogorov Distance in Polylogarithmic Space [1.2277343096128712]
In this paper, we show that much less space suffices -- we give an algorithm that uses space $O(\log^4 \varepsilon^{-1})$ in the streaming setting.
We also state 9 related open problems that we hope will spark interest in this and related problems.
arXiv Detail & Related papers (2024-10-29T15:24:27Z) - Testing Support Size More Efficiently Than Learning Histograms [0.18416014644193066]
We show that testing can be done more efficiently than learning the histogram of the distribution $p$.
The proof relies on an analysis of Chebyshev approximations outside the range where they are designed to be good approximations.
arXiv Detail & Related papers (2024-10-24T17:05:34Z) - Testing Closeness of Multivariate Distributions via Ramsey Theory [40.926523210945064]
We investigate the statistical task of closeness (or equivalence) testing for multidimensional distributions.
Specifically, given sample access to two unknown distributions $\mathbf{p}, \mathbf{q}$ on $\mathbb{R}^d$, we want to distinguish between the case that $\mathbf{p}=\mathbf{q}$ versus $\|\mathbf{p}-\mathbf{q}\|_{A_k} > \epsilon$.
Our main result is the first closeness tester for this problem with sub-learning sample complexity in any fixed dimension.
arXiv Detail & Related papers (2023-11-22T04:34:09Z) - Near-Optimal Bounds for Learning Gaussian Halfspaces with Random
Classification Noise [50.64137465792738]
We show that any efficient SQ algorithm for the problem requires sample complexity at least $\Omega(d^{1/2}/(\max\{p, \epsilon\})^2)$.
Our lower bound suggests that this quadratic dependence on $1/epsilon$ is inherent for efficient algorithms.
arXiv Detail & Related papers (2023-07-13T18:59:28Z) - Anonymized Histograms in Intermediate Privacy Models [54.32252900997422]
We provide an algorithm with a nearly matching error guarantee of $\tilde{O}_\varepsilon(\sqrt{n})$ in the shuffle DP and pan-private models.
Our algorithm is very simple: it just post-processes the discrete Laplace-noised histogram.
arXiv Detail & Related papers (2022-10-27T05:11:00Z) - Gaussian Mean Testing Made Simple [46.03021473600576]
Given i.i.d. samples from a distribution $p$ on $\mathbb{R}^d$, the task is to distinguish, with high probability, between the following cases.
We give an extremely simple algorithm for Gaussian mean testing with a one-page analysis.
arXiv Detail & Related papers (2022-10-25T01:59:13Z) - Nonasymptotic one- and two-sample tests in high dimension with unknown
covariance structure [0.0]
We consider the problem of testing if $\mu$ is $\eta$-close to zero, i.e. $\|\mu\| \leq \eta$, against $\|\mu\| \geq (\eta + \delta)$.
The aim of this paper is to obtain nonasymptotic upper and lower bounds on the minimal separation distance $\delta$ such that we can control both the Type I and Type II errors at a given level.
arXiv Detail & Related papers (2021-09-01T06:22:53Z) - The Sample Complexity of Robust Covariance Testing [56.98280399449707]
We are given i.i.d. samples from a distribution of the form $Z = (1-\epsilon) X + \epsilon B$, where $X$ is a zero-mean Gaussian $\mathcal{N}(0, \Sigma)$ with unknown covariance.
In the absence of contamination, prior work gave a simple tester for this hypothesis testing task that uses $O(d)$ samples.
We prove a sample complexity lower bound of $\Omega(d^2)$ for $\epsilon$ an arbitrarily small constant and $\gamma$
arXiv Detail & Related papers (2020-12-31T18:24:41Z) - Optimal Testing of Discrete Distributions with High Probability [49.19942805582874]
We study the problem of testing discrete distributions with a focus on the high probability regime.
We provide the first algorithms for closeness and independence testing that are sample-optimal, within constant factors.
arXiv Detail & Related papers (2020-09-14T16:09:17Z) - Locally Private Hypothesis Selection [96.06118559817057]
We output a distribution from $\mathcal{Q}$ whose total variation distance to $p$ is comparable to the best such distribution.
We show that the constraint of local differential privacy incurs an exponential increase in cost.
Our algorithms result in exponential improvements on the round complexity of previous methods.
arXiv Detail & Related papers (2020-02-21T18:30:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.