Estimating the size of a set using cascading exclusion
- URL: http://arxiv.org/abs/2508.05901v2
- Date: Mon, 08 Sep 2025 21:17:28 GMT
- Title: Estimating the size of a set using cascading exclusion
- Authors: Sourav Chatterjee, Persi Diaconis, Susan Holmes
- Abstract summary: A general non-asymptotic theory is developed. A general theorem gives non-parametric finite $n$ error bounds in all cases.
- Score: 2.1665689529884697
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Let $S$ be a finite set, and $X_1,\ldots,X_n$ an i.i.d. uniform sample from $S$. To estimate the size $|S|$, without further structure, one can wait for repeats and use the birthday problem. This requires a sample size of the order $|S|^\frac{1}{2}$. On the other hand, if $S=\{1,2,\ldots,|S|\}$, the maximum of the sample blown up by $n/(n-1)$ gives an efficient estimator based on any growing sample size. This paper gives refinements that interpolate between these extremes. A general non-asymptotic theory is developed. This includes estimating the volume of a compact convex set, the unseen species problem, and a host of testing problems that follow from the question `Is this new observation a typical pick from a large prespecified population?' We also treat regression style predictors. A general theorem gives non-parametric finite $n$ error bounds in all cases.
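As an illustrative sketch of the two extremes described in the abstract (not the paper's refined interpolating estimators), both baseline estimators fit in a few lines of Python; the set size and sample size below are invented for the demonstration:

```python
import random

def birthday_estimate(sample):
    """Collision-based estimate: E[#repeats] is about C(n,2)/|S|,
    so |S| is estimated by C(n,2)/#repeats.
    Requires a sample size on the order of |S|^(1/2)."""
    n = len(sample)
    repeats = n - len(set(sample))
    if repeats == 0:
        return None  # sample too small: no collision observed yet
    return n * (n - 1) / (2 * repeats)

def max_estimate(sample):
    """For S = {1, ..., |S|}: the sample maximum blown up by n/(n-1),
    as in the abstract. Valid for any growing sample size."""
    n = len(sample)
    return max(sample) * n / (n - 1)

random.seed(0)
true_size = 10_000
sample = [random.randint(1, true_size) for _ in range(500)]  # n ~ |S|^(1/2) scale
print(birthday_estimate(sample))  # high-variance estimate of 10000
print(max_estimate(sample))       # tight estimate, but needs S = {1,...,|S|}
```

The contrast in accuracy between the two printed estimates illustrates why extra structure on $S$ (here, that it is an integer range) buys so much: the maximum-based estimator concentrates tightly, while the collision count fluctuates heavily at this sample size.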
Related papers
- Dimension-free Private Mean Estimation for Anisotropic Distributions [55.86374912608193]
Previous private estimators on distributions over $\mathbb{R}^d$ suffer from a curse of dimensionality.
We present an algorithm whose sample complexity has improved dependence on dimension.
arXiv Detail & Related papers (2024-11-01T17:59:53Z) - Sum-of-squares lower bounds for Non-Gaussian Component Analysis [33.80749804695003]
Non-Gaussian Component Analysis (NGCA) is the statistical task of finding a non-Gaussian direction in a high-dimensional dataset.
Here we study the complexity of NGCA in the Sum-of-Squares framework.
arXiv Detail & Related papers (2024-10-28T18:19:13Z) - Measuring quantum relative entropy with finite-size effect [53.64687146666141]
We study the estimation of relative entropy $D(\rho\|\sigma)$ when $\sigma$ is known. Our estimator attains the Cramér-Rao type bound when the dimension $d$ is fixed.
arXiv Detail & Related papers (2024-06-25T06:07:20Z) - The Sample Complexity of Simple Binary Hypothesis Testing [7.127829790714167]
The sample complexity of simple binary hypothesis testing is the smallest number of i.i.d. samples required to distinguish between two distributions $p$ and $q$ in either setting. We derive a formula that characterizes the sample complexity (up to multiplicative constants that are independent of $p$, $q$, and all error parameters) for: (i) all $0 \le \alpha, \beta \le 1/8$ in the prior-free setting; and (ii) all $\delta \le \pi/4$ in the Bayesian setting.
arXiv Detail & Related papers (2024-03-25T17:42:32Z) - Minimax Optimality of Score-based Diffusion Models: Beyond the Density Lower Bound Assumptions [11.222970035173372]
We show that a kernel-based score estimator achieves an optimal mean square error of $\widetilde{O}\left(n^{-1} t^{-\frac{d+2}{2}} (t^{\frac{d}{2}} \vee 1)\right)$, and derive a $\widetilde{O}\left(n^{-1/2} t^{-\frac{d}{4}}\right)$ upper bound for the total variation error of the distribution of the sample generated by the diffusion model under a mere sub-Gaussian assumption.
arXiv Detail & Related papers (2024-02-23T20:51:31Z) - Estimating the Mixing Coefficients of Geometrically Ergodic Markov Processes [5.00389879175348]
We estimate the individual $\beta$-mixing coefficients of a real-valued geometrically ergodic Markov process from a single sample-path.
Naturally no density assumptions are required in this setting; the expected error rate is shown to be of order $\mathcal{O}(\log(n)\, n^{-1/2})$.
arXiv Detail & Related papers (2024-02-11T20:17:10Z) - Distribution-Independent Regression for Generalized Linear Models with Oblivious Corruptions [49.69852011882769]
We show the first algorithms for the problem of regression for generalized linear models (GLMs) in the presence of additive oblivious noise.
We present an algorithm that tackles this problem in its most general distribution-independent setting.
This is the first algorithmic result for GLM regression with oblivious noise which can handle more than half the samples being arbitrarily corrupted.
arXiv Detail & Related papers (2023-09-20T21:41:59Z) - On the Multidimensional Random Subset Sum Problem [0.9007371440329465]
In the Random Subset Sum Problem, given $n$ i.i.d. random variables $X_1, \ldots, X_n$, we wish to approximate any point $z \in [-1,1]$ as the sum of a subset $X_{i_1(z)}, \ldots, X_{i_s(z)}$ of them, up to error $\varepsilon$.
We prove that, in $d$ dimensions, $n = O\left(d^3 \log \frac{1}{\varepsilon} \cdot \ldots\right)$
arXiv Detail & Related papers (2022-07-28T08:10:43Z) - The Sample Complexity of Robust Covariance Testing [56.98280399449707]
We are given i.i.d. samples from a distribution of the form $Z = (1-\epsilon) X + \epsilon B$, where $X$ is a zero-mean and unknown covariance Gaussian $\mathcal{N}(0, \Sigma)$.
In the absence of contamination, prior work gave a simple tester for this hypothesis testing task that uses $O(d)$ samples.
We prove a sample complexity lower bound of $\Omega(d^2)$ for $\epsilon$ an arbitrarily small constant and $\gamma$
arXiv Detail & Related papers (2020-12-31T18:24:41Z) - Optimal Testing of Discrete Distributions with High Probability [49.19942805582874]
We study the problem of testing discrete distributions with a focus on the high probability regime.
We provide the first algorithms for closeness and independence testing that are sample-optimal, within constant factors.
arXiv Detail & Related papers (2020-09-14T16:09:17Z) - Efficient Statistics for Sparse Graphical Models from Truncated Samples [19.205541380535397]
We focus on two fundamental and classical problems: (i) inference of sparse Gaussian graphical models and (ii) support recovery of sparse linear models.
For sparse linear regression, suppose samples $(\mathbf{x}, y)$ are generated where $y = \mathbf{x}^\top \Omega^* + \mathcal{N}(0,1)$ and $(\mathbf{x}, y)$ is seen only if $y$ belongs to a truncation set $S \subseteq \mathbb{R}^d$.
arXiv Detail & Related papers (2020-06-17T09:21:00Z) - Locally Private Hypothesis Selection [96.06118559817057]
We output a distribution from $\mathcal{Q}$ whose total variation distance to $p$ is comparable to the best such distribution.
We show that the constraint of local differential privacy incurs an exponential increase in cost.
Our algorithms result in exponential improvements on the round complexity of previous methods.
arXiv Detail & Related papers (2020-02-21T18:30:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.