On the Statistical Complexity of Sample Amplification
- URL: http://arxiv.org/abs/2201.04315v2
- Date: Wed, 18 Sep 2024 02:56:59 GMT
- Title: On the Statistical Complexity of Sample Amplification
- Authors: Brian Axelrod, Shivam Garg, Yanjun Han, Vatsal Sharan, Gregory Valiant
- Abstract summary: We show when, given $n$ i.i.d. samples drawn from an unknown distribution $P$, it is possible to produce a larger set of $n+m$ samples which cannot be distinguished from $n+m$ i.i.d. samples drawn from $P$.
Our techniques apply to a large class of distributions including the exponential family, and establish a rigorous connection between sample amplification and distribution learning.
- Score: 26.78883878113411
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The "sample amplification" problem formalizes the following question: Given $n$ i.i.d. samples drawn from an unknown distribution $P$, when is it possible to produce a larger set of $n+m$ samples which cannot be distinguished from $n+m$ i.i.d. samples drawn from $P$? In this work, we provide a firm statistical foundation for this problem by deriving generally applicable amplification procedures, lower bound techniques and connections to existing statistical notions. Our techniques apply to a large class of distributions including the exponential family, and establish a rigorous connection between sample amplification and distribution learning.
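To make the amplification-learning connection concrete, here is a minimal sketch in Python of a naive "learn then generate" amplifier for a one-dimensional Gaussian $N(\mu, 1)$ with unknown mean: fit the mean from the $n$ observed samples, then append $m$ fresh draws from the fitted distribution. The function `amplify_gaussian` and its parameters are illustrative assumptions, not the paper's procedure; the paper's amplifiers are more refined and cover a much broader class of distributions.

```python
import numpy as np

def amplify_gaussian(samples: np.ndarray, m: int, rng=None) -> np.ndarray:
    """Naive learn-then-generate amplifier for N(mu, 1) with unknown mu.

    Illustrates the amplification-via-learning connection only; this is
    NOT the amplification procedure from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    mu_hat = samples.mean()                   # plug-in estimate of the unknown mean
    extra = rng.normal(mu_hat, 1.0, size=m)   # m synthetic draws from N(mu_hat, 1)
    return np.concatenate([samples, extra])   # n + m samples handed to the verifier

# Example: amplify n = 100 samples by m = 10.
rng = np.random.default_rng(0)
original = rng.normal(loc=2.5, scale=1.0, size=100)
amplified = amplify_gaussian(original, m=10, rng=rng)
print(amplified.shape)  # (110,)
```

How many extra samples such a scheme can get away with depends on how well $\hat{\mu}$ concentrates around $\mu$; quantifying the largest admissible $m$ as a function of $n$ is exactly the statistical question the paper studies.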
Related papers
- Wedge Sampling: Efficient Tensor Completion with Nearly-Linear Sample Complexity [9.42598427201735]
We introduce Wedge Sampling, a new non-adaptive sampling scheme for low-rank tensor completion.
We study recovery of an order-$k$ low-rank tensor of dimension $n \times \cdots \times n$ from a subset of its entries.
arXiv Detail & Related papers (2026-02-05T16:47:13Z) - The Sample Complexity of Membership Inference and Privacy Auditing [7.865646654746587]
A membership-inference attack is given the output of a learning algorithm and a target individual, and tries to determine whether this individual is a member of the training data or an independent sample from the same distribution.
A successful membership-inference attack typically requires the attacker to have some knowledge about the distribution that the training data was sampled from.
We show that for membership inference in this setting, $\Omega(n + n^2\rho^2)$ samples can be necessary to carry out any attack that competes with a fully informed attacker.
arXiv Detail & Related papers (2025-08-26T22:19:28Z) - A Distributional-Lifting Theorem for PAC Learning [16.985620991607345]
Distributional assumptions facilitate the design of efficient algorithms but limit their reach and relevance.
Recent work of Blanc, Lange, Malik, and Tan considered the special case of lifting uniform-distribution learners.
We show that their approach is information-theoretically intractable with access only to random examples.
We then take a different approach that sidesteps the need to learn $D^\star$, yielding a lifter that works in the standard PAC model.
arXiv Detail & Related papers (2025-06-19T23:28:38Z) - Efficiently learning and sampling multimodal distributions with data-based initialization [20.575122468674536]
We consider the problem of sampling a multimodal distribution with a Markov chain given a small number of samples from the stationary measure.
We show that if the Markov chain has a $k$-th order spectral gap, then initialization from samples from the stationary distribution will efficiently generate a sample whose conditional law is $\varepsilon$-close in TV distance to the stationary measure.
arXiv Detail & Related papers (2024-11-14T01:37:02Z) - Sum-of-squares lower bounds for Non-Gaussian Component Analysis [33.80749804695003]
Non-Gaussian Component Analysis (NGCA) is the statistical task of finding a non-Gaussian direction in a high-dimensional dataset.
Here we study the complexity of NGCA in the Sum-of-Squares framework.
arXiv Detail & Related papers (2024-10-28T18:19:13Z) - Large Language Monkeys: Scaling Inference Compute with Repeated Sampling [81.34900892130929]
We explore inference compute as another axis for scaling, using the simple technique of repeatedly sampling candidate solutions from a model.
Across multiple tasks and models, we observe that coverage scales with the number of samples over four orders of magnitude.
In domains like coding and formal proofs, where answers can be automatically verified, these increases in coverage directly translate into improved performance (a toy sketch of this sampling loop appears after this list).
arXiv Detail & Related papers (2024-07-31T17:57:25Z) - Which Pretrain Samples to Rehearse when Finetuning Pretrained Models? [60.59376487151964]
Fine-tuning pretrained models on specific tasks is now the de facto approach for text and vision tasks.
A known pitfall of this approach is the forgetting of pretraining knowledge that happens during finetuning.
We propose a novel sampling scheme, mix-cd, that identifies and prioritizes samples that actually face forgetting.
arXiv Detail & Related papers (2024-02-12T22:32:12Z) - Simple and effective data augmentation for compositional generalization [64.00420578048855]
We show that data augmentation methods that sample MRs and backtranslate them can be effective for compositional generalization.
Remarkably, sampling from a uniform distribution performs almost as well as sampling from the test distribution.
arXiv Detail & Related papers (2024-01-18T09:13:59Z) - The sample complexity of multi-distribution learning [17.45683822446751]
We show that an algorithm of sample complexity $\widetilde{O}((d+k)\epsilon^{-2}) \cdot (k/\epsilon)^{o(1)}$ resolves the COLT 2023 open problem of Awasthi, Haghtalab and Zhao.
arXiv Detail & Related papers (2023-12-07T03:53:17Z) - On-Demand Sampling: Learning Optimally from Multiple Distributions [63.20009081099896]
Social and real-world considerations have given rise to multi-distribution learning paradigms.
We establish the optimal sample complexity of these learning paradigms and give algorithms that meet this sample complexity.
Our algorithm design and analysis are enabled by our extensions of online learning techniques for solving zero-sum games.
arXiv Detail & Related papers (2022-10-22T19:07:26Z) - Bias Mimicking: A Simple Sampling Approach for Bias Mitigation [57.17709477668213]
We introduce a new class-conditioned sampling method: Bias Mimicking.
Bias Mimicking improves the accuracy of sampling methods on underrepresented groups by 3% over four benchmarks.
arXiv Detail & Related papers (2022-09-30T17:33:00Z) - The Sample Complexity of Robust Covariance Testing [56.98280399449707]
We are given i.i.d. samples from a distribution of the form $Z = (1-\epsilon) X + \epsilon B$, where $X$ is a zero-mean Gaussian $\mathcal{N}(0, \Sigma)$ with unknown covariance $\Sigma$.
In the absence of contamination, prior work gave a simple tester for this hypothesis testing task that uses $O(d)$ samples.
We prove a sample complexity lower bound of $\Omega(d^2)$ for $\epsilon$ an arbitrarily small constant and $\gamma$
arXiv Detail & Related papers (2020-12-31T18:24:41Z) - Optimal Testing of Discrete Distributions with High Probability [49.19942805582874]
We study the problem of testing discrete distributions with a focus on the high probability regime.
We provide the first algorithms for closeness and independence testing that are sample-optimal, within constant factors.
arXiv Detail & Related papers (2020-09-14T16:09:17Z) - Weighted Empirical Risk Minimization: Sample Selection Bias Correction
based on Importance Sampling [2.599882743586164]
We consider statistical learning problems in which the distribution $P'$ of the training observations $Z'_i$ differs from the distribution $P$ involved in the risk one seeks to minimize.
We show that, in various situations frequently encountered in practice, the importance function $\Phi(z)$ takes a simple form and can be directly estimated from the data.
We then prove that the generalization capacity of the aforementioned approach is preserved when plugging the resulting estimates of the $\Phi(Z'_i)$'s into the weighted empirical risk.
arXiv Detail & Related papers (2020-02-12T18:42:47Z)
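Several of the listed papers revolve around drawing repeated samples; the repeated-sampling idea behind "Large Language Monkeys", for instance, fits in a few lines. The sketch below is a hypothetical illustration, not the paper's code: `model_sample` and `verify` are stand-in callables for a generator and an automatic verifier, and coverage is the chance that at least one of $k$ candidates verifies.

```python
import random
from typing import Callable

def coverage(model_sample: Callable[[], str],
             verify: Callable[[str], bool],
             k: int) -> bool:
    """Return True if any of k independently sampled candidates verifies.

    Stand-in for the repeated-sampling scheme: coverage grows with k
    when each draw has an independent chance of success.
    """
    return any(verify(model_sample()) for _ in range(k))

# Toy stand-in: each "sample" succeeds with probability 0.05, so
# coverage over k = 100 draws is roughly 1 - 0.95**100 ~ 0.994.
rng = random.Random(0)
hits = sum(
    coverage(lambda: "ok" if rng.random() < 0.05 else "fail",
             lambda s: s == "ok",
             k=100)
    for _ in range(1000)
)
print(hits / 1000)  # empirical coverage estimate
```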