How Much is Unseen Depends Chiefly on Information About the Seen
- URL: http://arxiv.org/abs/2402.05835v1
- Date: Thu, 8 Feb 2024 17:12:49 GMT
- Title: How Much is Unseen Depends Chiefly on Information About the Seen
- Authors: Seongmin Lee and Marcel Böhme
- Abstract summary: We find that the proportion of data points in an unknown population that belong to classes not appearing in the training data is almost entirely determined by the numbers $f_k$ of classes that appear in the training data exactly $k$ times. We develop a genetic algorithm that, given only the sample, searches for an estimator with minimal mean-squared error (MSE).
- Score: 2.169081345816618
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: It might seem counter-intuitive at first: we find that, in expectation, the proportion of data points in an unknown population that belong to classes not appearing in the training data is almost entirely determined by the numbers $f_k$ of classes that do appear in the training data exactly $k$ times. While in theory we show that the error of the induced estimator decays exponentially in the size of the sample, in practice its high variance prevents us from using it directly as an estimator of the sample coverage. However, our precise characterization of the dependency among the $f_k$'s induces a large search space of different representations of the expected value, each of which can be deterministically instantiated as an estimator. Hence, we turn to optimization and develop a genetic algorithm that, given only the sample, searches for an estimator with minimal mean-squared error (MSE). In our experiments, our genetic algorithm discovers estimators that have a substantially smaller MSE than the state-of-the-art Good-Turing estimator. This holds in over 96% of runs when there are at least as many samples as classes, and our estimators' MSE is roughly 80% of the Good-Turing estimator's.
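For intuition, the quantities the paper builds on are simple to compute from a sample: $f_k$ is the number of classes observed exactly $k$ times, and the classical Good-Turing estimate of the missing mass (the expected proportion of the population belonging to unseen classes) is $f_1/n$. A minimal sketch of this baseline on a toy population follows; the paper's genetic-algorithm estimators are not reproduced here:

```python
from collections import Counter
import random

def freq_of_freqs(sample):
    """Map k -> f_k, the number of classes appearing exactly k times."""
    class_counts = Counter(sample)            # occurrences per class
    return Counter(class_counts.values())     # frequency of frequencies

def good_turing_missing_mass(sample):
    """Good-Turing estimate of the unseen proportion: f_1 / n."""
    return freq_of_freqs(sample)[1] / len(sample)

# Toy skewed population: class i contributes i + 1 data points.
population = [i for i in range(100) for _ in range(i + 1)]
sample = random.sample(population, 200)
seen = set(sample)
true_unseen = sum(1 for x in population if x not in seen) / len(population)
print(f"Good-Turing estimate: {good_turing_missing_mass(sample):.3f}")
print(f"True unseen proportion: {true_unseen:.3f}")
```

The paper's observation is that the expected unseen proportion admits many algebraically different representations in terms of the $f_k$'s; each representation instantiates a different estimator, and the genetic algorithm searches this space for the one with minimal MSE.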
Related papers
- Collaborative Learning with Different Labeling Functions [7.228285747845779]
We study a variant of Collaborative PAC Learning, in which we aim to learn an accurate classifier for each of the $n$ data distributions.
We show that, when the data distributions satisfy a weaker realizability assumption, sample-efficient learning is still feasible.
arXiv Detail & Related papers (2024-02-16T04:32:22Z)
- Nearest Neighbour Score Estimators for Diffusion Generative Models [16.189734871742743]
We introduce a novel nearest neighbour score function estimator which utilizes multiple samples from the training set to dramatically decrease estimator variance.
In diffusion models, we show that our estimator can replace a learned network for probability-flow ODE integration, opening promising new avenues of future research.
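As a rough sketch of the underlying idea (with assumed parameters, not the paper's exact estimator): for a Gaussian-smoothed empirical distribution, the score at a query point is a softmax-weighted average over training points, and restricting the sum to the query's nearest neighbours keeps the cost low while still pooling multiple samples:

```python
import numpy as np

def knn_score_estimate(x, train, sigma=0.1, k=16):
    """Score of a Gaussian-smoothed empirical distribution at x, restricted
    to the k nearest training points (a sketch, not the paper's estimator):
    grad log p_sigma(x) = sum_i w_i (y_i - x) / sigma^2, with w_i a softmax
    of -||x - y_i||^2 / (2 sigma^2)."""
    d2 = np.sum((train - x) ** 2, axis=1)     # squared distances to all points
    nn = np.argsort(d2)[:k]                   # indices of k nearest neighbours
    logits = -d2[nn] / (2 * sigma ** 2)
    w = np.exp(logits - logits.max())
    w /= w.sum()                              # softmax weights over neighbours
    return (w @ (train[nn] - x)) / sigma ** 2

train = np.random.randn(1000, 2)
print(knn_score_estimate(np.zeros(2), train))
```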
arXiv Detail & Related papers (2024-02-12T19:27:30Z)
- Detecting Adversarial Data by Probing Multiple Perturbations Using Expected Perturbation Score [62.54911162109439]
Adversarial detection aims to determine whether a given sample is an adversarial one based on the discrepancy between natural and adversarial distributions.
We propose a new statistic called expected perturbation score (EPS), which is essentially the expected score of a sample after various perturbations.
We develop EPS-based maximum mean discrepancy (MMD) as a metric to measure the discrepancy between the test sample and natural samples.
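EPS presupposes a pretrained diffusion score model, so only the generic ingredients are sketched below: averaging a score function over random perturbations of an input, and a squared RBF-kernel MMD between two sets of such statistics. `score_fn` is a hypothetical stand-in for a real score network:

```python
import numpy as np

def expected_perturbation_score(x, score_fn, n_perturb=32, noise=0.1):
    """Average score_fn over Gaussian perturbations of x (generic sketch;
    score_fn is a stand-in for a pretrained diffusion score network)."""
    xs = x + noise * np.random.randn(n_perturb, *x.shape)
    return np.mean([score_fn(xi) for xi in xs], axis=0)

def rbf_mmd2(X, Y, gamma=1.0):
    """Squared maximum mean discrepancy under an RBF kernel."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()
```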
arXiv Detail & Related papers (2023-05-25T13:14:58Z)
- Robust Sparse Mean Estimation via Incremental Learning [15.536082641659423]
In this paper, we study the problem of robust mean estimation, where the goal is to estimate a $k$-sparse mean from a collection of partially corrupted samples.
We present a simple mean estimator that overcomes both challenges under moderate conditions.
Our method does not need any prior knowledge of the sparsity level $k$.
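For context, a naive baseline (not the paper's method, and, unlike it, requiring $k$ up front) combines a coordinate-wise median, which tolerates a corrupted fraction of samples, with hard thresholding to the $k$ largest-magnitude coordinates:

```python
import numpy as np

def naive_sparse_robust_mean(samples, k):
    """Coordinate-wise median plus hard thresholding to the k largest
    entries; a naive baseline for robust k-sparse mean estimation, not
    the paper's incremental-learning estimator."""
    med = np.median(samples, axis=0)       # robust to corrupted rows
    keep = np.argsort(np.abs(med))[-k:]    # k largest-magnitude coordinates
    out = np.zeros_like(med)
    out[keep] = med[keep]
    return out

rng = np.random.default_rng(0)
mu = np.zeros(100)
mu[:5] = 3.0                               # 5-sparse true mean
data = mu + rng.normal(size=(200, 100))
data[:20] += 50.0                          # 10% of samples corrupted
print(np.nonzero(naive_sparse_robust_mean(data, k=5))[0])  # -> [0 1 2 3 4]
```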
arXiv Detail & Related papers (2023-05-24T16:02:28Z)
- Learning to Estimate Without Bias [57.82628598276623]
The Gauss-Markov theorem states that the weighted least squares estimator is the linear minimum variance unbiased estimator (MVUE) in linear models.
In this paper, we take a first step towards extending this result to nonlinear settings via deep learning with bias constraints.
A second motivation for the bias-constrained estimator (BCE) is in applications where multiple estimates of the same unknown are averaged for improved performance.
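The linear baseline being generalized is standard: in the model $y = X\beta + \varepsilon$ with known noise precisions $w$, the weighted least squares estimate is $\hat\beta = (X^\top W X)^{-1} X^\top W y$. A minimal numpy sketch of this baseline (the paper's deep-learning extension is not reproduced):

```python
import numpy as np

def weighted_least_squares(X, y, w):
    """WLS estimate (X^T W X)^{-1} X^T W y with diagonal weights w; by the
    Gauss-Markov theorem this is the minimum-variance linear unbiased
    estimator when the noise covariance is diag(1 / w)."""
    XtW = X.T * w                          # X.T @ diag(w), via broadcasting
    return np.linalg.solve(XtW @ X, XtW @ y)

X = np.random.randn(100, 3)
beta = np.array([1.0, -2.0, 0.5])
w = np.random.uniform(0.5, 2.0, size=100)  # known noise precisions
y = X @ beta + np.random.randn(100) / np.sqrt(w)
print(weighted_least_squares(X, y, w))     # close to [1.0, -2.0, 0.5]
```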
arXiv Detail & Related papers (2021-10-24T10:23:51Z)
- SLOE: A Faster Method for Statistical Inference in High-Dimensional Logistic Regression [68.66245730450915]
We develop an improved method for debiasing predictions and estimating frequentist uncertainty for practical datasets.
Our main contribution is SLOE, an estimator of the signal strength with convergence guarantees that reduces the computation time of estimation and inference by orders of magnitude.
arXiv Detail & Related papers (2021-03-23T17:48:56Z)
- DEMI: Discriminative Estimator of Mutual Information [5.248805627195347]
Estimating mutual information between continuous random variables is often intractable and challenging for high-dimensional data.
Recent progress has leveraged neural networks to optimize variational lower bounds on mutual information.
Our approach is based on training a classifier that provides the probability that a data sample pair is drawn from the joint distribution.
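The density-ratio trick behind this is standard, so the general recipe can be sketched (not DEMI's exact model): train a classifier to separate joint pairs $(x, y)$ from pairs with $y$ shuffled; with balanced classes, its log-odds approximate $\log \frac{p(x,y)}{p(x)\,p(y)}$, and averaging them over joint pairs estimates the mutual information. The quadratic features below are an assumption chosen so a linear classifier suffices for the Gaussian example:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pair_features(x, y):
    # Quadratic features let a linear classifier represent the Gaussian
    # log density ratio; in general a neural network would be used instead.
    return np.hstack([x, y, x * y, x ** 2, y ** 2])

def classifier_mi_estimate(x, y):
    """Density-ratio MI estimate (a generic sketch, not DEMI itself):
    classify joint vs. shuffled pairs; the mean log-odds on joint pairs
    approximate I(X;Y) in nats."""
    y_shuf = y[np.random.permutation(len(y))]
    feats = np.vstack([pair_features(x, y), pair_features(x, y_shuf)])
    labels = np.r_[np.ones(len(x)), np.zeros(len(x))]
    clf = LogisticRegression(max_iter=1000).fit(feats, labels)
    return clf.decision_function(pair_features(x, y)).mean()

rho, n = 0.8, 20000
x = np.random.randn(n, 1)
y = rho * x + np.sqrt(1 - rho ** 2) * np.random.randn(n, 1)
print(classifier_mi_estimate(x, y))        # estimate
print(-0.5 * np.log(1 - rho ** 2))         # true MI, about 0.511 nats
```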
arXiv Detail & Related papers (2020-10-05T04:19:27Z)
- $\gamma$-ABC: Outlier-Robust Approximate Bayesian Computation Based on a Robust Divergence Estimator [95.71091446753414]
We propose to use a nearest-neighbor-based $\gamma$-divergence estimator as a data discrepancy measure.
Our method achieves significantly higher robustness than existing discrepancy measures.
arXiv Detail & Related papers (2020-06-13T06:09:27Z)
- Instability, Computational Efficiency and Statistical Accuracy [101.32305022521024]
We develop a framework that yields statistical accuracy based on the interplay between the deterministic convergence rate of the algorithm at the population level and its degree of (in)stability when applied to an empirical object based on $n$ samples.
We provide applications of our general results to several concrete classes of models, including Gaussian mixture estimation, non-linear regression models, and informative non-response models.
arXiv Detail & Related papers (2020-05-22T22:30:52Z)
- Nonparametric Estimation of the Fisher Information and Its Applications [82.00720226775964]
This paper considers the problem of estimation of the Fisher information for location from a random sample of size $n$.
An estimator proposed by Bhattacharya is revisited and improved convergence rates are derived.
A new estimator, termed a clipped estimator, is proposed.
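A plug-in version of the quantity being estimated is easy to sketch: the Fisher information for location is $I(f) = \mathbb{E}[(f'(X)/f(X))^2]$, so one can estimate $f$ and $f'$ with a Gaussian kernel density estimate and clip the resulting score before averaging its square. This is a generic illustration under assumed bandwidth and clipping parameters, not necessarily the paper's clipped estimator:

```python
import numpy as np

def fisher_information_location(sample, bandwidth=0.3, clip=10.0):
    """Plug-in estimate of I(f) = E[(f'(X)/f(X))^2] via a Gaussian KDE,
    with the estimated score clipped for stability (a generic sketch;
    the paper's clipped estimator may be constructed differently)."""
    x = np.asarray(sample)
    u = (x[:, None] - x[None, :]) / bandwidth          # pairwise scaled diffs
    kern = np.exp(-0.5 * u ** 2)                       # unnormalized Gaussian
    f_hat = kern.mean(axis=1)                          # proportional to KDE f
    fprime_hat = (-u * kern).mean(axis=1) / bandwidth  # proportional to f'
    score = np.clip(fprime_hat / f_hat, -clip, clip)   # constants cancel
    return np.mean(score ** 2)

sample = np.random.randn(5000)
print(fisher_information_location(sample))  # true value for N(0,1) is 1
```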
arXiv Detail & Related papers (2020-05-07T17:21:56Z)
- Quantifying With Only Positive Training Data [0.5735035463793008]
Quantification is the research field that studies methods for counting the number of data points that belong to each class in an unlabeled sample.
This article closes the gap between Positive and Unlabeled Learning (PUL) and One-class Quantification (OCQ).
We compare our method, Passive Aggressive Threshold (PAT), against PUL methods and show that PAT generally is the fastest and most accurate algorithm.
arXiv Detail & Related papers (2020-04-22T01:18:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.