How Much is Unseen Depends Chiefly on Information About the Seen
- URL: http://arxiv.org/abs/2402.05835v2
- Date: Sun, 09 Mar 2025 20:56:37 GMT
- Title: How Much is Unseen Depends Chiefly on Information About the Seen
- Authors: Seongmin Lee, Marcel Böhme,
- Abstract summary: We find that in expectation the missing mass is entirely determined by the number $f_k$ of classes that do appear in the training data.<n>While this is the first precise characterization of the expected missing mass in terms of the sample, the induced estimator suffers from an impractically high variance.
- Score: 14.365105289625399
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The missing mass refers to the proportion of data points in an unknown population of classifier inputs that belong to classes not present in the classifier's training data, which is assumed to be a random sample from that unknown population. We find that in expectation the missing mass is entirely determined by the number $f_k$ of classes that do appear in the training data the same number of times and an exponentially decaying error. While this is the first precise characterization of the expected missing mass in terms of the sample, the induced estimator suffers from an impractically high variance. However, our theory suggests a large search space of nearly unbiased estimators that can be searched effectively and efficiently. Hence, we cast distribution-free estimation as an optimization problem to find a distribution-specific estimator with a minimized mean-squared error (MSE), given only the sample. In our experiments, our search algorithm discovers estimators that have a substantially smaller MSE than the state-of-the-art Good-Turing estimator. This holds for over 93% of runs when there are at least as many samples as classes. Our estimators' MSE is roughly 80% of the Good-Turing estimator's.
Related papers
- Model Evaluation in the Dark: Robust Classifier Metrics with Missing Labels [2.384873896423002]
We propose a multiple imputation technique for evaluating classifiers using metrics such as precision, recall, and ROC-AUC.
We empirically show that the predictive distribution's location and shape are generally correct, even in the Missing Not At Random regime.
arXiv Detail & Related papers (2025-04-25T14:31:42Z) - Estimating Uncertainty with Implicit Quantile Network [0.0]
Uncertainty quantification is an important part of many performance critical applications.
This paper provides a simple alternative to existing approaches such as ensemble learning and bayesian neural networks.
arXiv Detail & Related papers (2024-08-26T13:33:14Z) - On Efficient and Statistical Quality Estimation for Data Annotation [11.216738303463751]
Annotated datasets are an essential ingredient to train, evaluate, compare and productionalize supervised machine learning models.
Quality estimation is often performed by having experts manually label instances as correct or incorrect.
Basing estimates on small sample sizes, however, can lead to imprecise values for the error rate.
We show that acceptance sampling can reduce the required sample sizes up to 50% while providing the same statistical guarantees.
arXiv Detail & Related papers (2024-05-20T09:57:29Z) - Probabilistic Contrastive Learning for Long-Tailed Visual Recognition [78.70453964041718]
Longtailed distributions frequently emerge in real-world data, where a large number of minority categories contain a limited number of samples.
Recent investigations have revealed that supervised contrastive learning exhibits promising potential in alleviating the data imbalance.
We propose a novel probabilistic contrastive (ProCo) learning algorithm that estimates the data distribution of the samples from each class in the feature space.
arXiv Detail & Related papers (2024-03-11T13:44:49Z) - Collaborative Learning with Different Labeling Functions [7.228285747845779]
We study a variant of Collaborative PAC Learning, in which we aim to learn an accurate classifier for each of the $n$ data distributions.
We show that, when the data distributions satisfy a weaker realizability assumption, sample-efficient learning is still feasible.
arXiv Detail & Related papers (2024-02-16T04:32:22Z) - Nearest Neighbour Score Estimators for Diffusion Generative Models [16.189734871742743]
We introduce a novel nearest neighbour score function estimator which utilizes multiple samples from the training set to dramatically decrease estimator variance.
In diffusion models, we show that our estimator can replace a learned network for probability-flow ODE integration, opening promising new avenues of future research.
arXiv Detail & Related papers (2024-02-12T19:27:30Z) - Detecting Adversarial Data by Probing Multiple Perturbations Using
Expected Perturbation Score [62.54911162109439]
Adversarial detection aims to determine whether a given sample is an adversarial one based on the discrepancy between natural and adversarial distributions.
We propose a new statistic called expected perturbation score (EPS), which is essentially the expected score of a sample after various perturbations.
We develop EPS-based maximum mean discrepancy (MMD) as a metric to measure the discrepancy between the test sample and natural samples.
arXiv Detail & Related papers (2023-05-25T13:14:58Z) - Robust Sparse Mean Estimation via Incremental Learning [15.536082641659423]
In this paper, we study the problem of robust mean estimation, where the goal is to estimate a $k$-sparse mean from a collection of partially corrupted samples.
We present a simple mean estimator that overcomes both challenges under moderate conditions.
Our method does not need any prior knowledge of the sparsity level $k$.
arXiv Detail & Related papers (2023-05-24T16:02:28Z) - A Statistical Model for Predicting Generalization in Few-Shot
Classification [6.158812834002346]
We introduce a Gaussian model of the feature distribution to predict the generalization error.
We show that our approach outperforms alternatives such as the leave-one-out cross-validation strategy.
arXiv Detail & Related papers (2022-12-13T10:21:15Z) - The Optimal Noise in Noise-Contrastive Learning Is Not What You Think [80.07065346699005]
We show that deviating from this assumption can actually lead to better statistical estimators.
In particular, the optimal noise distribution is different from the data's and even from a different family.
arXiv Detail & Related papers (2022-03-02T13:59:20Z) - Learning to Estimate Without Bias [57.82628598276623]
Gauss theorem states that the weighted least squares estimator is a linear minimum variance unbiased estimation (MVUE) in linear models.
In this paper, we take a first step towards extending this result to non linear settings via deep learning with bias constraints.
A second motivation to BCE is in applications where multiple estimates of the same unknown are averaged for improved performance.
arXiv Detail & Related papers (2021-10-24T10:23:51Z) - Sampling from Arbitrary Functions via PSD Models [55.41644538483948]
We take a two-step approach by first modeling the probability distribution and then sampling from that model.
We show that these models can approximate a large class of densities concisely using few evaluations, and present a simple algorithm to effectively sample from these models.
arXiv Detail & Related papers (2021-10-20T12:25:22Z) - Unrolling Particles: Unsupervised Learning of Sampling Distributions [102.72972137287728]
Particle filtering is used to compute good nonlinear estimates of complex systems.
We show in simulations that the resulting particle filter yields good estimates in a wide range of scenarios.
arXiv Detail & Related papers (2021-10-06T16:58:34Z) - SLOE: A Faster Method for Statistical Inference in High-Dimensional
Logistic Regression [68.66245730450915]
We develop an improved method for debiasing predictions and estimating frequentist uncertainty for practical datasets.
Our main contribution is SLOE, an estimator of the signal strength with convergence guarantees that reduces the computation time of estimation and inference by orders of magnitude.
arXiv Detail & Related papers (2021-03-23T17:48:56Z) - DEMI: Discriminative Estimator of Mutual Information [5.248805627195347]
Estimating mutual information between continuous random variables is often intractable and challenging for high-dimensional data.
Recent progress has leveraged neural networks to optimize variational lower bounds on mutual information.
Our approach is based on training a classifier that provides the probability that a data sample pair is drawn from the joint distribution.
arXiv Detail & Related papers (2020-10-05T04:19:27Z) - $\gamma$-ABC: Outlier-Robust Approximate Bayesian Computation Based on a
Robust Divergence Estimator [95.71091446753414]
We propose to use a nearest-neighbor-based $gamma$-divergence estimator as a data discrepancy measure.
Our method achieves significantly higher robustness than existing discrepancy measures.
arXiv Detail & Related papers (2020-06-13T06:09:27Z) - Instability, Computational Efficiency and Statistical Accuracy [101.32305022521024]
We develop a framework that yields statistical accuracy based on interplay between the deterministic convergence rate of the algorithm at the population level, and its degree of (instability) when applied to an empirical object based on $n$ samples.
We provide applications of our general results to several concrete classes of models, including Gaussian mixture estimation, non-linear regression models, and informative non-response models.
arXiv Detail & Related papers (2020-05-22T22:30:52Z) - Nonparametric Estimation of the Fisher Information and Its Applications [82.00720226775964]
This paper considers the problem of estimation of the Fisher information for location from a random sample of size $n$.
An estimator proposed by Bhattacharya is revisited and improved convergence rates are derived.
A new estimator, termed a clipped estimator, is proposed.
arXiv Detail & Related papers (2020-05-07T17:21:56Z) - Quantifying With Only Positive Training Data [0.5735035463793008]
Quantification is the research field that studies methods for counting the number of data points that belong to each class in an unlabeled sample.
This article closes the gap between Positive and Unlabeled Learning (PUL) and One-class Quantification (OCQ)
We compare our method, Passive Aggressive Threshold (PAT), against PUL methods and show that PAT generally is the fastest and most accurate algorithm.
arXiv Detail & Related papers (2020-04-22T01:18:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.