Nonuniform Negative Sampling and Log Odds Correction with Rare Events Data
- URL: http://arxiv.org/abs/2110.13048v1
- Date: Mon, 25 Oct 2021 15:37:22 GMT
- Title: Nonuniform Negative Sampling and Log Odds Correction with Rare Events Data
- Authors: HaiYing Wang, Aonan Zhang, Chong Wang
- Abstract summary: We investigate the issue of parameter estimation with nonuniform negative sampling for imbalanced data.
We derive a general inverse probability weighted (IPW) estimator and obtain the optimal sampling probability that minimizes its variance.
Both theoretical and empirical results demonstrate the effectiveness of our method.
- Score: 15.696653979226113
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We investigate the issue of parameter estimation with nonuniform negative
sampling for imbalanced data. We first prove that, with imbalanced data, the
available information about unknown parameters is only tied to the relatively
small number of positive instances, which justifies the use of negative
sampling. However, if the negative instances are subsampled to the same level
as the positive cases, there is information loss. To retain more information,
we derive the asymptotic distribution of a general inverse probability weighted
(IPW) estimator and obtain the optimal sampling probability that minimizes its
variance. To further improve the estimation efficiency over the IPW method, we
propose a likelihood-based estimator by correcting log odds for the sampled
data and prove that the improved estimator has the smallest asymptotic variance
among a large class of estimators. It is also more robust to pilot
misspecification. We validate our approach on simulated data as well as a real
click-through rate dataset with more than 0.3 trillion instances, collected
over a period of a month. Both theoretical and empirical results demonstrate
the effectiveness of our method.
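To make the two estimators concrete, below is a minimal sketch in Python (numpy + statsmodels), not the authors' code: keep all positives, subsample negatives nonuniformly, then fit either an IPW-weighted logistic regression or an unweighted one with a log-odds offset. The pilot-based acceptance probabilities `pi` are an illustrative stand-in, not the paper's variance-minimizing rule.

```python
# Sketch of nonuniform negative sampling with (a) IPW and (b) log-odds
# correction. Illustrative only: `pi` is a pilot-based stand-in, not the
# paper's optimal variance-minimizing sampling probability.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulate imbalanced data: a large negative intercept makes positives rare.
n, d = 200_000, 5
X = sm.add_constant(rng.normal(size=(n, d)))
beta_true = np.r_[-6.0, np.ones(d)]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

# A pilot model on a small uniform subsample scores the negatives.
idx = rng.choice(n, 5_000, replace=False)
pilot = sm.GLM(y[idx], X[idx], family=sm.families.Binomial()).fit()
score = pilot.predict(X)

# Acceptance probability per negative, scaled to keep roughly five negatives
# per positive; the floor keeps the IPW weights bounded.
pi = np.clip(5 * y.mean() * score / score.mean(), 1e-4, 1.0)
keep = (y == 1) | (rng.uniform(size=n) < pi)   # always keep positives
Xs, ys, pis = X[keep], y[keep], pi[keep]

# (a) IPW: weight each sampled negative by 1/pi (weighted log-likelihood).
w = np.where(ys == 1, 1.0, 1.0 / pis)
ipw = sm.GLM(ys, Xs, family=sm.families.Binomial(), freq_weights=w).fit()

# (b) Log-odds correction: under this scheme the log odds of the retained
# data shift by -log(pi(x)), so an unweighted fit with offset -log(pi)
# recovers the full-data parameters.
corrected = sm.GLM(ys, Xs, family=sm.families.Binomial(),
                   offset=-np.log(pis)).fit()

print("truth    :", beta_true)
print("IPW      :", np.round(ipw.params, 2))
print("corrected:", np.round(corrected.params, 2))
```

On simulated data like this, the corrected fit (b) typically shows smaller variability in the slope estimates than (a), which is the efficiency gain the abstract claims.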
Related papers
- Efficient semi-supervised inference for logistic regression under case-control studies [3.5485531932219243]
We consider an inference problem in semi-supervised settings where the outcome in the labeled data is binary.
Case-control sampling is an effective scheme for alleviating the imbalance structure in binary data.
We find that, with unlabeled data available, the intercept parameter can be identified in the semi-supervised learning setting.
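For intuition, the classical case-control result (not this paper's semi-supervised estimator) says that outcome-dependent sampling leaves the slopes intact and shifts only the intercept by a known constant once the population prevalence is known, which unlabeled data can supply. A minimal sketch, assuming a prevalence estimate `tau` is available:

```python
import numpy as np

def correct_intercept(b0_cc, n1, n0, tau):
    """Classical case-control intercept correction (prior shift).

    b0_cc : intercept of a logistic fit on n1 cases and n0 controls
    tau   : population prevalence P(Y=1), e.g. estimated from unlabeled data
    Slopes need no adjustment; only the intercept shifts.
    """
    return b0_cc - np.log(n1 / n0) + np.log(tau / (1 - tau))
```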
arXiv Detail & Related papers (2024-02-23T14:55:58Z)
- Detecting Adversarial Data by Probing Multiple Perturbations Using Expected Perturbation Score [62.54911162109439]
Adversarial detection aims to determine whether a given sample is an adversarial one based on the discrepancy between natural and adversarial distributions.
We propose a new statistic called expected perturbation score (EPS), which is essentially the expected score of a sample after various perturbations.
We develop EPS-based maximum mean discrepancy (MMD) as a metric to measure the discrepancy between the test sample and natural samples.
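The EPS statistic itself requires a pretrained diffusion score model, which is out of scope here, but the MMD test it feeds into is easy to sketch; a minimal RBF-kernel version (a hypothetical helper, not the authors' code):

```python
import numpy as np

def mmd2_rbf(a, b, sigma=1.0):
    """Squared MMD (biased V-statistic) between feature sets a and b.

    In the paper's setting, a and b would hold EPS features of test and
    natural samples; larger values suggest an adversarial input.
    """
    def gram(x, y):
        d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return gram(a, a).mean() + gram(b, b).mean() - 2.0 * gram(a, b).mean()
```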
arXiv Detail & Related papers (2023-05-25T13:14:58Z)
- Rethinking Collaborative Metric Learning: Toward an Efficient Alternative without Negative Sampling [156.7248383178991]
The Collaborative Metric Learning (CML) paradigm has attracted wide interest in the area of recommendation systems (RS).
We find that negative sampling would lead to a biased estimation of the generalization error.
Motivated by this, we propose an efficient alternative without negative sampling for CML, named Sampling-Free Collaborative Metric Learning (SFCML).
arXiv Detail & Related papers (2022-06-23T08:50:22Z)
- Near-optimal inference in adaptive linear regression [60.08422051718195]
Even simple methods like least squares can exhibit non-normal behavior when data is collected in an adaptive manner.
We propose a family of online debiasing estimators to correct these distributional anomalies in least squares estimation.
We demonstrate the usefulness of our theory via applications to multi-armed bandit, autoregressive time series estimation, and active learning with exploration.
arXiv Detail & Related papers (2021-07-05T21:05:11Z)
- Rethinking InfoNCE: How Many Negative Samples Do You Need? [54.146208195806636]
We study how many negative samples are optimal for InfoNCE in different scenarios via a semi-quantitative theoretical framework.
We estimate the optimal negative sampling ratio using the $K$ value that maximizes the training effectiveness function.
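As a reference point for what is being tuned, here is a minimal numpy version of the InfoNCE objective with K negatives (the paper's training effectiveness function for choosing K is not reproduced here):

```python
import numpy as np

def info_nce(q, pos, negs, tau=0.1):
    """InfoNCE loss for one query q, one positive key, and K negative keys.

    All vectors are L2-normalized; the number of rows in `negs` is the K
    whose optimal value the paper studies.
    """
    keys = np.vstack([pos, negs])
    keys /= np.linalg.norm(keys, axis=1, keepdims=True)
    logits = keys @ (q / np.linalg.norm(q)) / tau
    # cross-entropy with the positive key at index 0 (stable log-sum-exp)
    m = logits.max()
    return np.log(np.exp(logits - m).sum()) + m - logits[0]
```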
arXiv Detail & Related papers (2021-05-27T08:38:29Z)
- Maximum sampled conditional likelihood for informative subsampling [4.708378681950648]
Subsampling is a computationally effective approach to extract information from massive data sets when computing resources are limited.
We propose to use the maximum sampled conditional likelihood estimator (MSCLE) based on the sampled data.
arXiv Detail & Related papers (2020-11-11T16:01:17Z)
- DEMI: Discriminative Estimator of Mutual Information [5.248805627195347]
Estimating mutual information between continuous random variables is often intractable and challenging for high-dimensional data.
Recent progress has leveraged neural networks to optimize variational lower bounds on mutual information.
Our approach is based on training a classifier that provides the probability that a data sample pair is drawn from the joint distribution.
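The density-ratio trick behind this is compact enough to sketch. Below, a linear classifier stands in for the paper's neural discriminator (so it only captures linear dependence), and all names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def classifier_mi(x, y, rng):
    """Classifier-based MI estimate in nats (density-ratio trick sketch).

    x, y : arrays of shape (n, d_x) and (n, d_y). Real pairs (x_i, y_i) are
    labeled 1, shuffled pairs (x_i, y_perm(i)) are labeled 0; with balanced
    classes the classifier's log odds on real pairs approximate
    log p(x, y) / (p(x) p(y)), whose mean is the mutual information.
    """
    joint = np.hstack([x, y])
    marginal = np.hstack([x, y[rng.permutation(len(y))]])
    pairs = np.vstack([joint, marginal])
    labels = np.r_[np.ones(len(x)), np.zeros(len(x))]
    clf = LogisticRegression(max_iter=1000).fit(pairs, labels)
    p = clf.predict_proba(joint)[:, 1]
    return float(np.mean(np.log(p / (1.0 - p))))
```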
arXiv Detail & Related papers (2020-10-05T04:19:27Z)
- Evaluating representations by the complexity of learning low-loss predictors [55.94170724668857]
We consider the problem of evaluating representations of data for use in solving a downstream task.
We propose to measure the quality of a representation by the complexity of learning a predictor on top of the representation that achieves low loss on a task of interest.
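One way to operationalize "complexity of learning" is to trace validation loss as a function of training-set size; a minimal sketch with a logistic probe (names and setup are illustrative, not the paper's exact measure):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def loss_data_curve(z_tr, y_tr, z_val, y_val, sizes):
    """Validation loss of a simple probe trained on growing prefixes of the
    embedded data; representations that reach low loss from few examples
    score better under complexity-of-learning style evaluations."""
    losses = []
    for m in sizes:
        probe = LogisticRegression(max_iter=1000).fit(z_tr[:m], y_tr[:m])
        losses.append(log_loss(y_val, probe.predict_proba(z_val)))
    return losses
```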
arXiv Detail & Related papers (2020-09-15T22:06:58Z)
- Estimating Gradients for Discrete Random Variables by Sampling without Replacement [93.09326095997336]
We derive an unbiased estimator for expectations over discrete random variables based on sampling without replacement.
We show that our estimator can be derived as the Rao-Blackwellization of three different estimators.
arXiv Detail & Related papers (2020-02-14T14:15:18Z)
- Unbiased and Efficient Log-Likelihood Estimation with Inverse Binomial Sampling [9.66840768820136]
Inverse binomial sampling (IBS) can estimate the log-likelihood of an entire data set efficiently and without bias.
IBS produces lower error in the estimated parameters and maximum log-likelihood values than alternative sampling methods.
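The IBS estimator is simple to state: for each trial, draw synthetic responses from the model until one matches the observed response; if the first match occurs at draw K, then -sum_{k=1}^{K-1} 1/k is an unbiased estimate of that trial's log-likelihood. A minimal sketch, where `simulate` and `responses` are hypothetical placeholders:

```python
def ibs_loglik(simulate, responses):
    """Unbiased log-likelihood estimate via inverse binomial sampling.

    simulate(i) must draw one synthetic response for trial i from the model
    being evaluated; responses[i] is the observed response. The per-trial
    cost is random, roughly 1 / p_i draws on average.
    """
    total = 0.0
    for i, r in enumerate(responses):
        k = 1
        while simulate(i) != r:   # sample until the first match
            total -= 1.0 / k      # harmonic increments give unbiasedness
            k += 1
    return total
```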
arXiv Detail & Related papers (2020-01-12T19:51:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.