Logistic Regression for Massive Data with Rare Events
- URL: http://arxiv.org/abs/2006.00683v1
- Date: Mon, 1 Jun 2020 03:09:49 GMT
- Title: Logistic Regression for Massive Data with Rare Events
- Authors: HaiYing Wang
- Abstract summary: This paper studies binary logistic regression for rare events data, or imbalanced data, where the number of events is significantly smaller than the number of nonevents.
We show that the available information in rare events data is at the scale of the number of events instead of the full data sample size.
- Score: 4.09920839425892
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies binary logistic regression for rare events data, or
imbalanced data, where the number of events (observations in one class, often
called cases) is significantly smaller than the number of nonevents
(observations in the other class, often called controls). We first derive the
asymptotic distribution of the maximum likelihood estimator (MLE) of the
unknown parameter, which shows that the asymptotic variance converges to
zero at the rate of the inverse of the number of events instead of the
inverse of the full data sample size. This indicates that the available
information in rare events data is at the scale of the number of events instead
of the full data sample size. Furthermore, we prove that when under-sampling a
small proportion of the nonevents, the resulting under-sampled estimator may
have an asymptotic distribution identical to that of the full data MLE. This demonstrates the
advantage of under-sampling nonevents for rare events data, because this
procedure may significantly reduce the computation and/or data collection
costs. Another common practice in analyzing rare events data is to over-sample
(replicate) the events, which has a higher computational cost. We show that
this procedure may even result in efficiency loss in terms of parameter
estimation.
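The under-sampling claim in the abstract can be illustrated with a small simulation. The sketch below is not the paper's code: the simulated data, the retention probability `rho`, and the helper names `fit_logistic` and `undersample_nonevents` are all assumptions made for illustration. It keeps every event, keeps each nonevent with probability `rho`, fits an unweighted logistic regression on the sub-sample, and then applies the standard case-control intercept correction (adding `log(rho)`), which the abstract itself does not spell out.

```python
import numpy as np


def fit_logistic(X, y, n_iter=100, tol=1e-10):
    """Maximum likelihood logistic regression via Newton's method.

    Assumes column 0 of X is the intercept column of ones.
    """
    theta = np.zeros(X.shape[1])
    # Start the intercept at the log-odds of the event rate; this keeps
    # Newton's method stable when events are rare.
    ybar = y.mean()
    theta[0] = np.log(ybar / (1.0 - ybar))
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))
        grad = X.T @ (y - p)                         # score vector
        hess = (X * (p * (1.0 - p))[:, None]).T @ X  # observed information
        step = np.linalg.solve(hess, grad)
        theta += step
        if np.max(np.abs(step)) < tol:
            break
    return theta


def undersample_nonevents(X, y, rho, rng):
    """Keep every event; keep each nonevent independently with probability rho."""
    keep = (y == 1) | (rng.random(y.shape[0]) < rho)
    return X[keep], y[keep]


rng = np.random.default_rng(0)

# Simulated rare events data (illustrative values, not taken from the paper).
n = 200_000
theta_true = np.array([-6.0, 1.0])  # intercept, slope
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X @ theta_true)))).astype(float)

# Full-data MLE.
theta_full = fit_logistic(X, y)

# Under-sampled estimator: unweighted fit on the sub-sample, then shift the
# intercept by log(rho) to undo the change in the event/nonevent odds.
rho = 0.05
X_sub, y_sub = undersample_nonevents(X, y, rho, rng)
theta_sub = fit_logistic(X_sub, y_sub)
theta_sub[0] += np.log(rho)

print("events:", int(y.sum()), "of", n)
print("sub-sample size:", y_sub.shape[0])
print("full-data MLE:  ", theta_full)
print("under-sampled:  ", theta_sub)
```

In this setup the sub-sample is a small fraction of the full data, yet the two estimates should land close to each other, which is consistent with the abstract's point that the information sits at the scale of the number of events.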
Related papers
- Risk and cross validation in ridge regression with correlated samples [72.59731158970894]
We provide sharp asymptotics for the in- and out-of-sample risks of ridge regression when the data points have arbitrary correlations.
We further extend our analysis to the case where the test point has non-trivial correlations with the training set, a setting often encountered in time series forecasting.
We validate our theory across a variety of high dimensional data.
arXiv Detail & Related papers (2024-08-08T17:27:29Z)
- Evaluating the Role of Data Enrichment Approaches Towards Rare Event Analysis in Manufacturing [1.3980986259786223]
Rare events are occurrences that take place with a significantly lower frequency than more common regular events.
In manufacturing, predicting such events is particularly important, as they lead to unplanned downtime, shortening equipment lifespan, and high energy consumption.
This paper evaluates the role of data enrichment techniques combined with supervised machine-learning techniques for rare event detection and prediction.
arXiv Detail & Related papers (2024-07-01T00:05:56Z)
- Towards Dynamic Causal Discovery with Rare Events: A Nonparametric Conditional Independence Test [4.67306371596399]
We introduce a novel statistical independence test on data collected from time-invariant systems in which rare but consequential events occur.
We provide non-asymptotic sample bounds for the consistency of our method, and validate its performance across various simulated and real-world datasets.
arXiv Detail & Related papers (2022-11-29T21:15:51Z)
- Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
- Equivariance Allows Handling Multiple Nuisance Variables When Analyzing Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show how bringing recent results on equivariant representation learning instantiated on structured spaces together with simple use of classical results on causal inference provides an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z)
- Combining Observational and Randomized Data for Estimating Heterogeneous Treatment Effects [82.20189909620899]
Estimating heterogeneous treatment effects is an important problem across many domains.
Currently, most existing works rely exclusively on observational data.
We propose to estimate heterogeneous treatment effects by combining large amounts of observational data and small amounts of randomized data.
arXiv Detail & Related papers (2022-02-25T18:59:54Z)
- Nonuniform Negative Sampling and Log Odds Correction with Rare Events Data [15.696653979226113]
We investigate the issue of parameter estimation with nonuniform negative sampling for imbalanced data.
We derive a general inverse probability weighted (IPW) estimator and obtain the optimal sampling probability that minimizes its variance.
Both theoretical and empirical results demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2021-10-25T15:37:22Z)
- RIFLE: Imputation and Robust Inference from Low Order Marginals [10.082738539201804]
We develop a statistical inference framework for regression and classification in the presence of missing data without imputation.
Our framework, RIFLE, estimates low-order moments of the underlying data distribution with corresponding confidence intervals to learn a distributionally robust model.
Our experiments demonstrate that RIFLE outperforms other benchmark algorithms when the percentage of missing values is high and/or when the number of data points is relatively small.
arXiv Detail & Related papers (2021-09-01T23:17:30Z)
- SLOE: A Faster Method for Statistical Inference in High-Dimensional Logistic Regression [68.66245730450915]
We develop an improved method for debiasing predictions and estimating frequentist uncertainty for practical datasets.
Our main contribution is SLOE, an estimator of the signal strength with convergence guarantees that reduces the computation time of estimation and inference by orders of magnitude.
arXiv Detail & Related papers (2021-03-23T17:48:56Z)
- Efficient Causal Inference from Combined Observational and Interventional Data through Causal Reductions [68.6505592770171]
Unobserved confounding is one of the main challenges when estimating causal effects.
We propose a novel causal reduction method that replaces an arbitrary number of possibly high-dimensional latent confounders.
We propose a learning algorithm to estimate the parameterized reduced model jointly from observational and interventional data.
arXiv Detail & Related papers (2021-03-08T14:29:07Z)
- Unbiased and Efficient Log-Likelihood Estimation with Inverse Binomial Sampling [9.66840768820136]
Inverse binomial sampling (IBS) can estimate the log-likelihood of an entire data set efficiently and without bias.
IBS produces lower error in the estimated parameters and maximum log-likelihood values than alternative sampling methods.
arXiv Detail & Related papers (2020-01-12T19:51:35Z)