Optimal Downsampling for Imbalanced Classification with Generalized Linear Models
- URL: http://arxiv.org/abs/2410.08994v1
- Date: Fri, 11 Oct 2024 17:08:13 GMT
- Title: Optimal Downsampling for Imbalanced Classification with Generalized Linear Models
- Authors: Yan Chen, Jose Blanchet, Krzysztof Dembczynski, Laura Fee Nern, Aaron Flores
- Abstract summary: We study optimal downsampling for imbalanced classification using generalized linear models (GLMs).
We propose a pseudo maximum likelihood estimator and study its asymptotic normality in the context of increasingly imbalanced populations.
- Score: 6.14486033794703
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Downsampling, or under-sampling, is a technique used in the context of large and highly imbalanced classification models. We study optimal downsampling for imbalanced classification using generalized linear models (GLMs). We propose a pseudo maximum likelihood estimator and study its asymptotic normality in the context of increasingly imbalanced populations relative to an increasingly large sample size. We provide theoretical guarantees for the introduced estimator. Additionally, we compute the optimal downsampling rate using a criterion that balances statistical accuracy and computational efficiency. Our numerical experiments, conducted on both synthetic and empirical data, further validate our theoretical results and demonstrate that the introduced estimator outperforms commonly available alternatives.
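The paper's pseudo maximum likelihood estimator is not reproduced here, but a minimal sketch of the classical recipe it builds on (downsample the majority class of a logistic regression, a GLM, then shift the fitted intercept by the log of the sampling rate) illustrates the setting. The synthetic data, the rate `r`, and all names below are illustrative assumptions, not the paper's construction:

```python
# Minimal sketch (not the paper's estimator): fit logistic regression on
# negatives kept at rate r, then correct the intercept by log(r).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic imbalanced data: a large negative intercept makes positives rare.
n, beta, b0 = 200_000, np.array([1.0, -1.0, 0.5]), -5.0
X = rng.normal(size=(n, 3))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ beta + b0))))

# Keep every positive; keep each negative independently with probability r.
r = 0.05
keep = (y == 1) | (rng.random(n) < r)

# Near-unpenalized fit on the downsampled data (large C disables shrinkage).
model = LogisticRegression(C=1e6, max_iter=1000).fit(X[keep], y[keep])

# Dropping negatives at rate r inflates the odds by 1/r, so adding log(r)
# back to the intercept restores approximately calibrated probabilities.
b0_corrected = model.intercept_[0] + np.log(r)
print(model.coef_[0], b0_corrected)  # close to beta and b0
```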
Related papers
- An optimal transport approach for selecting a representative subsample with application in efficient kernel density estimation [21.632131776088084]
Subsampling methods aim to select a subsample as a surrogate for the observed sample.
Existing model-free subsampling methods are usually built upon clustering techniques or kernel tricks.
We propose a novel model-free subsampling method by utilizing optimal transport techniques.
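As a hedged illustration of this flavor of method (not the authors' algorithm), one can score random candidate subsamples by their optimal transport cost to the full sample and keep the best; the POT library, the sizes, and all names are assumptions for the sketch:

```python
# Illustrative sketch only: pick, among random candidates, the subsample
# whose optimal transport cost to the full sample is smallest.
import numpy as np
import ot  # POT: Python Optimal Transport

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))    # full sample
m, n_candidates = 100, 50         # subsample size, candidates tried (illustrative)

def ot_cost(sub, full):
    # Uniform weights on both point clouds; squared Euclidean ground cost.
    a = np.full(len(sub), 1.0 / len(sub))
    b = np.full(len(full), 1.0 / len(full))
    M = ot.dist(sub, full)        # pairwise squared distances
    return ot.emd2(a, b, M)       # exact optimal transport cost

best = min(
    (rng.choice(len(X), size=m, replace=False) for _ in range(n_candidates)),
    key=lambda idx: ot_cost(X[idx], X),
)
subsample = X[best]
```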
arXiv Detail & Related papers (2022-05-31T05:19:29Z)
- Imbalanced Classification via a Tabular Translation GAN [4.864819846886142]
We present a model based on Generative Adversarial Networks which uses additional regularization losses to map majority samples to corresponding synthetic minority samples.
We show that the proposed method improves average precision when compared to alternative re-weighting and oversampling techniques.
arXiv Detail & Related papers (2022-04-19T06:02:53Z)
- Heavy-tailed Streaming Statistical Estimation [58.70341336199497]
We consider the task of heavy-tailed statistical estimation given streaming $p$-dimensional samples.
We design a clipped gradient descent and provide an improved analysis under a more nuanced condition on the noise of gradients.
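A minimal sketch of the clipped-gradient idea, on streaming mean estimation with heavy-tailed noise (the paper's algorithm, step sizes, and analysis are more refined; the threshold and distribution below are assumptions):

```python
# Sketch: streaming mean estimation with clipped stochastic gradients,
# which tames heavy-tailed noise that plain averaging handles poorly.
import numpy as np

rng = np.random.default_rng(0)
p, T, clip = 10, 50_000, 5.0            # dimension, steps, clip threshold (illustrative)
theta = np.zeros(p)                     # running estimate of the mean

for t in range(1, T + 1):
    x = rng.standard_t(df=2.5, size=p)  # heavy-tailed sample, true mean 0
    g = theta - x                       # gradient of 0.5 * ||theta - x||^2
    norm = np.linalg.norm(g)
    if norm > clip:                     # clip the gradient norm
        g *= clip / norm
    theta -= g / t                      # decaying step size 1/t

print(theta)                            # close to the true mean, despite outliers
```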
arXiv Detail & Related papers (2021-08-25T21:30:27Z)
- Near-optimal inference in adaptive linear regression [60.08422051718195]
Even simple methods like least squares can exhibit non-normal behavior when data is collected in an adaptive manner.
We propose a family of online debiasing estimators to correct these distributional anomalies in least squares estimation.
We demonstrate the usefulness of our theory via applications to multi-armed bandit, autoregressive time series estimation, and active learning with exploration.
arXiv Detail & Related papers (2021-07-05T21:05:11Z)
- Maximum sampled conditional likelihood for informative subsampling [4.708378681950648]
Subsampling is a computationally effective approach to extract information from massive data sets when computing resources are limited.
We propose to use the maximum sampled conditional likelihood estimator (MSCLE) based on the sampled data.
arXiv Detail & Related papers (2020-11-11T16:01:17Z)
- Why resampling outperforms reweighting for correcting sampling bias with stochastic gradients [10.860844636412862]
Training machine learning models on biased data sets requires correction techniques to compensate for the bias.
We consider two commonly-used techniques, resampling and reweighting, that rebalance the proportions of the subgroups to maintain the desired objective function.
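A compact sketch of the two corrections being compared, on a toy stochastic-gradient mean estimation problem; the group proportions, step size, and helper names are illustrative assumptions:

```python
# Sketch: rebalancing two subgroups in stochastic gradient steps, either
# by reweighting each example's gradient or by resampling the minibatch.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
group = (rng.random(n) < 0.1).astype(int)     # minority group ~10% of the data
x = rng.normal(loc=np.where(group == 1, 2.0, 0.0))
w_target = np.array([0.5, 0.5])               # desired (balanced) group proportions
p_data = np.array([0.9, 0.1])                 # observed group proportions

def reweighted_grad(theta, idx):
    # Reweighting: sample uniformly, scale gradients by target/observed ratio.
    w = w_target[group[idx]] / p_data[group[idx]]
    return np.mean(w * (theta - x[idx]))

def resampled_grad(theta, batch_size):
    # Resampling: draw the minibatch itself with the target group proportions.
    probs = w_target[group] / (p_data[group] * n)
    idx = rng.choice(n, size=batch_size, p=probs / probs.sum())
    return np.mean(theta - x[idx])

theta = 0.0
for t in range(5000):
    idx = rng.choice(n, size=32)
    theta -= 0.01 * reweighted_grad(theta, idx)  # swap in resampled_grad to compare
print(theta)                                     # ~1.0, the balanced mean of the groups
```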
arXiv Detail & Related papers (2020-09-28T16:12:38Z)
- Improving Maximum Likelihood Training for Text Generation with Density Ratio Estimation [51.091890311312085]
We propose a new training scheme for auto-regressive sequence generative models, which is effective and stable when operating in the large sample space encountered in text generation.
Our method stably outperforms Maximum Likelihood Estimation and other state-of-the-art sequence generative models in terms of both quality and diversity.
arXiv Detail & Related papers (2020-07-12T15:31:24Z)
- Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z)
- Detangling robustness in high dimensions: composite versus model-averaged estimation [11.658462692891355]
Robust methods, though ubiquitous in practice, are yet to be fully understood in the context of regularized estimation and high dimensions.
This paper provides a toolbox to further study robustness in these settings and focuses on prediction.
arXiv Detail & Related papers (2020-06-12T20:40:15Z)
- Compressing Large Sample Data for Discriminant Analysis [78.12073412066698]
We consider the computational issues due to large sample size within the discriminant analysis framework.
We propose a new compression approach for reducing the number of training samples for linear and quadratic discriminant analysis.
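A hedged sketch of one simple compression flavor (the paper's scheme is more principled): replace each class's training samples with a small set of k-means representatives before fitting the discriminant; the sizes and `k` are illustrative assumptions:

```python
# Sketch (not the paper's method): compress each class to k-means
# centroids, then fit LDA on the small representative set.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (10_000, 5)),
               rng.normal(1.0, 1.0, (10_000, 5))])
y = np.repeat([0, 1], 10_000)

k = 100  # representatives per class (illustrative)
Xc = np.vstack([KMeans(n_clusters=k, n_init=4, random_state=0)
                .fit(X[y == c]).cluster_centers_ for c in (0, 1)])
yc = np.repeat([0, 1], k)

lda = LinearDiscriminantAnalysis().fit(Xc, yc)  # trained on 200 points
print(lda.score(X, y))                          # evaluated on all 20,000
```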
arXiv Detail & Related papers (2020-05-08T05:09:08Z)
- Asymptotic Analysis of an Ensemble of Randomly Projected Linear Discriminants [94.46276668068327]
In [1], an ensemble of randomly projected linear discriminants is used to classify datasets.
We develop a consistent estimator of the misclassification probability as an alternative to the computationally-costly cross-validation estimator.
We also demonstrate the use of our estimator for tuning the projection dimension on both real and synthetic data.
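A minimal sketch of the classifier under study: draw several random Gaussian projections, fit a linear discriminant in each projected space, and take a majority vote; the dimensions, ensemble size, and synthetic data are assumptions:

```python
# Sketch: ensemble of randomly projected linear discriminants (majority vote).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n, p, d, B = 2000, 100, 10, 25                # samples, ambient dim, projected dim, ensemble size
X = rng.normal(size=(n, p))
y = (X[:, :3].sum(axis=1) + 0.5 * rng.normal(size=n) > 0).astype(int)

votes = np.zeros(n)
for _ in range(B):
    R = rng.normal(size=(p, d)) / np.sqrt(d)  # random Gaussian projection
    clf = LinearDiscriminantAnalysis().fit(X @ R, y)
    votes += clf.predict(X @ R)

pred = (votes > B / 2).astype(int)            # majority vote across the ensemble
print((pred == y).mean())                     # training accuracy of the ensemble
```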
arXiv Detail & Related papers (2020-04-17T12:47:04Z)