Sampling To Improve Predictions For Underrepresented Observations In
Imbalanced Data
- URL: http://arxiv.org/abs/2111.09065v2
- Date: Thu, 18 Nov 2021 16:25:28 GMT
- Title: Sampling To Improve Predictions For Underrepresented Observations In
Imbalanced Data
- Authors: Rune D. Kjærsgaard, Manja G. Grønberg, Line K. H. Clemmensen
- Abstract summary: Data imbalance negatively impacts the predictive performance of models on underrepresented observations.
We propose sampling to adjust for this imbalance with the goal of improving the performance of models trained on historical production data.
We apply our methods on a large biopharmaceutical manufacturing data set from an advanced simulation of penicillin production.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data imbalance is common in production data, where controlled production
settings require data to fall within a narrow range of variation and data are
collected with quality assessment in mind, rather than data analytic insights.
This imbalance negatively impacts the predictive performance of models on
underrepresented observations. We propose sampling to adjust for this imbalance
with the goal of improving the performance of models trained on historical
production data. We investigate the use of three sampling approaches to adjust
for imbalance. The goal is to downsample the covariates in the training data
and subsequently fit a regression model. We investigate how the predictive
power of the model changes when using either the sampled or the original data
for training. We apply our methods on a large biopharmaceutical manufacturing
data set from an advanced simulation of penicillin production and find that
fitting a model using the sampled data gives a small reduction in the overall
predictive performance, but yields a systematically better performance on
underrepresented observations. In addition, the results emphasize the need for
alternative, fair, and balanced model evaluations.
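The workflow described in the abstract (downsample the covariates, fit a regression model, then compare it with a model trained on the original data) can be illustrated with a small sketch. The abstract does not say which three sampling approaches are used, so the equal-width bin capping below is a hypothetical stand-in, not the authors' method. The synthetic data, the function names (`make_data`, `bin_downsample`, `fit_line`, `mse`), and the parameters `n_bins` and `cap` are likewise assumptions made for illustration only.

```python
import random
from collections import defaultdict


def make_data(n_dense=2000, n_rare=60, seed=1):
    """Synthetic imbalanced data (assumption): most x values sit in a narrow band."""
    rng = random.Random(seed)
    x = [rng.gauss(0.5, 0.03) for _ in range(n_dense)]
    x += [rng.uniform(0.0, 1.0) for _ in range(n_rare)]
    y = [xi ** 2 + rng.gauss(0.0, 0.01) for xi in x]
    return x, y


def bin_downsample(x, y, n_bins=10, cap=20, seed=0):
    """Keep at most `cap` observations per equal-width bin of the covariate x.

    Hypothetical stand-in for the paper's sampling approaches.
    """
    rng = random.Random(seed)
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant covariate
    bins = defaultdict(list)
    for i, xi in enumerate(x):
        b = min(int((xi - lo) / width), n_bins - 1)
        bins[b].append(i)
    keep = []
    for idx in bins.values():
        keep.extend(idx if len(idx) <= cap else rng.sample(idx, cap))
    keep.sort()
    return [x[i] for i in keep], [y[i] for i in keep]


def fit_line(x, y):
    """Closed-form ordinary least squares for y = a + b * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def mse(model, x, y):
    """Mean squared error of the fitted line (a, b) on the given observations."""
    a, b = model
    return sum((a + b * xi - yi) ** 2 for xi, yi in zip(x, y)) / len(x)


if __name__ == "__main__":
    x, y = make_data()
    xs, ys = bin_downsample(x, y)
    # Treat observations far from the dense band as "underrepresented".
    rare = [(xi, yi) for xi, yi in zip(x, y) if abs(xi - 0.5) > 0.2]
    rx, ry = [p[0] for p in rare], [p[1] for p in rare]
    for name, model in [("original", fit_line(x, y)), ("sampled", fit_line(xs, ys))]:
        print(name, "overall MSE:", mse(model, x, y), "rare MSE:", mse(model, rx, ry))
```

Reporting the error on the underrepresented subset separately from the overall error mirrors the abstract's point that an aggregate metric alone can hide poor performance on rare observations. For brevity the sketch scores both models on the training data itself; a held-out test set would be needed for a sound comparison.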
Related papers
- On conditional diffusion models for PDE simulations [53.01911265639582]
We study score-based diffusion models for forecasting and assimilation of sparse observations.
We propose an autoregressive sampling approach that significantly improves performance in forecasting.
We also propose a new training strategy for conditional score-based models that achieves stable performance over a range of history lengths.
arXiv Detail & Related papers (2024-10-21T18:31:04Z)
- Learning Augmentation Policies from A Model Zoo for Time Series Forecasting [58.66211334969299]
We introduce AutoTSAug, a learnable data augmentation method based on reinforcement learning.
By augmenting the marginal samples with a learnable policy, AutoTSAug substantially improves forecasting performance.
arXiv Detail & Related papers (2024-09-10T07:34:19Z)
- Outlier Gradient Analysis: Efficiently Identifying Detrimental Training Samples for Deep Learning Models [36.05242956018461]
In this paper, we establish a bridge between identifying detrimental training samples via influence functions and outlier gradient detection.
We first validate the hypothesis of our proposed outlier gradient analysis approach on synthetic datasets.
We then demonstrate its effectiveness in detecting mislabeled samples in vision models and selecting data samples for improving performance of natural language processing transformer models.
arXiv Detail & Related papers (2024-05-06T21:34:46Z)
- Distilled Datamodel with Reverse Gradient Matching [74.75248610868685]
We introduce an efficient framework for assessing data impact, comprising offline training and online evaluation stages.
Our proposed method achieves comparable model behavior evaluation while significantly speeding up the process compared to the direct retraining method.
arXiv Detail & Related papers (2024-04-22T09:16:14Z)
- An Experimental Study on the Rashomon Effect of Balancing Methods in Imbalanced Classification [0.0]
This paper examines the impact of balancing methods on predictive multiplicity using the Rashomon effect.
This matters because blindly selecting one model from a set of approximately equally accurate models is risky in data-centric AI.
arXiv Detail & Related papers (2024-03-22T13:08:22Z)
- TRIAGE: Characterizing and auditing training data for improved regression [80.11415390605215]
We introduce TRIAGE, a novel data characterization framework tailored to regression tasks and compatible with a broad class of regressors.
TRIAGE utilizes conformal predictive distributions to provide a model-agnostic scoring method, the TRIAGE score.
We show that TRIAGE's characterization is consistent and highlight its utility to improve performance via data sculpting/filtering, in multiple regression settings.
arXiv Detail & Related papers (2023-10-29T10:31:59Z)
- The Effect of Balancing Methods on Model Behavior in Imbalanced Classification Problems [4.370097023410272]
Imbalanced data poses a challenge in classification as model performance is affected by insufficient learning from minority classes.
This study addresses a more challenging aspect of balancing methods - their impact on model behavior.
To capture these changes, Explainable Artificial Intelligence tools are used to compare models trained on datasets before and after balancing.
arXiv Detail & Related papers (2023-06-30T22:25:01Z)
- Rethinking InfoNCE: How Many Negative Samples Do You Need? [54.146208195806636]
We study how many negative samples are optimal for InfoNCE in different scenarios via a semi-quantitative theoretical framework.
We estimate the optimal negative sampling ratio using the $K$ value that maximizes the training effectiveness function.
arXiv Detail & Related papers (2021-05-27T08:38:29Z)
- Balance-Subsampled Stable Prediction [55.13512328954456]
We propose a novel balance-subsampled stable prediction (BSSP) algorithm based on the theory of fractional factorial design.
A design-theoretic analysis shows that the proposed method can reduce the confounding effects among predictors induced by the distribution shift.
Numerical experiments on both synthetic and real-world data sets demonstrate that our BSSP algorithm significantly outperforms the baseline methods for stable prediction across unknown test data.
arXiv Detail & Related papers (2020-06-08T07:01:38Z)
- Symbolic Regression Driven by Training Data and Prior Knowledge [0.0]
In symbolic regression, the search for analytic models is driven purely by the prediction error observed on the training data samples.
We propose a multi-objective symbolic regression approach that is driven by both the training data and the prior knowledge of the properties the desired model should manifest.
arXiv Detail & Related papers (2020-04-24T19:15:06Z)