Related papers: Restoring balance: principled under/oversampling of data for optimal classification

URL: http://arxiv.org/abs/2405.09535v1
Date: Wed, 15 May 2024 17:45:34 GMT
Title: Restoring balance: principled under/oversampling of data for optimal classification
Authors: Emanuele Loffredo, Mauro Pastore, Simona Cocco, Rémi Monasson
Abstract summary: Class imbalance in real-world data poses a common bottleneck for machine learning tasks. Mitigation strategies, such as under or oversampling the data depending on their abundances, are routinely proposed and tested empirically. We provide a sharp prediction of the effects of under/oversampling strategies depending on class imbalance, first and second moments of the data, and the metrics of performance considered.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Class imbalance in real-world data poses a common bottleneck for machine learning tasks, since achieving good generalization on under-represented examples is often challenging. Mitigation strategies, such as under or oversampling the data depending on their abundances, are routinely proposed and tested empirically, but how they should adapt to the data statistics remains poorly understood. In this work, we determine exact analytical expressions of the generalization curves in the high-dimensional regime for linear classifiers (Support Vector Machines). We also provide a sharp prediction of the effects of under/oversampling strategies depending on class imbalance, first and second moments of the data, and the metrics of performance considered. We show that mixed strategies involving under and oversampling of data lead to performance improvement. Through numerical experiments, we show the relevance of our theoretical predictions on real datasets, on deeper architectures and with sampling strategies based on unsupervised probabilistic models.
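To make the idea of a mixed strategy concrete, here is a minimal sketch that resamples every class to a common target size, undersampling classes above the target and oversampling those below it. It is not the procedure analyzed in the paper: the function name `mixed_resample`, the parameter `n_per_class`, and the use of plain random duplication for oversampling are all assumptions made for illustration.

```python
import numpy as np

def mixed_resample(X, y, n_per_class, seed=0):
    """Randomly resample each class to exactly n_per_class examples.

    Hypothetical helper for illustration only. Classes larger than
    n_per_class are undersampled (drawn without replacement); smaller
    classes are oversampled (drawn with replacement, so rows repeat).
    """
    rng = np.random.default_rng(seed)
    chosen = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        # Sample with replacement only when the class is too small.
        replace = idx.size < n_per_class
        chosen.append(rng.choice(idx, size=n_per_class, replace=replace))
    # Shuffle so that the classes are interleaved in the output.
    chosen = rng.permutation(np.concatenate(chosen))
    return X[chosen], y[chosen]

# Toy imbalanced dataset: 90 majority examples, 10 minority examples.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(100, 5))
y = np.array([0] * 90 + [1] * 10)

X_bal, y_bal = mixed_resample(X, y, n_per_class=30)
print(np.bincount(y_bal))  # [30 30]
```

Choosing `n_per_class` between the two class sizes is what makes the strategy mixed. Setting it to the minority size gives pure undersampling, and setting it to the majority size gives pure oversampling.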

Related papers

From Linear to Nonlinear: Provable Weak-to-Strong Generalization through Feature Learning [27.3606707777401]
We provide a formal analysis of weak-to-strong generalization from a linear CNN (weak) to a two-layer ReLU CNN (strong). Our analysis identifies two regimes -- data-scarce and data-abundant -- based on the signal-to-noise characteristics of the dataset.
arXiv Detail & Related papers (2025-10-28T07:53:24Z)
Optimal Regularization for Performative Learning [29.2228276896028]
We show how regularization can help cope with performative effects by studying its impact in high-dimensional ridge regression. We show that the optimal regularization scales with the overall strength of the performative effect, making it possible to set the regularization in anticipation of this effect.
arXiv Detail & Related papers (2025-10-14T08:00:08Z)
A Statistical Theory of Contrastive Learning via Approximate Sufficient Statistics [19.24473530318175]
We develop a new theoretical framework for analyzing data augmentation-based contrastive learning. We show that minimizing SimCLR and other contrastive losses yields encoders that are approximately sufficient.
arXiv Detail & Related papers (2025-03-21T21:07:18Z)
Dynamic Loss-Based Sample Reweighting for Improved Large Language Model Pretraining [55.262510814326035]
Existing reweighting strategies primarily focus on group-level data importance. We introduce novel algorithms for dynamic, instance-level data reweighting. Our framework allows us to devise reweighting strategies deprioritizing redundant or uninformative data.
arXiv Detail & Related papers (2025-02-10T17:57:15Z)
Histogram Approaches for Imbalanced Data Streams Regression [1.8385275253826225]
Imbalanced domains pose a significant challenge in real-world predictive analytics, particularly in the context of regression. This study introduces histogram-based sampling strategies to overcome this constraint. Comprehensive experiments on synthetic and real-world benchmarks demonstrate that HistUS and HistOS substantially improve rare-case prediction accuracy.
arXiv Detail & Related papers (2025-01-29T11:03:02Z)
DRoP: Distributionally Robust Pruning [11.930434318557156]
We conduct the first systematic study of the impact of data pruning on classification bias of trained models. We propose DRoP, a distributionally robust approach to pruning and empirically demonstrate its performance on standard computer vision benchmarks.
arXiv Detail & Related papers (2024-04-08T14:55:35Z)
TRIAGE: Characterizing and auditing training data for improved regression [80.11415390605215]
We introduce TRIAGE, a novel data characterization framework tailored to regression tasks and compatible with a broad class of regressors. TRIAGE utilizes conformal predictive distributions to provide a model-agnostic scoring method, the TRIAGE score. We show that TRIAGE's characterization is consistent and highlight its utility to improve performance via data sculpting/filtering, in multiple regression settings.
arXiv Detail & Related papers (2023-10-29T10:31:59Z)
Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data. We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
arXiv Detail & Related papers (2023-03-06T14:49:59Z)
Systematic Evaluation of Predictive Fairness [60.0947291284978]
Mitigating bias in training on biased datasets is an important open problem. We examine the performance of various debiasing methods across multiple tasks. We find that data conditions have a strong influence on relative model performance.
arXiv Detail & Related papers (2022-10-17T05:40:13Z)
Bias-inducing geometries: an exactly solvable data model with fairness implications [13.690313475721094]
We introduce an exactly solvable high-dimensional model of data imbalance. We analytically unpack the typical properties of learning models trained in this synthetic framework. We obtain exact predictions for the observables that are commonly employed for fairness assessment.
arXiv Detail & Related papers (2022-05-31T16:27:57Z)
Generalizable Information Theoretic Causal Representation [37.54158138447033]
We propose to learn causal representation from observational data by regularizing the learning procedure with mutual information measures according to our hypothetical causal graph. The optimization involves a counterfactual loss, based on which we deduce a theoretical guarantee that the causality-inspired learning is with reduced sample complexity and better generalization ability.
arXiv Detail & Related papers (2022-02-17T00:38:35Z)
Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class. This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples. Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
Improved Fine-tuning by Leveraging Pre-training Data: Theory and Practice [52.11183787786718]
Fine-tuning a pre-trained model on the target data is widely used in many deep learning applications. Recent studies have empirically shown that training from scratch has the final performance that is no worse than this pre-training strategy. We propose a novel selection strategy to select a subset from pre-training data to help improve the generalization on the target task.
arXiv Detail & Related papers (2021-11-24T06:18:32Z)
Linear Regression with Distributed Learning: A Generalization Error Perspective [0.0]
We investigate the performance of distributed learning for large-scale linear regression. We focus on the generalization error, i.e., the performance on unseen data. Our results show that the generalization error of the distributed solution can be substantially higher than that of the centralized solution.
arXiv Detail & Related papers (2021-01-22T08:43:28Z)
On the Benefits of Invariance in Neural Networks [56.362579457990094]
We show that training with data augmentation leads to better estimates of risk and thereof gradients, and we provide a PAC-Bayes generalization bound for models trained with data augmentation. We also show that compared to data augmentation, feature averaging reduces generalization error when used with convex losses, and tightens PAC-Bayes bounds.
arXiv Detail & Related papers (2020-05-01T02:08:58Z)
Learning Unbiased Representations via Mutual Information Backpropagation [36.383338079229695]
In particular, we face the case where some attributes (bias) of the data, if learned by the model, can severely compromise its generalization properties. We propose a novel end-to-end optimization strategy, which simultaneously estimates and minimizes the mutual information between the learned representation and the data attributes.
arXiv Detail & Related papers (2020-03-13T18:06:31Z)

This list is automatically generated from the titles and abstracts of the papers on this site.