Selecting the suitable resampling strategy for imbalanced data
classification regarding dataset properties
- URL: http://arxiv.org/abs/2201.07932v1
- Date: Wed, 15 Dec 2021 18:56:39 GMT
- Title: Selecting the suitable resampling strategy for imbalanced data
classification regarding dataset properties
- Authors: Mohamed S. Kraiem, Fernando Sánchez-Hernández and María N.
Moreno-García
- Score: 62.997667081978825
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In many application domains such as medicine, information retrieval,
cybersecurity, social media, etc., datasets used for inducing classification
models often have an unequal distribution of the instances of each class. This
situation, known as imbalanced data classification, causes low predictive
performance for the minority class examples. Thus, the prediction model is
unreliable although the overall model accuracy can be acceptable. Oversampling
and undersampling techniques are well-known strategies to deal with this
problem by balancing the number of examples of each class. However, their
effectiveness depends on several factors, mainly intrinsic data
characteristics such as the imbalance ratio, dataset size and dimensionality,
overlap between classes, and the presence of borderline examples. In this work, the impact of
these factors is analyzed through a comprehensive comparative study involving
40 datasets from different application areas. The objective is to obtain models
for automatic selection of the best resampling strategy for any dataset based
on its characteristics. These models allow us to check several factors
simultaneously considering a wide range of values since they are induced from
very varied datasets that cover a broad spectrum of conditions. This differs
from most studies that focus on the individual analysis of the characteristics
or cover a small range of values. In addition, the study encompasses both basic
and advanced resampling strategies that are evaluated by means of eight
different performance metrics, including new measures specifically designed for
imbalanced data classification. The general nature of the proposal allows the
most appropriate method to be chosen regardless of the domain, avoiding the
search for special-purpose techniques that might only be valid for the target data.
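The imbalance ratio and the basic resampling strategies discussed in the abstract can be illustrated with a minimal, stdlib-only sketch. This is not the paper's selection model; `imbalance_ratio` and `random_oversample` are illustrative names, and random oversampling is only the simplest of the strategies the study compares.

```python
import random
from collections import Counter

def imbalance_ratio(labels):
    """Ratio of the majority-class count to the minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

def random_oversample(X, y, seed=0):
    """Duplicate minority-class examples (sampling with replacement)
    until every class matches the majority-class count."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - n):
            i = rng.choice(idx)
            X_out.append(X[i])
            y_out.append(label)
    return X_out, y_out

# Toy dataset: 8 majority examples, 2 minority examples (IR = 4.0).
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2
print(imbalance_ratio(y))                  # 4.0
X_bal, y_bal = random_oversample(X, y)
print(sorted(Counter(y_bal).items()))      # [(0, 8), (1, 8)]
```

Undersampling would instead discard majority examples down to the minority count; the paper's contribution is a model that picks between such strategies from dataset characteristics like the ratio computed above.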
Related papers
- Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
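The mixing step described in this blurb might be sketched as a convex interpolation between a minority and a majority example (a mixup-style operation); the paper's actual iterative scheme may differ, and `mix_samples`/`synthesize` are hypothetical names for illustration.

```python
import random

def mix_samples(x_minority, x_majority, lam):
    """Convex combination of one minority and one majority example.
    lam close to 1 keeps the synthetic point near the minority class."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_minority, x_majority)]

def synthesize(minority, majority, n_new, lam_low=0.7, seed=0):
    """Generate n_new minority-leaning synthetic samples by mixing
    randomly drawn pairs, with lam drawn from [lam_low, 1.0]."""
    rng = random.Random(seed)
    return [mix_samples(rng.choice(minority), rng.choice(majority),
                        rng.uniform(lam_low, 1.0))
            for _ in range(n_new)]

print(mix_samples([1.0, 1.0], [0.0, 0.0], 0.8))  # [0.8, 0.8]
```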
arXiv Detail & Related papers (2023-08-28T18:48:34Z) - Automatic Generation of Attention Rules For Containment of Machine
Learning Model Errors [1.4987559345379062]
We present several algorithms ("strategies") for determining optimal rules to separate observations.
In particular, we prefer strategies that use feature-based slicing because they are human-interpretable, model-agnostic, and require minimal supplementary inputs or knowledge.
To evaluate strategies, we introduce metrics to measure various desired qualities, such as their performance, stability, and generalizability to unseen data.
arXiv Detail & Related papers (2023-05-14T10:15:35Z) - Systematic Evaluation of Predictive Fairness [60.0947291284978]
Mitigating bias in training on biased datasets is an important open problem.
We examine the performance of various debiasing methods across multiple tasks.
We find that data conditions have a strong influence on relative model performance.
arXiv Detail & Related papers (2022-10-17T05:40:13Z) - An Additive Instance-Wise Approach to Multi-class Model Interpretation [53.87578024052922]
Interpretable machine learning offers insights into what factors drive a certain prediction of a black-box system.
Existing methods mainly focus on selecting explanatory input features, which follow either locally additive or instance-wise approaches.
This work exploits the strengths of both methods and proposes a global framework for learning local explanations simultaneously for multiple target classes.
arXiv Detail & Related papers (2022-07-07T06:50:27Z) - Determination of class-specific variables in nonparametric
multiple-class classification [0.0]
We propose a probability-based nonparametric multiple-class classification method, and integrate it with the ability of identifying high impact variables for individual class.
We report the properties of the proposed method, and use both synthesized and real data sets to illustrate its properties under different classification situations.
arXiv Detail & Related papers (2022-05-07T10:08:58Z) - Equivariance Allows Handling Multiple Nuisance Variables When Analyzing
Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show how bringing recent results on equivariant representation learning instantiated on structured spaces together with simple use of classical results on causal inference provides an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z) - Adaptive Sampling Strategies to Construct Equitable Training Datasets [0.7036032466145111]
In domains ranging from computer vision to natural language processing, machine learning models have been shown to exhibit stark disparities.
One factor contributing to these performance gaps is a lack of representation in the data the models are trained on.
We formalize the problem of creating equitable training datasets, and propose a statistical framework for addressing this problem.
arXiv Detail & Related papers (2022-01-31T19:19:30Z) - Dynamic Instance-Wise Classification in Correlated Feature Spaces [15.351282873821935]
In a typical machine learning setting, the predictions on all test instances are based on a common subset of features discovered during model training.
A new method is proposed that sequentially selects the best feature to evaluate for each test instance individually, and stops the selection process to make a prediction once it determines that no further improvement can be achieved with respect to classification accuracy.
The effectiveness, generalizability, and scalability of the proposed method is illustrated on a variety of real-world datasets from diverse application domains.
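The select-then-stop loop described in this blurb can be sketched as a greedy search under a toy scoring function; the paper's actual criterion is classification accuracy for the individual test instance, for which `scorer` below is only a stand-in.

```python
def select_features(scorer, n_features):
    """Greedy sketch of per-instance feature selection: repeatedly add the
    feature that most improves the score, and stop as soon as no remaining
    candidate improves it."""
    selected, best = set(), scorer(set())
    while True:
        candidates = [(scorer(selected | {f}), f)
                      for f in range(n_features) if f not in selected]
        if not candidates:
            return selected
        score, f = max(candidates)
        if score <= best:          # stopping rule: no further improvement
            return selected
        selected.add(f)
        best = score

# Toy scorer: only features 0 and 2 carry signal.
informative = {0, 2}
scorer = lambda s: len(s & informative)
print(sorted(select_features(scorer, 4)))  # [0, 2]
```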
arXiv Detail & Related papers (2021-06-08T20:20:36Z) - Characterizing Fairness Over the Set of Good Models Under Selective
Labels [69.64662540443162]
We develop a framework for characterizing predictive fairness properties over the set of models that deliver similar overall performance.
We provide tractable algorithms to compute the range of attainable group-level predictive disparities.
We extend our framework to address the empirically relevant challenge of selectively labelled data.
arXiv Detail & Related papers (2021-01-02T02:11:37Z) - Discriminative, Generative and Self-Supervised Approaches for
Target-Agnostic Learning [8.666667951130892]
Discriminative, generative and self-supervised learning models are shown to perform well at the task.
Our derived theorem for the pseudo-likelihood theory also shows that they are related for inferring a joint distribution model.
arXiv Detail & Related papers (2020-11-12T15:03:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.