Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties
- URL: http://arxiv.org/abs/2201.07932v1
- Date: Wed, 15 Dec 2021 18:56:39 GMT
- Authors: Mohamed S. Kraiem, Fernando Sánchez-Hernández and María N. Moreno-García
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In many application domains such as medicine, information retrieval,
cybersecurity, social media, etc., datasets used for inducing classification
models often have an unequal distribution of the instances of each class. This
situation, known as imbalanced data classification, causes low predictive
performance for the minority class examples. Thus, the prediction model is
unreliable although the overall model accuracy can be acceptable. Oversampling
and undersampling techniques are well-known strategies to deal with this
problem by balancing the number of examples of each class. However, their
effectiveness depends on several factors mainly related to data intrinsic
characteristics, such as imbalance ratio, dataset size and dimensionality,
overlapping between classes or borderline examples. In this work, the impact of
these factors is analyzed through a comprehensive comparative study involving
40 datasets from different application areas. The objective is to obtain models
for automatic selection of the best resampling strategy for any dataset based
on its characteristics. These models allow us to check several factors
simultaneously considering a wide range of values since they are induced from
very varied datasets that cover a broad spectrum of conditions. This differs
from most studies that focus on the individual analysis of the characteristics
or cover a small range of values. In addition, the study encompasses both basic
and advanced resampling strategies that are evaluated by means of eight
different performance metrics, including new measures specifically designed for
imbalanced data classification. The general nature of the proposal allows the
choice of the most appropriate method regardless of the domain, avoiding the
search for special purpose techniques that could be valid for the target data.
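The basic resampling strategies the study compares can be sketched in a few lines. The following plain-Python random oversampler (all names and the toy data are illustrative, not taken from the paper) computes the imbalance ratio the abstract mentions and duplicates minority-class examples until the class counts match:

```python
import random
from collections import Counter

def imbalance_ratio(labels):
    """Ratio of majority-class count to minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

def random_oversample(X, y, seed=0):
    """Duplicate minority-class examples at random until all classes
    reach the majority-class count (random oversampling)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out

X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
print(imbalance_ratio(y))   # 4.0
Xb, yb = random_oversample(X, y)
print(Counter(yb))          # class counts are now equal
```

Undersampling works symmetrically, discarding majority-class examples down to the minority count; libraries such as imbalanced-learn provide production implementations of both families, including the advanced variants evaluated in the paper.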
Related papers
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, imposing a negative impact on training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- Harnessing Diversity for Important Data Selection in Pretraining Large Language Models [39.89232835928945]
Quad considers both quality and diversity by using data influence to achieve state-of-the-art pre-training results.
For diversity, Quad clusters the dataset so that instances within each cluster are similar while instances across different clusters are diverse.
arXiv Detail & Related papers (2024-09-25T14:49:29Z)
- Fair Overlap Number of Balls (Fair-ONB): A Data-Morphology-based Undersampling Method for Bias Reduction [8.691440960669649]
One of the key issues regarding classification problems in Trustworthy Artificial Intelligence is ensuring Fairness in the prediction of different classes.
Data quality is critical in these cases, as biases in training data can be reflected in machine learning models, impacting human lives and failing to comply with current regulations.
This work proposes Fair Overlap Number of Balls (Fair-ONB), an undersampling method that harnesses the data morphology of the different data groups to perform guided undersampling in overlap areas.
arXiv Detail & Related papers (2024-07-19T11:16:02Z)
- Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
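The summary above does not specify the exact mixing procedure, but the idea of generating synthetic samples by mixing minority and majority examples can be sketched with a generic mixup-style convex interpolation (the function name, the weighting scheme, and the data are assumptions for illustration):

```python
import random

def mix_samples(minority, majority, n_new, alpha=0.8, seed=0):
    """Generate synthetic samples by convex interpolation between a
    random minority example and a random majority example, weighted
    toward the minority side (a generic mixup-style sketch)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        b = rng.choice(majority)
        lam = alpha + rng.random() * (1 - alpha)  # lam in [alpha, 1)
        synthetic.append([lam * ai + (1 - lam) * bi
                          for ai, bi in zip(a, b)])
    return synthetic

new = mix_samples([[1.0, 1.0]], [[0.0, 0.0]], n_new=3)
print(new)  # each synthetic point lies close to the minority sample
```

Keeping `alpha` close to 1 anchors the synthetic points near the minority class while still borrowing variation from the majority region between the classes.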
arXiv Detail & Related papers (2023-08-28T18:48:34Z)
- Automatic Generation of Attention Rules For Containment of Machine Learning Model Errors [1.4987559345379062]
We present several algorithms ("strategies") for determining optimal rules to separate observations.
In particular, we prefer strategies that use feature-based slicing because they are human-interpretable, model-agnostic, and require minimal supplementary inputs or knowledge.
To evaluate strategies, we introduce metrics to measure various desired qualities, such as their performance, stability, and generalizability to unseen data.
arXiv Detail & Related papers (2023-05-14T10:15:35Z)
- An Additive Instance-Wise Approach to Multi-class Model Interpretation [53.87578024052922]
Interpretable machine learning offers insights into what factors drive a certain prediction of a black-box system.
Existing methods mainly focus on selecting explanatory input features, which follow either locally additive or instance-wise approaches.
This work exploits the strengths of both methods and proposes a global framework for learning local explanations simultaneously for multiple target classes.
arXiv Detail & Related papers (2022-07-07T06:50:27Z)
- Determination of class-specific variables in nonparametric multiple-class classification [0.0]
We propose a probability-based nonparametric multiple-class classification method, and integrate it with the ability to identify high-impact variables for each individual class.
We report the properties of the proposed method, and use both synthesized and real data sets to illustrate its properties under different classification situations.
arXiv Detail & Related papers (2022-05-07T10:08:58Z)
- Equivariance Allows Handling Multiple Nuisance Variables When Analyzing Pooled Neuroimaging Datasets [53.34152466646884]
In this paper, we show how bringing recent results on equivariant representation learning instantiated on structured spaces together with simple use of classical results on causal inference provides an effective practical solution.
We demonstrate how our model allows dealing with more than one nuisance variable under some assumptions and can enable analysis of pooled scientific datasets in scenarios that would otherwise entail removing a large portion of the samples.
arXiv Detail & Related papers (2022-03-29T04:54:06Z)
- Dynamic Instance-Wise Classification in Correlated Feature Spaces [15.351282873821935]
In a typical machine learning setting, the predictions on all test instances are based on a common subset of features discovered during model training.
A new method is proposed that sequentially selects the best feature to evaluate for each test instance individually, and stops the selection process to make a prediction once it determines that no further improvement can be achieved with respect to classification accuracy.
The effectiveness, generalizability, and scalability of the proposed method are illustrated on a variety of real-world datasets from diverse application domains.
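The per-instance stopping rule described in this entry, evaluating features one at a time and predicting once no further accuracy improvement is expected, can be sketched as follows (the gain scores and threshold are hypothetical placeholders for a learned estimate, not the paper's actual criterion):

```python
def sequential_feature_ids(expected_gains, min_gain=0.05):
    """Take features in descending order of (hypothetical) expected
    accuracy gain; stop once the next gain falls below min_gain,
    at which point the classifier would make its prediction."""
    chosen = []
    ranked = sorted(enumerate(expected_gains),
                    key=lambda t: t[1], reverse=True)
    for idx, gain in ranked:
        if gain < min_gain:
            break  # no further improvement expected: predict now
        chosen.append(idx)
    return chosen

print(sequential_feature_ids([0.40, 0.01, 0.20, 0.03]))  # -> [0, 2]
```

Because the gains would be estimated per test instance, different instances can end up evaluated on different feature subsets, which is the point of the dynamic approach.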
arXiv Detail & Related papers (2021-06-08T20:20:36Z)
- Characterizing Fairness Over the Set of Good Models Under Selective Labels [69.64662540443162]
We develop a framework for characterizing predictive fairness properties over the set of models that deliver similar overall performance.
We provide tractable algorithms to compute the range of attainable group-level predictive disparities.
We extend our framework to address the empirically relevant challenge of selectively labelled data.
arXiv Detail & Related papers (2021-01-02T02:11:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.