An Empirical Study on the Joint Impact of Feature Selection and Data
Resampling on Imbalance Classification
- URL: http://arxiv.org/abs/2109.00201v1
- Date: Wed, 1 Sep 2021 06:01:51 GMT
- Title: An Empirical Study on the Joint Impact of Feature Selection and Data
Resampling on Imbalance Classification
- Authors: Chongsheng Zhang, Paolo Soda, Jingjun Bi, Gaojuan Fan, George
Almpanidis, Salvador Garcia
- Abstract summary: This study focuses on the synergy between feature selection and data resampling for imbalance classification.
We conduct a large number of experiments on 52 publicly available datasets, using 9 feature selection methods, 6 resampling approaches for class imbalance learning, and 3 well-known classification algorithms.
- Score: 4.506770920842088
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Real-world datasets often exhibit varying degrees of imbalanced (i.e.,
long-tailed or skewed) distributions. While the majority (a.k.a. head or
frequent) classes have sufficient samples, the minority (a.k.a. tail or rare)
classes are often under-represented by a rather limited number of samples. On one
hand, data resampling is a common approach to tackling class imbalance. On the
other hand, dimension reduction, which reduces the feature space, is a
conventional machine learning technique for building stronger classification
models on a dataset. However, the possible synergy between feature selection
and data resampling for high-performance imbalance classification has rarely
been investigated before. To address this issue, this paper carries out a
comprehensive empirical study on the joint influence of feature selection and
resampling on two-class imbalance classification. Specifically, we study the
performance of two opposite pipelines for imbalance classification, i.e.,
applying feature selection before or after data resampling. We conduct a large
number of experiments (9,225 in total) on 52 publicly available
datasets, using 9 feature selection methods, 6 resampling approaches for class
imbalance learning, and 3 well-known classification algorithms. Experimental
results show that there is no constant winner between the two pipelines, thus
both of them should be considered to derive the best performing model for
imbalance classification. We also find that the performance of an imbalance
classification model depends on the classifier adopted, the ratio between the
number of majority and minority samples (IR), as well as on the ratio between
the number of samples and features (SFR). Overall, this study should provide a
valuable new point of reference for researchers and practitioners in imbalance learning.
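Below is a minimal sketch of the two pipelines compared in the paper, written with scikit-learn and imbalanced-learn: feature selection before data resampling versus resampling before feature selection. SelectKBest, SMOTE, a decision tree, the synthetic dataset, and AUC scoring are illustrative assumptions for demonstration only; they are not taken from the paper's lists of 9 feature selection methods, 6 resamplers, and 3 classifiers.

```python
# Sketch (not the authors' released code) of the two orderings studied:
# Pipeline A: feature selection -> resampling -> classifier
# Pipeline B: resampling -> feature selection -> classifier
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # allows resamplers as pipeline steps

# Synthetic two-class dataset with an imbalance ratio (IR) of roughly 9:1.
X, y = make_classification(n_samples=1000, n_features=40, n_informative=10,
                           weights=[0.9, 0.1], random_state=0)

# Dataset properties the paper relates to performance:
#   IR  = number of majority samples / number of minority samples
#   SFR = number of samples / number of features
ir = (y == 0).sum() / (y == 1).sum()
sfr = X.shape[0] / X.shape[1]
print(f"IR = {ir:.1f}, SFR = {sfr:.1f}")

# Pipeline A: feature selection first, then data resampling.
fs_then_resample = Pipeline([
    ("fs", SelectKBest(f_classif, k=10)),
    ("resample", SMOTE(random_state=0)),
    ("clf", DecisionTreeClassifier(random_state=0)),
])

# Pipeline B: data resampling first, then feature selection.
resample_then_fs = Pipeline([
    ("resample", SMOTE(random_state=0)),
    ("fs", SelectKBest(f_classif, k=10)),
    ("clf", DecisionTreeClassifier(random_state=0)),
])

# The study reports no constant winner, so evaluate both and keep the better one.
for name, pipe in [("FS -> resampling", fs_then_resample),
                   ("resampling -> FS", resample_then_fs)]:
    auc = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: mean AUC = {auc:.3f}")
```

Since the paper finds no constant winner between the two orderings, and reports that the outcome depends on the classifier as well as on IR and SFR, a practical recipe is to evaluate both orderings per dataset and keep the better-performing one.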
Related papers
- When resampling/reweighting improves feature learning in imbalanced classification?: A toy-model study [5.5730368125641405]
A toy model of binary classification is studied with the aim of clarifying the effect of class-wise resampling/reweighting on feature learning performance in the presence of class imbalance.
The result shows that there exists a case in which applying no resampling/reweighting gives the best feature learning performance, irrespective of the choice of loss or classifier.
arXiv Detail & Related papers (2024-09-09T13:31:00Z)
- Probabilistic Contrastive Learning for Long-Tailed Visual Recognition [78.70453964041718]
Long-tailed distributions frequently emerge in real-world data, where a large number of minority categories contain a limited number of samples.
Recent investigations have revealed that supervised contrastive learning exhibits promising potential in alleviating the data imbalance.
We propose a novel probabilistic contrastive (ProCo) learning algorithm that estimates the data distribution of the samples from each class in the feature space.
arXiv Detail & Related papers (2024-03-11T13:44:49Z)
- Balanced Classification: A Unified Framework for Long-Tailed Object Detection [74.94216414011326]
Conventional detectors suffer from performance degradation when dealing with long-tailed data due to a classification bias towards the majority head categories.
We introduce a unified framework called BAlanced CLassification (BACL), which enables adaptive rectification of inequalities caused by disparities in category distribution.
BACL consistently achieves performance improvements across various datasets with different backbones and architectures.
arXiv Detail & Related papers (2023-08-04T09:11:07Z)
- Delving into Semantic Scale Imbalance [45.30062061215943]
We define and quantify the semantic scale of classes, which is used to measure the feature diversity of classes.
We propose semantic-scale-balanced learning, including a general loss improvement scheme and a dynamic re-weighting training framework.
Comprehensive experiments show that dynamic semantic-scale-balanced learning consistently enables the model to perform superiorly on large-scale long-tailed and non-long-tailed natural and medical datasets.
arXiv Detail & Related papers (2022-12-30T09:40:09Z)
- Systematic Evaluation of Predictive Fairness [60.0947291284978]
Mitigating bias in training on biased datasets is an important open problem.
We examine the performance of various debiasing methods across multiple tasks.
We find that data conditions have a strong influence on relative model performance.
arXiv Detail & Related papers (2022-10-17T05:40:13Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- Divide-and-Conquer Hard-thresholding Rules in High-dimensional Imbalanced Classification [1.0312968200748118]
We study the impact of imbalance class sizes on the linear discriminant analysis (LDA) in high dimensions.
We show that due to data scarcity in one class, referred to as the minority class, the LDA ignores the minority class, yielding a maximum misclassification rate.
We propose a new construction of a hard-thresholding rule based on a divide-and-conquer technique that reduces the large difference between the misclassification rates.
arXiv Detail & Related papers (2021-11-05T07:44:28Z)
- Statistical Theory for Imbalanced Binary Classification [8.93993657323783]
We show that optimal classification performance depends on certain properties of class imbalance that have not previously been formalized.
Specifically, we propose a novel sub-type of class imbalance, which we call Uniform Class Imbalance.
These results provide some of the first meaningful finite-sample statistical theory for imbalanced binary classification.
arXiv Detail & Related papers (2021-07-05T03:55:43Z)
- Long-Tailed Recognition Using Class-Balanced Experts [128.73438243408393]
We propose an ensemble of class-balanced experts that combines the strength of diverse classifiers.
Our ensemble of class-balanced experts reaches results close to state-of-the-art and an extended ensemble establishes a new state-of-the-art on two benchmarks for long-tailed recognition.
arXiv Detail & Related papers (2020-04-07T20:57:44Z)
- Imbalanced Data Learning by Minority Class Augmentation using Capsule Adversarial Networks [31.073558420480964]
We propose a method to restore balance in imbalanced image data by coalescing two concurrent methods.
In our model, generative and discriminative networks play a novel competitive game.
The coalescing of capsule-GAN is effective at recognizing highly overlapping classes with far fewer parameters than the convolutional-GAN.
arXiv Detail & Related papers (2020-04-05T12:36:06Z)
- M2m: Imbalanced Classification via Major-to-minor Translation [79.09018382489506]
In most real-world scenarios, labeled training datasets are highly class-imbalanced, and deep neural networks trained on them generalize poorly under a balanced testing criterion.
In this paper, we explore a novel yet simple way to alleviate this issue by augmenting less-frequent classes via translating samples from more-frequent classes.
Our experimental results on a variety of class-imbalanced datasets show that the proposed method improves the generalization on minority classes significantly compared to other existing re-sampling or re-weighting methods.
arXiv Detail & Related papers (2020-04-01T13:21:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.