Imbalanced data preprocessing techniques utilizing local data characteristics
- URL: http://arxiv.org/abs/2111.14120v1
- Date: Sun, 28 Nov 2021 11:48:26 GMT
- Title: Imbalanced data preprocessing techniques utilizing local data characteristics
- Authors: Michał Koziarski
- Abstract summary: Data imbalance is the disproportion between the number of training observations coming from different classes.
The focus of this thesis is the development of novel data resampling strategies.
- Score: 2.28438857884398
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data imbalance, that is, the disproportion between the number of training observations coming from different classes, remains one of the most significant challenges affecting contemporary machine learning. The negative impact of data imbalance on traditional classification algorithms can be reduced by data preprocessing techniques, methods that manipulate the training data to artificially reduce the degree of imbalance. However, existing data preprocessing techniques, in particular SMOTE and its derivatives, which constitute the most prevalent paradigm of imbalanced data preprocessing, tend to be susceptible to various data difficulty factors. This is in part because the original SMOTE algorithm does not utilize information about majority class observations. The focus of this thesis is the development of novel data resampling strategies that natively utilize information about the distribution of both the minority and majority classes. The thesis summarizes the content of 12 research papers focused on the proposed binary data resampling strategies, their translation to the multi-class setting, and their practical application to the problem of histopathological data classification.
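To make the gap the thesis targets concrete, here is a minimal, hypothetical sketch (not one of the thesis algorithms): plain SMOTE-style interpolation generates synthetic minority candidates, then a crude majority-aware filter discards candidates that land closer to the majority class. All data, parameters, and helper names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced 2-D data: 200 majority points, 20 minority points.
X_maj = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X_min = rng.normal(loc=2.5, scale=0.5, size=(20, 2))

def smote_like(X_min, n_new, k=5, rng=rng):
    """Plain SMOTE-style interpolation between minority neighbours."""
    samples = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # k nearest minority neighbours of point i (brute force).
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]
        j = rng.choice(nn)
        gap = rng.random()
        samples.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(samples)

def majority_aware_filter(candidates, X_maj, X_min):
    """Keep only synthetic points closer to the minority class than to the
    majority class -- a crude proxy for using majority-class information."""
    d_maj = np.min(np.linalg.norm(candidates[:, None] - X_maj[None], axis=2), axis=1)
    d_min = np.min(np.linalg.norm(candidates[:, None] - X_min[None], axis=2), axis=1)
    return candidates[d_min < d_maj]

cand = smote_like(X_min, n_new=180)
kept = majority_aware_filter(cand, X_maj, X_min)
print(f"generated {len(cand)} candidates, kept {len(kept)} after filtering")
```

Plain SMOTE, which never looks at `X_maj`, would keep all 180 candidates, including any that fall inside majority regions; the thesis develops resampling strategies that account for the majority distribution natively rather than via such a post-hoc filter.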
Related papers
- Improving SMOTE via Fusing Conditional VAE for Data-adaptive Noise Filtering [0.5735035463793009]
We introduce a framework to enhance the SMOTE algorithm using Variational Autoencoders (VAE).
Our approach systematically quantifies the density of data points in a low-dimensional latent space using the VAE, simultaneously incorporating information on class labels and classification difficulty.
Empirical studies on several imbalanced datasets show that this simple process improves the conventional SMOTE algorithm over deep learning models.
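A minimal sketch of the core idea of interpolating in a learned latent space rather than in input space follows. Training a real VAE is out of scope here, so a fixed random linear map stands in for the trained encoder/decoder; `encode` and `decode` are hypothetical stand-ins, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a trained VAE: a fixed random linear map and its pseudo-inverse.
# In the actual method the encoder/decoder are learned, and latent density
# (plus class labels and difficulty) guides where to oversample.
W = rng.normal(size=(10, 4))               # data dim 10 -> latent dim 4
encode = lambda X: X @ W                    # hypothetical encoder
decode = lambda Z: Z @ np.linalg.pinv(W)    # hypothetical decoder

X_min = rng.normal(loc=1.0, size=(30, 10))  # toy minority samples
Z = encode(X_min)

# SMOTE-style interpolation performed in latent space, not input space.
i, j = rng.integers(len(Z), size=(2, 50))
gap = rng.random((50, 1))
Z_new = Z[i] + gap * (Z[j] - Z[i])
X_new = decode(Z_new)
print(X_new.shape)  # (50, 10) synthetic minority samples
```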
arXiv Detail & Related papers (2024-05-30T07:06:02Z)
- Few-shot learning for COVID-19 Chest X-Ray Classification with Imbalanced Data: An Inter vs. Intra Domain Study [49.5374512525016]
Medical image datasets are essential for training models used in computer-aided diagnosis, treatment planning, and medical research.
Some challenges are associated with these datasets, including variability in data distribution, data scarcity, and transfer learning issues when using models pre-trained on generic images.
We propose a methodology based on Siamese neural networks in which a series of techniques are integrated to mitigate the effects of data scarcity and distribution imbalance.
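As a rough sketch of the Siamese building block (not the paper's architecture), the PyTorch snippet below embeds both elements of a pair with a shared network and applies a standard contrastive loss; all dimensions and names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    """Shared embedding network applied to both inputs of a pair."""
    def __init__(self, in_dim=128, emb_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                 nn.Linear(64, emb_dim))
    def forward(self, x):
        return self.net(x)

def contrastive_loss(z1, z2, same, margin=1.0):
    """Pull same-class pairs together, push different-class pairs apart."""
    d = F.pairwise_distance(z1, z2)
    return (same * d.pow(2) + (1 - same) * F.relu(margin - d).pow(2)).mean()

enc = SiameseEncoder()
x1, x2 = torch.randn(16, 128), torch.randn(16, 128)
same = torch.randint(0, 2, (16,)).float()  # 1 if the pair shares a class
loss = contrastive_loss(enc(x1), enc(x2), same)
loss.backward()
```

Pairwise training like this sidesteps some effects of class imbalance, since the unit of training is a pair relation rather than a per-class example count.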
arXiv Detail & Related papers (2024-01-18T16:59:27Z)
- A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity [84.6421260559093]
This study is the largest set of experiments to validate, quantify, and expose undocumented intuitions about text pretraining.
Our findings indicate there does not exist a one-size-fits-all solution to filtering training data.
arXiv Detail & Related papers (2023-05-22T15:57:53Z)
- Is augmentation effective to improve prediction in imbalanced text datasets? [3.1690891866882236]
We argue that adjusting the cutoffs without data augmentation can produce similar results to oversampling techniques.
Our findings contribute to a better understanding of the strengths and limitations of different approaches to dealing with imbalanced data.
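A minimal sketch of the cutoff-adjustment idea, using scikit-learn on synthetic tabular data rather than text: the decision threshold of a plain classifier is swept on held-out data instead of resampling the training set. Dataset and parameters are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Toy ~9:1 imbalanced problem.
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# Sweep cutoffs instead of resampling the training data.
# (In practice this sweep belongs on a separate validation split.)
cutoffs = np.linspace(0.05, 0.95, 19)
scores = [f1_score(y_te, proba >= c) for c in cutoffs]
best = cutoffs[int(np.argmax(scores))]
print(f"default 0.5 -> F1={f1_score(y_te, proba >= 0.5):.3f}, "
      f"best cutoff {best:.2f} -> F1={max(scores):.3f}")
```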
arXiv Detail & Related papers (2023-04-20T13:07:31Z)
- On-the-fly Denoising for Data Augmentation in Natural Language Understanding [101.46848743193358]
We propose an on-the-fly denoising technique for data augmentation that learns from soft augmented labels provided by an organic teacher model trained on the cleaner original data.
Our method can be applied to general augmentation techniques and consistently improve the performance on both text classification and question-answering tasks.
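A hedged sketch of the soft-label idea (not the paper's full method): a teacher model scores augmented inputs, and the student is trained against those soft distributions with a KL objective. Both models here are trivial stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Teacher trained on cleaner original data supplies soft labels for augmented
# examples; the student learns from those distributions instead of hard
# (possibly noisy) augmented labels. Both models are hypothetical stand-ins.
teacher = nn.Linear(64, 4)
student = nn.Linear(64, 4)

x_aug = torch.randn(8, 64)                      # augmented inputs
with torch.no_grad():
    soft = F.softmax(teacher(x_aug), dim=-1)    # soft "denoised" labels

log_p = F.log_softmax(student(x_aug), dim=-1)
loss = F.kl_div(log_p, soft, reduction="batchmean")
loss.backward()
```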
arXiv Detail & Related papers (2022-12-20T18:58:33Z)
- Systematic Evaluation of Predictive Fairness [60.0947291284978]
Mitigating bias in training on biased datasets is an important open problem.
We examine the performance of various debiasing methods across multiple tasks.
We find that data conditions have a strong influence on relative model performance.
arXiv Detail & Related papers (2022-10-17T05:40:13Z)
- An Empirical Analysis of the Efficacy of Different Sampling Techniques for Imbalanced Classification [0.0]
The prevalence of imbalance in real-world datasets has led to the creation of a multitude of strategies for the class imbalance issue.
Standard classification algorithms tend to perform poorly when trained on imbalanced data.
We present a comprehensive analysis of 26 popular sampling techniques to understand their effectiveness in dealing with imbalanced data.
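For flavor, here is a small comparison along the same lines using the imbalanced-learn library, pitting a few common samplers against each other on a synthetic 95:5 problem; the paper's study covers 26 techniques on real datasets.

```python
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

samplers = {"none": None,
            "random-over": RandomOverSampler(random_state=0),
            "random-under": RandomUnderSampler(random_state=0),
            "SMOTE": SMOTE(random_state=0)}

for name, s in samplers.items():
    # Resample only the training split, then fit a plain classifier.
    Xr, yr = (X_tr, y_tr) if s is None else s.fit_resample(X_tr, y_tr)
    clf = LogisticRegression(max_iter=1000).fit(Xr, yr)
    print(f"{name:12s} balanced acc = "
          f"{balanced_accuracy_score(y_te, clf.predict(X_te)):.3f}")
```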
arXiv Detail & Related papers (2022-08-25T03:45:34Z)
- Imbalanced Classification via Explicit Gradient Learning From Augmented Data [0.0]
We propose a novel deep meta-learning technique to augment a given imbalanced dataset with new minority instances.
The advantage of the proposed method is demonstrated on synthetic and real-world datasets with various imbalance ratios.
arXiv Detail & Related papers (2022-02-21T22:16:50Z)
- CMW-Net: Learning a Class-Aware Sample Weighting Mapping for Robust Deep Learning [55.733193075728096]
Modern deep neural networks can easily overfit to biased training data containing corrupted labels or class imbalance.
Sample re-weighting methods are popularly used to alleviate this data bias issue.
We propose a meta-model capable of adaptively learning an explicit weighting scheme directly from data.
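For contrast with the learned weighting scheme proposed in the paper, the sketch below shows the much simpler static baseline of inverse-frequency class weighting in scikit-learn; CMW-Net instead learns the weighting mapping from data with a meta-model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)

# Static inverse-frequency weights, the classical hand-designed scheme.
w = compute_class_weight("balanced", classes=np.unique(y), y=y)
print(dict(zip(np.unique(y), w.round(2))))  # minority class weighted up

# The same scheme applied inside the classifier's loss.
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)
```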
arXiv Detail & Related papers (2022-02-11T13:49:51Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- Combined Cleaning and Resampling Algorithm for Multi-Class Imbalanced Data with Label Noise [11.868507571027626]
In this paper, we propose a novel oversampling technique, the Multi-Class Combined Cleaning and Resampling (MC-CCR) algorithm.
The proposed method utilizes an energy-based approach to model the regions suitable for oversampling, making it less affected by small disjuncts and outliers than SMOTE.
It combines this with a simultaneous cleaning operation aimed at reducing the effect of overlapping class distributions on the performance of the learning algorithms.
arXiv Detail & Related papers (2020-04-07T13:59:35Z)
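To give a feel for the energy-based idea in the last entry, here is a greatly simplified, hypothetical sketch: a sphere grows around each minority point until an energy budget is exhausted (expansion gets costlier as more majority points fall inside), majority points inside any sphere are removed (the published algorithm translates them outward instead), and synthetic minority samples are drawn inside the spheres. Everything here is an illustrative approximation, not the published MC-CCR.

```python
import numpy as np

rng = np.random.default_rng(2)
X_maj = rng.normal(0.0, 1.0, size=(200, 2))
X_min = rng.normal(1.5, 0.5, size=(15, 2))

energy = 0.5  # per-point budget spent on expanding the cleaning sphere

def sphere_radius(x, X_maj, energy):
    """Grow a sphere around minority point x; each enclosed majority point
    makes further expansion cost more (a crude stand-in for the energy model)."""
    d = np.sort(np.linalg.norm(X_maj - x, axis=1))
    r, left, cost = 0.0, energy, 1.0
    for di in d:
        if left < (di - r) * cost:
            return r + left / cost
        left -= (di - r) * cost
        r, cost = di, cost + 1.0
    return r + left / cost

radii = np.array([sphere_radius(x, X_maj, energy) for x in X_min])

# Cleaning: drop majority points inside any sphere (the real method moves them).
inside = np.array([np.any(np.linalg.norm(X_min - m, axis=1) < radii)
                   for m in X_maj])
X_maj_clean = X_maj[~inside]

# Oversampling: draw random synthetic minority points inside the spheres.
idx = rng.integers(len(X_min), size=100)
offsets = rng.normal(size=(100, 2))
offsets *= (rng.random((100, 1)) * radii[idx, None] /
            np.linalg.norm(offsets, axis=1, keepdims=True))
X_syn = X_min[idx] + offsets
print(len(X_maj) - len(X_maj_clean), "majority points cleaned,",
      len(X_syn), "synthesized")
```

Because radii shrink where majority points crowd a minority observation, difficult or noisy minority points get small spheres, which limits how far synthetic samples can stray into majority territory.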
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.