Imbalanced data preprocessing techniques utilizing local data characteristics
- URL: http://arxiv.org/abs/2111.14120v1
- Date: Sun, 28 Nov 2021 11:48:26 GMT
- Title: Imbalanced data preprocessing techniques utilizing local data characteristics
- Authors: Michał Koziarski
- Abstract summary: Data imbalance is the disproportion between the number of training observations coming from different classes.
The focus of this thesis is the development of novel data resampling strategies.
- Score: 2.28438857884398
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data imbalance, that is, the disproportion between the number of training observations coming from different classes, remains one of the most significant challenges affecting contemporary machine learning. The negative impact of data imbalance on traditional classification algorithms can be reduced by data preprocessing techniques, methods that manipulate the training data to artificially reduce the degree of imbalance. However, existing data preprocessing techniques, in particular SMOTE and its derivatives, which constitute the most prevalent paradigm of imbalanced data preprocessing, tend to be susceptible to various data difficulty factors. This is in part because the original SMOTE algorithm does not utilize information about majority class observations. The focus of this thesis is the development of novel data resampling strategies that natively utilize information about the distribution of both the minority and majority classes. The thesis summarizes the content of 12 research papers focused on the proposed binary data resampling strategies, their translation to the multi-class setting, and their practical application to the problem of histopathological data classification.
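To make the gap the thesis targets concrete, here is a minimal, hypothetical sketch (not one of the thesis algorithms): plain SMOTE-style interpolation generates synthetic minority candidates, then a crude majority-aware filter discards candidates that land closer to the majority class. All data, parameters, and helper names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced 2-D data: 200 majority points, 20 minority points.
X_maj = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X_min = rng.normal(loc=2.5, scale=0.5, size=(20, 2))

def smote_like(X_min, n_new, k=5, rng=rng):
    """Plain SMOTE-style interpolation between minority neighbours."""
    samples = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # k nearest minority neighbours of point i (brute force).
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]
        j = rng.choice(nn)
        gap = rng.random()
        samples.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(samples)

def majority_aware_filter(candidates, X_maj, X_min):
    """Keep only synthetic points closer to the minority class than to the
    majority class -- a crude proxy for using majority-class information."""
    d_maj = np.min(np.linalg.norm(candidates[:, None] - X_maj[None], axis=2), axis=1)
    d_min = np.min(np.linalg.norm(candidates[:, None] - X_min[None], axis=2), axis=1)
    return candidates[d_min < d_maj]

cand = smote_like(X_min, n_new=180)
kept = majority_aware_filter(cand, X_maj, X_min)
print(f"generated {len(cand)} candidates, kept {len(kept)} after filtering")
```

Plain SMOTE, which never looks at `X_maj`, would keep all 180 candidates, including any that fall inside majority regions; the thesis develops resampling strategies that account for the majority distribution natively rather than via such a post-hoc filter.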
Related papers
- Improving SMOTE via Fusing Conditional VAE for Data-adaptive Noise Filtering [0.5735035463793009]
We introduce a framework to enhance the SMOTE algorithm using Variational Autoencoders (VAE).
Our approach systematically quantifies the density of data points in a low-dimensional latent space using the VAE, simultaneously incorporating information on class labels and classification difficulty.
Empirical studies on several imbalanced datasets show that this simple process improves the conventional SMOTE algorithm over deep learning models.
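A minimal sketch of the core idea of interpolating in a learned latent space rather than in input space follows. Training a real VAE is out of scope here, so a fixed random linear map stands in for the trained encoder/decoder; `encode` and `decode` are hypothetical stand-ins, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a trained VAE: a fixed random linear map and its pseudo-inverse.
# In the actual method the encoder/decoder are learned, and latent density
# (plus class labels and difficulty) guides where to oversample.
W = rng.normal(size=(10, 4))               # data dim 10 -> latent dim 4
encode = lambda X: X @ W                    # hypothetical encoder
decode = lambda Z: Z @ np.linalg.pinv(W)    # hypothetical decoder

X_min = rng.normal(loc=1.0, size=(30, 10))  # toy minority samples
Z = encode(X_min)

# SMOTE-style interpolation performed in latent space, not input space.
i, j = rng.integers(len(Z), size=(2, 50))
gap = rng.random((50, 1))
Z_new = Z[i] + gap * (Z[j] - Z[i])
X_new = decode(Z_new)
print(X_new.shape)  # (50, 10) synthetic minority samples
```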
arXiv Detail & Related papers (2024-05-30T07:06:02Z)
- Few-shot learning for COVID-19 Chest X-Ray Classification with Imbalanced Data: An Inter vs. Intra Domain Study [49.5374512525016]
Medical image datasets are essential for training models used in computer-aided diagnosis, treatment planning, and medical research.
Some challenges are associated with these datasets, including variability in data distribution, data scarcity, and transfer learning issues when using models pre-trained on generic images.
We propose a methodology based on Siamese neural networks in which a series of techniques are integrated to mitigate the effects of data scarcity and distribution imbalance.
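As a rough sketch of the Siamese building block (not the paper's architecture), the PyTorch snippet below embeds both elements of a pair with a shared network and applies a standard contrastive loss; all dimensions and names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    """Shared embedding network applied to both inputs of a pair."""
    def __init__(self, in_dim=128, emb_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                 nn.Linear(64, emb_dim))
    def forward(self, x):
        return self.net(x)

def contrastive_loss(z1, z2, same, margin=1.0):
    """Pull same-class pairs together, push different-class pairs apart."""
    d = F.pairwise_distance(z1, z2)
    return (same * d.pow(2) + (1 - same) * F.relu(margin - d).pow(2)).mean()

enc = SiameseEncoder()
x1, x2 = torch.randn(16, 128), torch.randn(16, 128)
same = torch.randint(0, 2, (16,)).float()  # 1 if the pair shares a class
loss = contrastive_loss(enc(x1), enc(x2), same)
loss.backward()
```

Pairwise training like this sidesteps some effects of class imbalance, since the unit of training is a pair relation rather than a per-class example count.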
arXiv Detail & Related papers (2024-01-18T16:59:27Z)
- A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity [84.6421260559093]
This study is the largest set of experiments to validate, quantify, and expose undocumented intuitions about text pretraining.
Our findings indicate there does not exist a one-size-fits-all solution to filtering training data.
arXiv Detail & Related papers (2023-05-22T15:57:53Z)
- Is augmentation effective to improve prediction in imbalanced text datasets? [3.1690891866882236]
We argue that adjusting the cutoffs without data augmentation can produce similar results to oversampling techniques.
Our findings contribute to a better understanding of the strengths and limitations of different approaches to dealing with imbalanced data.
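A minimal sketch of the cutoff-adjustment idea, using scikit-learn on synthetic tabular data rather than text: the decision threshold of a plain classifier is swept on held-out data instead of resampling the training set. Dataset and parameters are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Toy ~9:1 imbalanced problem.
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# Sweep cutoffs instead of resampling the training data.
# (In practice this sweep belongs on a separate validation split.)
cutoffs = np.linspace(0.05, 0.95, 19)
scores = [f1_score(y_te, proba >= c) for c in cutoffs]
best = cutoffs[int(np.argmax(scores))]
print(f"default 0.5 -> F1={f1_score(y_te, proba >= 0.5):.3f}, "
      f"best cutoff {best:.2f} -> F1={max(scores):.3f}")
```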
arXiv Detail & Related papers (2023-04-20T13:07:31Z)
- On-the-fly Denoising for Data Augmentation in Natural Language Understanding [101.46848743193358]
We propose an on-the-fly denoising technique for data augmentation that learns from soft augmented labels provided by an organic teacher model trained on the cleaner original data.
Our method can be applied to general augmentation techniques and consistently improve the performance on both text classification and question-answering tasks.
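A hedged sketch of the soft-label idea (not the paper's full method): a teacher model scores augmented inputs, and the student is trained against those soft distributions with a KL objective. Both models here are trivial stand-ins.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Teacher trained on cleaner original data supplies soft labels for augmented
# examples; the student learns from those distributions instead of hard
# (possibly noisy) augmented labels. Both models are hypothetical stand-ins.
teacher = nn.Linear(64, 4)
student = nn.Linear(64, 4)

x_aug = torch.randn(8, 64)                      # augmented inputs
with torch.no_grad():
    soft = F.softmax(teacher(x_aug), dim=-1)    # soft "denoised" labels

log_p = F.log_softmax(student(x_aug), dim=-1)
loss = F.kl_div(log_p, soft, reduction="batchmean")
loss.backward()
```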
arXiv Detail & Related papers (2022-12-20T18:58:33Z)
- Systematic Evaluation of Predictive Fairness [60.0947291284978]
Mitigating bias in training on biased datasets is an important open problem.
We examine the performance of various debiasing methods across multiple tasks.
We find that data conditions have a strong influence on relative model performance.
arXiv Detail & Related papers (2022-10-17T05:40:13Z)
- An Empirical Analysis of the Efficacy of Different Sampling Techniques for Imbalanced Classification [0.0]
The prevalence of imbalance in real-world datasets has led to the creation of a multitude of strategies for the class imbalance issue.
Standard classification algorithms tend to perform poorly when trained on imbalanced data.
We present a comprehensive analysis of 26 popular sampling techniques to understand their effectiveness in dealing with imbalanced data.
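For flavor, here is a small comparison along the same lines using the imbalanced-learn library, pitting a few common samplers against each other on a synthetic 95:5 problem; the paper's study covers 26 techniques on real datasets.

```python
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

samplers = {"none": None,
            "random-over": RandomOverSampler(random_state=0),
            "random-under": RandomUnderSampler(random_state=0),
            "SMOTE": SMOTE(random_state=0)}

for name, s in samplers.items():
    # Resample only the training split, then fit a plain classifier.
    Xr, yr = (X_tr, y_tr) if s is None else s.fit_resample(X_tr, y_tr)
    clf = LogisticRegression(max_iter=1000).fit(Xr, yr)
    print(f"{name:12s} balanced acc = "
          f"{balanced_accuracy_score(y_te, clf.predict(X_te)):.3f}")
```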
arXiv Detail & Related papers (2022-08-25T03:45:34Z)
- Imbalanced Classification via Explicit Gradient Learning From Augmented Data [0.0]
We propose a novel deep meta-learning technique to augment a given imbalanced dataset with new minority instances.
The advantage of the proposed method is demonstrated on synthetic and real-world datasets with various imbalance ratios.
arXiv Detail & Related papers (2022-02-21T22:16:50Z)
- CMW-Net: Learning a Class-Aware Sample Weighting Mapping for Robust Deep Learning [55.733193075728096]
Modern deep neural networks can easily overfit to biased training data containing corrupted labels or class imbalance.
Sample re-weighting methods are popularly used to alleviate this data bias issue.
We propose a meta-model capable of adaptively learning an explicit weighting scheme directly from data.
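For contrast with the learned weighting scheme proposed in the paper, the sketch below shows the much simpler static baseline of inverse-frequency class weighting in scikit-learn; CMW-Net instead learns the weighting mapping from data with a meta-model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)

# Static inverse-frequency weights, the classical hand-designed scheme.
w = compute_class_weight("balanced", classes=np.unique(y), y=y)
print(dict(zip(np.unique(y), w.round(2))))  # minority class weighted up

# The same scheme applied inside the classifier's loss.
clf = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)
```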
arXiv Detail & Related papers (2022-02-11T13:49:51Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- Combined Cleaning and Resampling Algorithm for Multi-Class Imbalanced Data with Label Noise [11.868507571027626]
In this paper, we propose a novel oversampling technique, the Multi-Class Combined Cleaning and Resampling (MC-CCR) algorithm.
The proposed method utilizes an energy-based approach to model the regions suitable for oversampling, making it less affected by small disjuncts and outliers than SMOTE.
It combines this with a simultaneous cleaning operation aimed at reducing the effect of overlapping class distributions on the performance of the learning algorithms.
arXiv Detail & Related papers (2020-04-07T13:59:35Z)
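To give a feel for the energy-based idea in the last entry, here is a greatly simplified, hypothetical sketch: a sphere grows around each minority point until an energy budget is exhausted (expansion gets costlier as more majority points fall inside), majority points inside any sphere are removed (the published algorithm translates them outward instead), and synthetic minority samples are drawn inside the spheres. Everything here is an illustrative approximation, not the published MC-CCR.

```python
import numpy as np

rng = np.random.default_rng(2)
X_maj = rng.normal(0.0, 1.0, size=(200, 2))
X_min = rng.normal(1.5, 0.5, size=(15, 2))

energy = 0.5  # per-point budget spent on expanding the cleaning sphere

def sphere_radius(x, X_maj, energy):
    """Grow a sphere around minority point x; each enclosed majority point
    makes further expansion cost more (a crude stand-in for the energy model)."""
    d = np.sort(np.linalg.norm(X_maj - x, axis=1))
    r, left, cost = 0.0, energy, 1.0
    for di in d:
        if left < (di - r) * cost:
            return r + left / cost
        left -= (di - r) * cost
        r, cost = di, cost + 1.0
    return r + left / cost

radii = np.array([sphere_radius(x, X_maj, energy) for x in X_min])

# Cleaning: drop majority points inside any sphere (the real method moves them).
inside = np.array([np.any(np.linalg.norm(X_min - m, axis=1) < radii)
                   for m in X_maj])
X_maj_clean = X_maj[~inside]

# Oversampling: draw random synthetic minority points inside the spheres.
idx = rng.integers(len(X_min), size=100)
offsets = rng.normal(size=(100, 2))
offsets *= (rng.random((100, 1)) * radii[idx, None] /
            np.linalg.norm(offsets, axis=1, keepdims=True))
X_syn = X_min[idx] + offsets
print(len(X_maj) - len(X_maj_clean), "majority points cleaned,",
      len(X_syn), "synthesized")
```

Because radii shrink where majority points crowd a minority observation, difficult or noisy minority points get small spheres, which limits how far synthetic samples can stray into majority territory.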
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.