Benchmark of Data Preprocessing Methods for Imbalanced Classification
- URL: http://arxiv.org/abs/2303.03094v1
- Date: Mon, 6 Mar 2023 13:12:43 GMT
- Title: Benchmark of Data Preprocessing Methods for Imbalanced Classification
- Authors: Radovan Haluška, Jan Brabec and Tomáš Komárek
- Abstract summary: Severe class imbalance is one of the main conditions that make machine learning in cybersecurity difficult.
This paper presents a benchmark of 16 preprocessing methods on six cybersecurity datasets together with 17 public imbalanced datasets from other domains.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Severe class imbalance is one of the main conditions that make machine
learning in cybersecurity difficult. A variety of dataset preprocessing methods
have been introduced over the years. These methods modify the training dataset
by oversampling, undersampling or a combination of both to improve the
predictive performance of classifiers trained on this dataset. Although these
methods are occasionally used in cybersecurity, a comprehensive, unbiased
benchmark comparing their performance across a variety of cybersecurity
problems is missing. This paper presents a benchmark of 16 preprocessing
methods on six
cybersecurity datasets together with 17 public imbalanced datasets from other
domains. We test the methods under multiple hyperparameter configurations and
use an AutoML system to train classifiers on the preprocessed datasets, which
reduces potential bias from specific hyperparameter or classifier choices.
Special consideration is also given to evaluating the methods using appropriate
performance measures that are good proxies for practical performance in
real-world cybersecurity systems. The main findings of our study are: 1) Most
of the time, a data preprocessing method that improves classification
performance exists. 2) The baseline approach of doing nothing outperformed a
large portion of the methods in the benchmark. 3) Oversampling methods
generally outperform undersampling methods. 4) The most significant
performance gains come from the standard SMOTE algorithm; more complicated
methods provide mainly incremental improvements, often at the cost of worse
computational performance.
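To make findings 2) and 4) concrete, here is a minimal sketch comparing the do-nothing baseline against standard SMOTE oversampling, scored with average precision as an imbalance-appropriate measure. It assumes scikit-learn and imbalanced-learn and a synthetic dataset; it is an illustration, not the paper's AutoML benchmark pipeline.

```python
# Minimal sketch of findings 2) and 4): compare the do-nothing baseline with
# standard SMOTE oversampling, scored with average precision (a PR-curve
# summary better suited to severe imbalance than accuracy). Synthetic data and
# model choice are illustrative; this is NOT the paper's AutoML pipeline.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Severely imbalanced stand-in dataset: roughly 1% positives.
X, y = make_classification(n_samples=20_000, n_features=20,
                           weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

variants = {
    "baseline (no preprocessing)": (X_tr, y_tr),
    "SMOTE oversampling": SMOTE(random_state=0).fit_resample(X_tr, y_tr),
}
for name, (X_fit, y_fit) in variants.items():
    clf = RandomForestClassifier(random_state=0).fit(X_fit, y_fit)
    ap = average_precision_score(y_te, clf.predict_proba(X_te)[:, 1])
    print(f"{name}: average precision = {ap:.3f}")
```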
Related papers
- Characterizing the Optimal 0-1 Loss for Multi-class Classification with a Test-time Attacker [57.49330031751386]
We find achievable information-theoretic lower bounds on loss in the presence of a test-time attacker for multi-class classifiers on any discrete dataset.
We provide a general framework for finding the optimal 0-1 loss that revolves around the construction of a conflict hypergraph from the data and adversarial constraints.
arXiv Detail & Related papers (2023-02-21T15:17:13Z)
- Revisiting Long-tailed Image Classification: Survey and Benchmarks with New Evaluation Metrics [88.39382177059747]
A corpus of metrics is designed for measuring the accuracy, robustness, and bounds of algorithms for learning with long-tailed distributions.
Based on our benchmarks, we re-evaluate the performance of existing methods on CIFAR10 and CIFAR100 datasets.
arXiv Detail & Related papers (2023-02-03T02:40:54Z)
- Fraud Detection Using Optimized Machine Learning Tools Under Imbalance Classes [0.304585143845864]
Fraud detection with optimized machine learning (ML) tools is essential to ensure safety.
We investigate four state-of-the-art ML techniques, namely, logistic regression, decision trees, random forest, and extreme gradient boosting.
For phishing website URLs and credit card fraud transaction datasets, the results indicate that extreme gradient boost trained on the original data shows trustworthy performance.
arXiv Detail & Related papers (2022-09-04T15:30:23Z)
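As a hedged sketch of the takeaway above, one can train extreme gradient boosting directly on the original, non-resampled data. This assumes the xgboost package; the scale_pos_weight reweighting shown here is an illustrative choice, not necessarily the cited paper's configuration.

```python
# Hedged sketch: extreme gradient boosting trained on the original
# (non-resampled) imbalanced data. The scale_pos_weight reweighting is an
# illustrative choice, not necessarily the cited paper's setup.
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=10_000, weights=[0.98, 0.02], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# Upweight the rare positive class by the negative-to-positive ratio.
ratio = float((y_tr == 0).sum() / (y_tr == 1).sum())
clf = XGBClassifier(scale_pos_weight=ratio, eval_metric="aucpr")
clf.fit(X_tr, y_tr)
print(average_precision_score(y_te, clf.predict_proba(X_te)[:, 1]))
```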
- Continual Learning For On-Device Environmental Sound Classification [63.81276321857279]
We propose a simple and efficient continual learning method for on-device environmental sound classification.
Our method selects the historical data for the training by measuring the per-sample classification uncertainty.
arXiv Detail & Related papers (2022-07-15T12:13:04Z)
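The core idea of the entry above can be sketched as ranking historical samples by per-sample classification uncertainty and keeping the most uncertain ones for rehearsal. The predictive-entropy criterion here is an assumption for illustration; the paper's exact selection rule may differ.

```python
# Hedged sketch of uncertainty-based selection of historical data: keep the
# samples on which the current model is least certain (highest predictive
# entropy). The entropy criterion is an illustrative assumption, not
# necessarily the cited paper's exact rule.
import numpy as np

def select_uncertain(probs: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k samples with highest predictive entropy.

    probs: (n_samples, n_classes) softmax outputs of the current model.
    """
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(entropy)[-k:]  # indices of the most uncertain samples

# Example: keep the 2 most ambiguous of 4 historical samples.
p = np.array([[0.98, 0.02], [0.55, 0.45], [0.70, 0.30], [0.51, 0.49]])
print(select_uncertain(p, k=2))  # -> indices of the near-uniform rows
```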
- Distributed Dynamic Safe Screening Algorithms for Sparse Regularization [73.85961005970222]
We propose a new distributed dynamic safe screening (DDSS) method for sparsity regularized models and apply it on shared-memory and distributed-memory architectures, respectively.
We prove that the proposed method achieves the linear convergence rate with lower overall complexity and can eliminate almost all the inactive features in a finite number of iterations almost surely.
arXiv Detail & Related papers (2022-04-23T02:45:55Z)
- Solving the Class Imbalance Problem Using a Counterfactual Method for Data Augmentation [4.454557728745761]
Learning from class imbalanced datasets poses challenges for machine learning algorithms.
We advance a novel data augmentation method (adapted from eXplainable AI) that generates synthetic, counterfactual instances in the minority class.
Several experiments using four different classifiers and 25 datasets are reported, which show that this Counterfactual Augmentation method (CFA) generates useful synthetic data points in the minority class.
arXiv Detail & Related papers (2021-11-05T14:14:06Z)
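A loose sketch of counterfactual-style minority-class augmentation follows: a new synthetic point is built by recombining feature values from a minority instance and its nearest minority neighbour. This conveys only the general flavour of the idea; the cited CFA method, adapted from explainable-AI counterfactual generation, differs in its details.

```python
# Loose sketch of counterfactual-style augmentation in the minority class:
# recombine feature values from a minority instance and its nearest minority
# neighbour. Illustrative only; NOT the cited paper's exact CFA procedure.
import numpy as np

def augment_minority(X_min: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Create one synthetic minority sample per existing minority sample."""
    synthetic = X_min.copy()
    for i, x in enumerate(X_min):
        # Nearest minority neighbour (excluding the point itself).
        d = np.linalg.norm(X_min - x, axis=1)
        d[i] = np.inf
        neighbour = X_min[np.argmin(d)]
        # Copy a random subset of feature values from the neighbour.
        mask = rng.random(x.shape[0]) < 0.5
        synthetic[i, mask] = neighbour[mask]
    return synthetic

rng = np.random.default_rng(0)
X_min = rng.normal(size=(10, 5))   # toy minority-class samples
X_aug = np.vstack([X_min, augment_minority(X_min, rng)])
print(X_aug.shape)                 # (20, 5)
```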
- Semantic Perturbations with Normalizing Flows for Improved Generalization [62.998818375912506]
We show that perturbations in the latent space can be used to define fully unsupervised data augmentations.
We find that our latent adversarial perturbations adaptive to the classifier throughout its training are most effective.
arXiv Detail & Related papers (2021-08-18T03:20:00Z)
- Hybrid Ensemble optimized algorithm based on Genetic Programming for imbalanced data classification [0.0]
We propose a hybrid ensemble algorithm based on Genetic Programming (GP) for two-class imbalanced data classification.
Experimental results on the specified datasets show that, at training-set sizes of 40% and 50%, the proposed method predicts the minority class with better accuracy than the alternatives.
arXiv Detail & Related papers (2021-06-02T14:14:38Z)
- Does imputation matter? Benchmark for predictive models [5.802346990263708]
This paper systematically evaluates the empirical effectiveness of data imputation algorithms for predictive models.
The main contributions include (1) the recommendation of a general method for empirical benchmarking based on real-life classification tasks.
arXiv Detail & Related papers (2020-07-06T15:47:36Z)
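In the spirit of the benchmark above, imputation strategies can be compared by the downstream predictive performance they yield. The dataset, missingness pattern, and classifier below are illustrative placeholders, not the cited paper's protocol.

```python
# Minimal sketch: compare imputation strategies by the downstream
# cross-validated performance they yield. Dataset, missingness pattern, and
# classifier are illustrative placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=2_000, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan  # knock out 10% of values at random

for strategy in ("mean", "median", "most_frequent"):
    pipe = make_pipeline(SimpleImputer(strategy=strategy),
                         LogisticRegression(max_iter=1000))
    score = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"{strategy}: {score:.3f}")
```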
- Towards Model-Agnostic Post-Hoc Adjustment for Balancing Ranking Fairness and Algorithm Utility [54.179859639868646]
Bipartite ranking aims to learn a scoring function that ranks positive individuals higher than negative ones from labeled data.
There have been rising concerns on whether the learned scoring function can cause systematic disparity across different protected groups.
We propose a model post-processing framework for balancing ranking fairness and algorithm utility in the bipartite ranking scenario.
arXiv Detail & Related papers (2020-06-15T10:08:39Z)
- Stacked Generalizations in Imbalanced Fraud Data Sets using Resampling Methods [2.741266294612776]
This study uses stacked generalization, a two-step process that combines machine learning methods through so-called meta or super learners, to improve the performance of algorithms.
Building a test harness that accounts for all permutations of algorithm and sample-set pairs ensures that the complex, intrinsic data structures are thoroughly tested.
arXiv Detail & Related papers (2020-04-03T20:38:22Z)
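A minimal sketch of stacked generalization over a resampled training set follows: base learners feed a logistic-regression meta-learner via scikit-learn's StackingClassifier, trained on oversampled data. The choices of learners and sampler are illustrative, not the cited study's exact test harness.

```python
# Hedged sketch of stacked generalization with resampling: base learners feed
# a logistic-regression meta (super) learner, trained on oversampled data.
# Learner and sampler choices are illustrative, not the cited study's harness.
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X_tr, y_tr)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("dt", DecisionTreeClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),  # the meta/super learner
)
print(stack.fit(X_res, y_res).score(X_te, y_te))
```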