Benchmark of Data Preprocessing Methods for Imbalanced Classification
- URL: http://arxiv.org/abs/2303.03094v1
- Date: Mon, 6 Mar 2023 13:12:43 GMT
- Title: Benchmark of Data Preprocessing Methods for Imbalanced Classification
- Authors: Radovan Haluška, Jan Brabec and Tomáš Komárek
- Abstract summary: Severe class imbalance is one of the main conditions that make machine learning in cybersecurity difficult.
This paper presents a benchmark of 16 preprocessing methods on six cybersecurity datasets together with 17 public imbalanced datasets from other domains.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Severe class imbalance is one of the main conditions that make machine
learning in cybersecurity difficult. A variety of dataset preprocessing methods
have been introduced over the years. These methods modify the training dataset
by oversampling, undersampling or a combination of both to improve the
predictive performance of classifiers trained on this dataset. Although these
methods are occasionally used in cybersecurity, a comprehensive, unbiased
benchmark comparing their performance across a variety of cybersecurity
problems is missing. This paper presents a benchmark of 16 preprocessing
methods on six
cybersecurity datasets together with 17 public imbalanced datasets from other
domains. We test the methods under multiple hyperparameter configurations and
use an AutoML system to train classifiers on the preprocessed datasets, which
reduces potential bias from specific hyperparameter or classifier choices.
Special consideration is also given to evaluating the methods using appropriate
performance measures that are good proxies for practical performance in
real-world cybersecurity systems. The main findings of our study are: 1) Most
of the time, a data preprocessing method that improves classification
performance exists. 2) The baseline approach of doing nothing outperformed a
large portion of the methods in the benchmark. 3) Oversampling methods
generally outperform undersampling methods. 4) The most significant
performance gains come from the standard SMOTE algorithm; more complicated
methods provide mainly incremental improvements, often at the cost of worse
computational performance.
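To make findings 2) and 4) concrete, here is a minimal sketch comparing the do-nothing baseline against standard SMOTE oversampling, scored with average precision as an imbalance-appropriate measure. It assumes scikit-learn and imbalanced-learn and a synthetic dataset; it is an illustration, not the paper's AutoML benchmark pipeline.

```python
# Minimal sketch of findings 2) and 4): compare the do-nothing baseline with
# standard SMOTE oversampling, scored with average precision (a PR-curve
# summary better suited to severe imbalance than accuracy). Synthetic data and
# model choice are illustrative; this is NOT the paper's AutoML pipeline.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Severely imbalanced stand-in dataset: roughly 1% positives.
X, y = make_classification(n_samples=20_000, n_features=20,
                           weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

variants = {
    "baseline (no preprocessing)": (X_tr, y_tr),
    "SMOTE oversampling": SMOTE(random_state=0).fit_resample(X_tr, y_tr),
}
for name, (X_fit, y_fit) in variants.items():
    clf = RandomForestClassifier(random_state=0).fit(X_fit, y_fit)
    ap = average_precision_score(y_te, clf.predict_proba(X_te)[:, 1])
    print(f"{name}: average precision = {ap:.3f}")
```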
Related papers
- Characterizing the Optimal 0-1 Loss for Multi-class Classification with a Test-time Attacker [57.49330031751386]
We find achievable information-theoretic lower bounds on loss in the presence of a test-time attacker for multi-class classifiers on any discrete dataset.
We provide a general framework for finding the optimal 0-1 loss that revolves around the construction of a conflict hypergraph from the data and adversarial constraints.
arXiv Detail & Related papers (2023-02-21T15:17:13Z)
- Revisiting Long-tailed Image Classification: Survey and Benchmarks with New Evaluation Metrics [88.39382177059747]
A corpus of metrics is designed for measuring the accuracy, robustness, and bounds of algorithms for learning with long-tailed distributions.
Based on our benchmarks, we re-evaluate the performance of existing methods on CIFAR10 and CIFAR100 datasets.
arXiv Detail & Related papers (2023-02-03T02:40:54Z)
- Fraud Detection Using Optimized Machine Learning Tools Under Imbalance Classes [0.304585143845864]
Fraud detection with optimized machine learning (ML) tools is essential to ensure safety.
We investigate four state-of-the-art ML techniques, namely, logistic regression, decision trees, random forest, and extreme gradient boosting.
For phishing website URLs and credit card fraud transaction datasets, the results indicate that extreme gradient boost trained on the original data shows trustworthy performance.
arXiv Detail & Related papers (2022-09-04T15:30:23Z)
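As a hedged sketch of the takeaway above, one can train extreme gradient boosting directly on the original, non-resampled data. This assumes the xgboost package; the scale_pos_weight reweighting shown here is an illustrative choice, not necessarily the cited paper's configuration.

```python
# Hedged sketch: extreme gradient boosting trained on the original
# (non-resampled) imbalanced data. The scale_pos_weight reweighting is an
# illustrative choice, not necessarily the cited paper's setup.
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=10_000, weights=[0.98, 0.02], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# Upweight the rare positive class by the negative-to-positive ratio.
ratio = float((y_tr == 0).sum() / (y_tr == 1).sum())
clf = XGBClassifier(scale_pos_weight=ratio, eval_metric="aucpr")
clf.fit(X_tr, y_tr)
print(average_precision_score(y_te, clf.predict_proba(X_te)[:, 1]))
```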
- Continual Learning For On-Device Environmental Sound Classification [63.81276321857279]
We propose a simple and efficient continual learning method for on-device environmental sound classification.
Our method selects the historical data for the training by measuring the per-sample classification uncertainty.
arXiv Detail & Related papers (2022-07-15T12:13:04Z)
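The core idea of the entry above can be sketched as ranking historical samples by per-sample classification uncertainty and keeping the most uncertain ones for rehearsal. The predictive-entropy criterion here is an assumption for illustration; the paper's exact selection rule may differ.

```python
# Hedged sketch of uncertainty-based selection of historical data: keep the
# samples on which the current model is least certain (highest predictive
# entropy). The entropy criterion is an illustrative assumption, not
# necessarily the cited paper's exact rule.
import numpy as np

def select_uncertain(probs: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k samples with highest predictive entropy.

    probs: (n_samples, n_classes) softmax outputs of the current model.
    """
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(entropy)[-k:]  # indices of the most uncertain samples

# Example: keep the 2 most ambiguous of 4 historical samples.
p = np.array([[0.98, 0.02], [0.55, 0.45], [0.70, 0.30], [0.51, 0.49]])
print(select_uncertain(p, k=2))  # -> indices of the near-uniform rows
```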
- Distributed Dynamic Safe Screening Algorithms for Sparse Regularization [73.85961005970222]
We propose a new distributed dynamic safe screening (DDSS) method for sparsity regularized models and apply it on shared-memory and distributed-memory architectures, respectively.
We prove that the proposed method achieves the linear convergence rate with lower overall complexity and can eliminate almost all the inactive features in a finite number of iterations almost surely.
arXiv Detail & Related papers (2022-04-23T02:45:55Z)
- Solving the Class Imbalance Problem Using a Counterfactual Method for Data Augmentation [4.454557728745761]
Learning from class imbalanced datasets poses challenges for machine learning algorithms.
We advance a novel data augmentation method (adapted from eXplainable AI) that generates synthetic, counterfactual instances in the minority class.
Several experiments using four different classifiers and 25 datasets are reported, which show that this Counterfactual Augmentation method (CFA) generates useful synthetic data points in the minority class.
arXiv Detail & Related papers (2021-11-05T14:14:06Z)
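A loose sketch of counterfactual-style minority-class augmentation follows: a new synthetic point is built by recombining feature values from a minority instance and its nearest minority neighbour. This conveys only the general flavour of the idea; the cited CFA method, adapted from explainable-AI counterfactual generation, differs in its details.

```python
# Loose sketch of counterfactual-style augmentation in the minority class:
# recombine feature values from a minority instance and its nearest minority
# neighbour. Illustrative only; NOT the cited paper's exact CFA procedure.
import numpy as np

def augment_minority(X_min: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Create one synthetic minority sample per existing minority sample."""
    synthetic = X_min.copy()
    for i, x in enumerate(X_min):
        # Nearest minority neighbour (excluding the point itself).
        d = np.linalg.norm(X_min - x, axis=1)
        d[i] = np.inf
        neighbour = X_min[np.argmin(d)]
        # Copy a random subset of feature values from the neighbour.
        mask = rng.random(x.shape[0]) < 0.5
        synthetic[i, mask] = neighbour[mask]
    return synthetic

rng = np.random.default_rng(0)
X_min = rng.normal(size=(10, 5))   # toy minority-class samples
X_aug = np.vstack([X_min, augment_minority(X_min, rng)])
print(X_aug.shape)                 # (20, 5)
```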
- Semantic Perturbations with Normalizing Flows for Improved Generalization [62.998818375912506]
We show that perturbations in the latent space can be used to define fully unsupervised data augmentations.
We find that our latent adversarial perturbations adaptive to the classifier throughout its training are most effective.
arXiv Detail & Related papers (2021-08-18T03:20:00Z)
- Hybrid Ensemble optimized algorithm based on Genetic Programming for imbalanced data classification [0.0]
We propose a hybrid ensemble algorithm based on Genetic Programming (GP) for two-class imbalanced data classification.
Experimental results on the specified datasets show that, at training-set sizes of 40% and 50%, the proposed method predicts the minority class with better accuracy than the alternatives.
arXiv Detail & Related papers (2021-06-02T14:14:38Z)
- Does imputation matter? Benchmark for predictive models [5.802346990263708]
This paper systematically evaluates the empirical effectiveness of data imputation algorithms for predictive models.
The main contributions include (1) the recommendation of a general method for empirical benchmarking based on real-life classification tasks.
arXiv Detail & Related papers (2020-07-06T15:47:36Z)
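In the spirit of the benchmark above, imputation strategies can be compared by the downstream predictive performance they yield. The dataset, missingness pattern, and classifier below are illustrative placeholders, not the cited paper's protocol.

```python
# Minimal sketch: compare imputation strategies by the downstream
# cross-validated performance they yield. Dataset, missingness pattern, and
# classifier are illustrative placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=2_000, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan  # knock out 10% of values at random

for strategy in ("mean", "median", "most_frequent"):
    pipe = make_pipeline(SimpleImputer(strategy=strategy),
                         LogisticRegression(max_iter=1000))
    score = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"{strategy}: {score:.3f}")
```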
- Towards Model-Agnostic Post-Hoc Adjustment for Balancing Ranking Fairness and Algorithm Utility [54.179859639868646]
Bipartite ranking aims to learn a scoring function that ranks positive individuals higher than negative ones from labeled data.
There have been rising concerns on whether the learned scoring function can cause systematic disparity across different protected groups.
We propose a model post-processing framework for balancing ranking fairness and algorithm utility in the bipartite ranking scenario.
arXiv Detail & Related papers (2020-06-15T10:08:39Z)
- Stacked Generalizations in Imbalanced Fraud Data Sets using Resampling Methods [2.741266294612776]
This study uses stacked generalization, a two-step process that combines machine learning methods through so-called meta or super learners, to improve the performance of algorithms.
Building a test harness that accounts for all permutations of algorithm and sample-set pairs ensures that the complex, intrinsic data structures are thoroughly tested.
arXiv Detail & Related papers (2020-04-03T20:38:22Z)
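A minimal sketch of stacked generalization over a resampled training set follows: base learners feed a logistic-regression meta-learner via scikit-learn's StackingClassifier, trained on oversampled data. The choices of learners and sampler are illustrative, not the cited study's exact test harness.

```python
# Hedged sketch of stacked generalization with resampling: base learners feed
# a logistic-regression meta (super) learner, trained on oversampled data.
# Learner and sampler choices are illustrative, not the cited study's harness.
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X_tr, y_tr)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("dt", DecisionTreeClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),  # the meta/super learner
)
print(stack.fit(X_res, y_res).score(X_te, y_te))
```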