Effect of Balancing Data Using Synthetic Data on the Performance of
Machine Learning Classifiers for Intrusion Detection in Computer Networks
- URL: http://arxiv.org/abs/2204.00144v1
- Date: Fri, 1 Apr 2022 00:25:11 GMT
- Title: Effect of Balancing Data Using Synthetic Data on the Performance of
Machine Learning Classifiers for Intrusion Detection in Computer Networks
- Authors: Ayesha S. Dina and A. B. Siddique and D. Manivannan
- Abstract summary: Researchers in academia and industry used machine learning (ML) techniques to design and implement Intrusion Detection Systems (IDSes) for computer networks.
In many of the datasets used in such systems, data are imbalanced (i.e., not all classes have an equal number of samples).
We show that training ML models on a dataset balanced with synthetic samples generated by CTGAN increased prediction accuracy by up to $8\%$.
- Score: 3.233545237942899
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Attacks on computer networks have increased significantly in recent
years, due in part to the availability of sophisticated tools for launching such
attacks as well as a thriving underground cyber-crime economy that supports
them. Over the past several years, researchers in academia and industry have
used machine learning (ML) techniques to design and implement Intrusion
Detection Systems (IDSes) for computer networks. Many of these researchers used
datasets collected by various organizations to train ML models for predicting
intrusions. In many of the datasets used in such systems, data are imbalanced
(i.e., not all classes have an equal number of samples). With imbalanced data,
ML algorithms may produce unsatisfactory classifiers, which affects the accuracy
of intrusion prediction. Traditionally, researchers have used over-sampling and
under-sampling to balance the data in such datasets and overcome this problem.
In this work, in addition to over-sampling, we also use a synthetic data
generation method called Conditional Tabular GAN (CTGAN) to balance the data and
study the effect on various ML classifiers. To the best of our knowledge, no one
else has used CTGAN to generate synthetic samples to balance intrusion detection
datasets. Based on extensive experiments using the widely used NSL-KDD dataset,
we found that training ML models on a dataset balanced with synthetic samples
generated by CTGAN increased prediction accuracy by up to $8\%$, compared to
training the same ML models on unbalanced data. Our experiments also show that
the accuracy of some ML models trained on data balanced with random
over-sampling declines compared to that of the same ML models trained on
unbalanced data.
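To make the balancing step concrete, below is a minimal sketch of how minority classes in an NSL-KDD-style table could be topped up with CTGAN-generated rows, with random over-sampling as the baseline the abstract compares against. It assumes the open-source `ctgan`, `imbalanced-learn`, and `pandas` packages; the helper function, column names, and epoch count are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch (not the authors' exact setup): balance an intrusion-detection
# dataset by generating synthetic minority-class rows with CTGAN.
import pandas as pd
from ctgan import CTGAN


def balance_with_ctgan(train_df: pd.DataFrame, label_col: str,
                       discrete_cols: list, epochs: int = 300) -> pd.DataFrame:
    """Fit one CTGAN per minority class and sample enough synthetic rows
    for every class to match the majority-class count."""
    counts = train_df[label_col].value_counts()
    majority = counts.max()
    parts = [train_df]
    for cls, n in counts.items():
        if n == majority:
            continue
        real = train_df[train_df[label_col] == cls].drop(columns=[label_col])
        gan = CTGAN(epochs=epochs)
        gan.fit(real, discrete_columns=[c for c in discrete_cols if c != label_col])
        synthetic = gan.sample(majority - n)
        synthetic[label_col] = cls
        parts.append(synthetic)
    return pd.concat(parts, ignore_index=True)


# Usage (hypothetical column names for an NSL-KDD-style frame):
# balanced = balance_with_ctgan(train_df, "label",
#                               ["protocol_type", "service", "flag"])
#
# Baseline for comparison -- plain random over-sampling:
# from imblearn.over_sampling import RandomOverSampler
# X_ros, y_ros = RandomOverSampler(random_state=0).fit_resample(
#     train_df.drop(columns=["label"]), train_df["label"])
```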
Related papers
- An Investigation on Machine Learning Predictive Accuracy Improvement and Uncertainty Reduction using VAE-based Data Augmentation [2.517043342442487]
Deep generative learning uses certain ML models to learn the underlying distribution of existing data and generate synthetic samples that resemble the real data.
In this study, our objective is to evaluate the effectiveness of data augmentation using variational autoencoder (VAE)-based deep generative models.
We investigated whether the data augmentation leads to improved accuracy in the predictions of a deep neural network (DNN) model trained using the augmented data.
arXiv Detail & Related papers (2024-10-24T18:15:48Z)
- Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop.
We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models.
We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
arXiv Detail & Related papers (2024-02-19T02:08:09Z)
- Machine Learning Data Suitability and Performance Testing Using Fault Injection Testing Framework [0.0]
This paper presents the Fault Injection for Undesirable Learning in input Data (FIUL-Data) testing framework.
Data mutators explore vulnerabilities of ML systems against the effects of different fault injections.
This paper evaluates the framework using data from analytical chemistry, comprising retention time measurements of anti-sense oligonucleotides.
arXiv Detail & Related papers (2023-09-20T12:58:35Z)
- Machine Learning Based Missing Values Imputation in Categorical Datasets [2.5611256859404983]
This research looked into the use of machine learning algorithms to fill in the gaps in categorical datasets.
The emphasis was on ensemble models constructed using the Error Correction Output Codes framework.
Despite these encouraging results, deep learning for missing data imputation still faces obstacles, including the requirement for large amounts of labeled data.
arXiv Detail & Related papers (2023-06-10T03:29:48Z)
- Machine Learning Force Fields with Data Cost Aware Training [94.78998399180519]
Machine learning force fields (MLFF) have been proposed to accelerate molecular dynamics (MD) simulation.
Even for the most data-efficient MLFFs, reaching chemical accuracy can require hundreds of frames of force and energy labels.
We propose a multi-stage computational framework -- ASTEROID, which lowers the data cost of MLFFs by leveraging a combination of cheap inaccurate data and expensive accurate data.
arXiv Detail & Related papers (2023-06-05T04:34:54Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Post-training Model Quantization Using GANs for Synthetic Data Generation [57.40733249681334]
We investigate the use of synthetic data as a substitute for the calibration with real data for the quantization method.
We compare the performance of models quantized using data generated by StyleGAN2-ADA and our pre-trained DiStyleGAN, with quantization using real data and an alternative data generation method based on fractal images.
arXiv Detail & Related papers (2023-05-10T11:10:09Z)
- Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
- Effective Class-Imbalance learning based on SMOTE and Convolutional Neural Networks [0.1074267520911262]
Imbalanced Data (ID) is a problem that deters Machine Learning (ML) models from achieving satisfactory results.
In this paper, we investigate the effectiveness of methods based on Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs).
In order to achieve reliable results, we conducted our experiments 100 times with randomly shuffled data distributions.
arXiv Detail & Related papers (2022-09-01T07:42:16Z)
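As a quick illustration of the class-imbalance handling named in the SMOTE entry above, here is a minimal sketch using imbalanced-learn's SMOTE on a synthetic toy dataset; it is an illustration only, not that paper's DNN/CNN experimental setup.

```python
# Minimal SMOTE sketch on a deliberately skewed toy dataset.
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Roughly 95%/5% two-class problem.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

# SMOTE synthesizes new minority samples by interpolating between
# nearest-neighbour minority points until the classes are balanced.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))
```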
- Towards Synthetic Multivariate Time Series Generation for Flare Forecasting [5.098461305284216]
One of the limiting factors in training data-driven, rare-event prediction algorithms is the scarcity of the events of interest.
In this study, we explore the usefulness of the conditional generative adversarial network (CGAN) as a means to perform data-informed oversampling.
arXiv Detail & Related papers (2021-05-16T22:23:23Z)
- Statistical model-based evaluation of neural networks [74.10854783437351]
We develop an experimental setup for the evaluation of neural networks (NNs).
The setup helps to benchmark a set of NNs vis-a-vis minimum-mean-square-error (MMSE) performance bounds.
This allows us to test the effects of training data size, data dimension, data geometry, noise, and mismatch between training and testing conditions.
arXiv Detail & Related papers (2020-11-18T00:33:24Z)
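To illustrate the kind of benchmarking described in the last entry, the following is a small sketch that compares a neural-network estimator against the analytic MMSE of a linear-Gaussian model; the model dimensions and the scikit-learn MLP are illustrative assumptions, not that paper's setup.

```python
# Sketch: benchmark an NN estimator of x from y against the analytic MMSE of
# a linear-Gaussian model y = H x + n, with x ~ N(0, I) and n ~ N(0, sigma^2 I).
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
dim_x, dim_y, sigma2 = 8, 12, 0.5
H = rng.standard_normal((dim_y, dim_x))

def draw(n):
    x = rng.standard_normal((n, dim_x))
    y = x @ H.T + np.sqrt(sigma2) * rng.standard_normal((n, dim_y))
    return x, y

# Analytic per-dimension MMSE: trace(I - K H) / dim_x, with gain
# K = H^T (H H^T + sigma^2 I)^{-1} (covariance of x is the identity).
K = H.T @ np.linalg.inv(H @ H.T + sigma2 * np.eye(dim_y))
mmse = np.trace(np.eye(dim_x) - K @ H) / dim_x

# Train an NN estimator x_hat(y) and measure its per-dimension test MSE.
x_tr, y_tr = draw(20000)
x_te, y_te = draw(5000)
nn = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
nn.fit(y_tr, x_tr)
nn_mse = np.mean((nn.predict(y_te) - x_te) ** 2)

print(f"analytic MMSE per dim: {mmse:.4f}   NN test MSE per dim: {nn_mse:.4f}")
```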