Data Curation and Quality Assurance for Machine Learning-based Cyber
Intrusion Detection
- URL: http://arxiv.org/abs/2105.10041v1
- Date: Thu, 20 May 2021 21:31:46 GMT
- Title: Data Curation and Quality Assurance for Machine Learning-based Cyber
Intrusion Detection
- Authors: Haihua Chen, Ngan Tran, Anand Sagar Thumati, Jay Bhuyan, Junhua Ding
- Abstract summary: This article first summarizes existing machine learning-based intrusion detection systems and the datasets used for building these systems.
The experimental results show that BERT and GPT were the best algorithms for HIDS on all of the datasets.
We then evaluate the data quality of the 11 datasets based on quality dimensions proposed in this paper to determine the best characteristics that a HIDS dataset should possess in order to yield the best possible result.
- Score: 1.0276024900942873
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Intrusion detection is an essential task in the cyber threat environment.
Machine learning and deep learning techniques have been applied for intrusion
detection. However, most of the existing research focuses on the model work but
ignores the fact that poor data quality has a direct impact on the performance
of a machine learning system. More attention should be paid to the data work
when building a machine learning-based intrusion detection system. This article
first summarizes existing machine learning-based intrusion detection systems
and the datasets used for building these systems. Then the data preparation
workflow and quality requirements for intrusion detection are discussed. To
figure out how data and models affect machine learning performance, we
conducted experiments on 11 HIDS datasets using seven machine learning models
and three deep learning models. The experimental results show that BERT and GPT
were the best algorithms for HIDS on all of the datasets. However, the
performance on different datasets varies, indicating the differences between
the data quality of these datasets. We then evaluate the data quality of the 11
datasets based on quality dimensions proposed in this paper to determine the
best characteristics that a HIDS dataset should possess in order to yield the
best possible result. This research initiates a data quality perspective for
researchers and practitioners to improve the performance of machine
learning-based intrusion detection.
Related papers
- Towards Understanding the Impact of Data Bugs on Deep Learning Models in Software Engineering [13.17302533571231]
Deep learning (DL) systems are prone to bugs from many sources, including training data.
Existing literature suggests that bugs in training data are highly prevalent.
We investigate three types of data prevalent in software engineering tasks: code-based, text-based, and metric-based.
arXiv Detail & Related papers (2024-11-19T00:28:20Z)
- On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
- TII-SSRC-23 Dataset: Typological Exploration of Diverse Traffic Patterns for Intrusion Detection [0.5261718469769447]
Existing datasets often fall short, lacking the necessary diversity and alignment with the contemporary network environment.
This paper introduces TII-SSRC-23, a novel and comprehensive dataset designed to overcome these challenges.
arXiv Detail & Related papers (2023-09-14T05:23:36Z)
- Defect Classification in Additive Manufacturing Using CNN-Based Vision Processing [76.72662577101988]
This paper examines two scenarios: first, using convolutional neural networks (CNNs) to accurately classify defects in an image dataset from AM and second, applying active learning techniques to the developed classification model.
This allows the construction of a human-in-the-loop mechanism to reduce the size of the data required to train and generate training data.
arXiv Detail & Related papers (2023-07-14T14:36:58Z)
- ECS -- an Interactive Tool for Data Quality Assurance [63.379471124899915]
We present a novel approach for the assurance of data quality.
For this purpose, the mathematical basics are first discussed and the approach is presented using multiple examples.
This results in the detection of data points with potentially harmful properties for the use in safety-critical systems.
arXiv Detail & Related papers (2023-07-10T06:49:18Z)
- Assessing Dataset Quality Through Decision Tree Characteristics in Autoencoder-Processed Spaces [0.30458514384586394]
We show the profound impact of dataset quality on model training and performance.
Our findings underscore the importance of appropriate feature selection, adequate data volume, and data quality.
This research offers valuable insights into data assessment practices, contributing to the development of more accurate and robust machine learning models.
arXiv Detail & Related papers (2023-06-27T11:33:31Z)
- Striving for data-model efficiency: Identifying data externalities on group performance [75.17591306911015]
Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance.
We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance evaluated on key sub-groups of the population.
Our results indicate that data-efficiency is a key component of both accurate and trustworthy machine learning.
arXiv Detail & Related papers (2022-11-11T16:48:27Z)
- Advancing Reacting Flow Simulations with Data-Driven Models [50.9598607067535]
Key to effective use of machine learning tools in multi-physics problems is to couple them to physical and computer models.
The present chapter reviews some of the open opportunities for the application of data-driven reduced-order modeling of combustion systems.
arXiv Detail & Related papers (2022-09-05T16:48:34Z)
- Data Quality Toolkit: Automatic assessment of data quality and remediation for machine learning datasets [11.417891017429882]
The Data Quality Toolkit for machine learning is a library of some key quality metrics and relevant remediation techniques.
It can reduce the turn-around times of data preparation pipelines and streamline the data quality assessment process.
arXiv Detail & Related papers (2021-08-12T19:22:27Z)
- Data Quality Measures and Efficient Evaluation Algorithms for Large-Scale High-Dimensional Data [0.15229257192293197]
We propose two data quality measures that can compute class separability and in-class variability, the two important aspects of data quality, for a given dataset.
We provide efficient algorithms to compute our quality measures based on random projections and bootstrapping with statistical benefits on large-scale high-dimensional data.
arXiv Detail & Related papers (2021-01-05T10:23:08Z)
- AutoOD: Automated Outlier Detection via Curiosity-guided Search and Self-imitation Learning [72.99415402575886]
Outlier detection is an important data mining task with numerous practical applications.
We propose AutoOD, an automated outlier detection framework, which aims to search for an optimal neural network model.
Experimental results on various real-world benchmark datasets demonstrate that the deep model identified by AutoOD achieves the best performance.
arXiv Detail & Related papers (2020-06-19T18:57:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.