Flow Exporter Impact on Intelligent Intrusion Detection Systems
- URL: http://arxiv.org/abs/2412.14021v1
- Date: Wed, 18 Dec 2024 16:38:20 GMT
- Title: Flow Exporter Impact on Intelligent Intrusion Detection Systems
- Authors: Daniela Pinto, João Vitorino, Eva Maia, Ivone Amorim, Isabel Praça,
- Abstract summary: High-quality datasets are critical for training machine learning models.
Inconsistencies in feature generation can hinder the accuracy and reliability of threat detection.
This paper investigates the impact of flow exporters on the performance and reliability of machine learning models for intrusion detection.
- Score: 0.0
- License:
- Abstract: High-quality datasets are critical for training machine learning models, as inconsistencies in feature generation can hinder the accuracy and reliability of threat detection. For this reason, ensuring the quality of the data in network intrusion detection datasets is important. A key component of this is using reliable tools to generate the flows and features present in the datasets. This paper investigates the impact of flow exporters on the performance and reliability of machine learning models for intrusion detection. Using HERA, a tool designed to export flows and extract features, the raw network packets of two widely used datasets, UNSW-NB15 and CIC-IDS2017, were processed from PCAP files to generate new versions of these datasets. These were compared to the original ones in terms of their influence on the performance of several models, including Random Forest, XGBoost, LightGBM, and Explainable Boosting Machine. The results obtained were significant. Models trained on the HERA version of the datasets consistently outperformed those trained on the original dataset, showing improvements in accuracy and indicating a better generalisation. This highlighted the importance of flow generation in the model's ability to differentiate between benign and malicious traffic.
Related papers
- A Novel Approach to Network Traffic Analysis: the HERA tool [0.0]
Cybersecurity threats highlight the need for robust network intrusion detection systems.
These systems rely heavily on datasets to train machine learning models capable of detecting patterns and predicting threats.
HERA is a new open-source tool that generates flow files and labelled or unlabelled datasets with user-defined features.
arXiv Detail & Related papers (2025-01-13T16:47:52Z) - Efficient Network Traffic Feature Sets for IoT Intrusion Detection [0.0]
This work evaluates the feature sets provided by a combination of different feature selection methods, namely Information Gain, Chi-Squared Test, Recursive Feature Elimination, Mean Absolute Deviation, and Dispersion Ratio, in multiple IoT network datasets.
The influence of the smaller feature sets on both the classification performance and the training time of ML models is compared, with the aim of increasing the computational efficiency of IoT intrusion detection.
arXiv Detail & Related papers (2024-06-12T09:51:29Z) - DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception [78.26734070960886]
Current perceptive models heavily depend on resource-intensive datasets.
We introduce perception-aware loss (P.A. loss) through segmentation, improving both quality and controllability.
Our method customizes data augmentation by extracting and utilizing perception-aware attribute (P.A. Attr) during generation.
arXiv Detail & Related papers (2024-03-20T04:58:03Z) - TII-SSRC-23 Dataset: Typological Exploration of Diverse Traffic Patterns
for Intrusion Detection [0.5261718469769447]
Existing datasets often fall short, lacking the necessary diversity and alignment with the contemporary network environment.
This paper introduces TII-SSRC-23, a novel and comprehensive dataset designed to overcome these challenges.
arXiv Detail & Related papers (2023-09-14T05:23:36Z) - Striving for data-model efficiency: Identifying data externalities on
group performance [75.17591306911015]
Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance.
We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance evaluated on key sub-groups of the population.
Our results indicate that data-efficiency is a key component of both accurate and trustworthy machine learning.
arXiv Detail & Related papers (2022-11-11T16:48:27Z) - Watermarking for Out-of-distribution Detection [76.20630986010114]
Out-of-distribution (OOD) detection aims to identify OOD data based on representations extracted from well-trained deep models.
We propose a general methodology named watermarking in this paper.
We learn a unified pattern that is superimposed onto features of original data, and the model's detection capability is largely boosted after watermarking.
arXiv Detail & Related papers (2022-10-27T06:12:32Z) - Data Curation and Quality Assurance for Machine Learning-based Cyber
Intrusion Detection [1.0276024900942873]
This article first summarizes existing machine learning-based intrusion detection systems and the datasets used for building these systems.
The experimental results show that BERT and GPT were the best algorithms for HIDS on all of the datasets.
We then evaluate the data quality of the 11 datasets based on quality dimensions proposed in this paper to determine the best characteristics that a HIDS dataset should possess in order to yield the best possible result.
arXiv Detail & Related papers (2021-05-20T21:31:46Z) - Hidden Biases in Unreliable News Detection Datasets [60.71991809782698]
We show that selection bias during data collection leads to undesired artifacts in the datasets.
We observed a significant drop (>10%) in accuracy for all models tested in a clean split with no train/test source overlap.
We suggest future dataset creation include a simple model as a difficulty/bias probe and future model development use a clean non-overlapping site and date split.
arXiv Detail & Related papers (2021-04-20T17:16:41Z) - Influence Functions in Deep Learning Are Fragile [52.31375893260445]
influence functions approximate the effect of samples in test-time predictions.
influence estimates are fairly accurate for shallow networks.
Hessian regularization is important to get highquality influence estimates.
arXiv Detail & Related papers (2020-06-25T18:25:59Z) - Why Normalizing Flows Fail to Detect Out-of-Distribution Data [51.552870594221865]
Normalizing flows fail to distinguish between in- and out-of-distribution data.
We demonstrate that flows learn local pixel correlations and generic image-to-latent-space transformations.
We show that by modifying the architecture of flow coupling layers we can bias the flow towards learning the semantic structure of the target data.
arXiv Detail & Related papers (2020-06-15T17:00:01Z) - On the Role of Dataset Quality and Heterogeneity in Model Confidence [27.657631193015252]
Safety-critical applications require machine learning models that output accurate and calibrated probabilities.
Uncalibrated deep networks are known to make over-confident predictions.
We study the impact of dataset quality by studying the impact of dataset size and the label noise on the model confidence.
arXiv Detail & Related papers (2020-02-23T05:13:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.