Technical Report: Generating the WEB-IDS23 Dataset
- URL: http://arxiv.org/abs/2502.03909v1
- Date: Thu, 06 Feb 2025 09:33:02 GMT
- Title: Technical Report: Generating the WEB-IDS23 Dataset
- Authors: Eric Lanfer, Dominik Brockmann, Nils Aschenbruck
- Abstract summary: Several widely used datasets do not include labels which are fine-grained enough. A modular traffic generator can simulate a wide variety of benign and malicious traffic. The dataset captures over 12 million samples with 82 flow-level features and 21 fine-grained labels.
- Score: 1.1101390076342181
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Anomaly-based Network Intrusion Detection Systems (NIDS) require correctly labelled, representative and diverse datasets for accurate evaluation and development. However, several widely used datasets do not include labels which are fine-grained enough and, together with small sample sizes, can lead to overfitting issues that also remain undetected when using test data. Additionally, the cybersecurity sector is evolving fast, and new attack mechanisms require the continuous creation of up-to-date datasets. To address these limitations, we developed a modular traffic generator that can simulate a wide variety of benign and malicious traffic. It incorporates multiple protocols, adds variability through randomization techniques, and can produce attacks alongside corresponding benign traffic, as occurs in real-world scenarios. Using the traffic generator, we create a dataset capturing over 12 million samples with 82 flow-level features and 21 fine-grained labels. Additionally, we include several web attack types which are often underrepresented in other datasets.
Related papers
- What Does Normal Even Mean? Evaluating Benign Traffic in Intrusion Detection Datasets [0.0]
Supervised machine learning techniques rely on labeled data to achieve high task performance. This paper evaluates the structure of benign traffic in several common intrusion detection datasets.
arXiv Detail & Related papers (2025-09-11T15:55:21Z)
- LMDG: Advancing Lateral Movement Detection Through High-Fidelity Dataset Generation [0.2399911126932527]
Lateral Movement (LM) attacks pose a significant threat to enterprise security. Development and evaluation of LM detection systems are impeded by the absence of realistic, well-labeled datasets. We propose LMDG, a scalable framework for generating high-fidelity LM datasets.
arXiv Detail & Related papers (2025-08-04T22:49:04Z)
- Feature Shift Localization Network [51.33484517421393]
We introduce a neural network that can localize feature shifts in large and high-dimensional datasets in a fast and accurate manner. The network, trained with a large number of datasets, learns to extract the statistical properties of the datasets and can localize feature shifts without the need for re-training.
arXiv Detail & Related papers (2025-06-10T15:27:32Z)
- A Novel Approach to Network Traffic Analysis: the HERA tool [0.0]
Cybersecurity threats highlight the need for robust network intrusion detection systems. These systems rely heavily on datasets to train machine learning models capable of detecting patterns and predicting threats. HERA is a new open-source tool that generates flow files and labelled or unlabelled datasets with user-defined features.
arXiv Detail & Related papers (2025-01-13T16:47:52Z)
- Unleashing the Power of Unlabeled Data: A Self-supervised Learning Framework for Cyber Attack Detection in Smart Grids [6.5023425872686085]
We propose a self-supervised learning-based framework to detect and identify various types of cyber attacks.
The proposed framework does not rely on large amounts of well-curated labeled data but makes use of the massive unlabeled data in the wild.
Experiment results in a 5-area power grid system with 37 buses demonstrate the superior performance of our framework over existing approaches.
arXiv Detail & Related papers (2024-05-22T20:04:52Z)
- TII-SSRC-23 Dataset: Typological Exploration of Diverse Traffic Patterns for Intrusion Detection [0.5261718469769447]
Existing datasets often fall short, lacking the necessary diversity and alignment with the contemporary network environment.
This paper introduces TII-SSRC-23, a novel and comprehensive dataset designed to overcome these challenges.
arXiv Detail & Related papers (2023-09-14T05:23:36Z)
- Fusing Pseudo Labels with Weak Supervision for Dynamic Traffic Scenarios [0.0]
We introduce a weakly-supervised label unification pipeline that amalgamates pseudo labels from object detection models trained on heterogeneous datasets.
Our pipeline engenders a unified label space through the amalgamation of labels from disparate datasets, rectifying bias and enhancing generalization.
We retrain a solitary object detection model using the merged label space, culminating in a resilient model proficient in dynamic traffic scenarios.
arXiv Detail & Related papers (2023-08-30T11:33:07Z)
- Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z)
- Anomaly Detection Dataset for Industrial Control Systems [1.2234742322758418]
Industrial Control Systems (ICSs) have been targeted by cyberattacks and are becoming increasingly vulnerable.
The lack of suitable datasets for evaluating Machine Learning algorithms is a challenge.
This paper presents the 'ICS-Flow' dataset, which offers network data and process state variables logs for supervised and unsupervised ML-based IDS assessment.
arXiv Detail & Related papers (2023-05-11T14:52:19Z)
- TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
- Attentive Prototypes for Source-free Unsupervised Domain Adaptive 3D Object Detection [85.11649974840758]
3D object detection networks tend to be biased towards the data they are trained on.
We propose a single-frame approach for source-free, unsupervised domain adaptation of lidar-based 3D object detectors.
arXiv Detail & Related papers (2021-11-30T18:42:42Z)
- Unsupervised Person Re-Identification with Wireless Positioning under Weak Scene Labeling [131.18390399368997]
We propose to explore unsupervised person re-identification with both visual data and wireless positioning trajectories under weak scene labeling.
Specifically, we propose a novel unsupervised multimodal training framework (UMTF), which models the complementarity of visual data and wireless information.
Our UMTF contains a multimodal data association strategy (MMDA) and a multimodal graph neural network (MMGN).
arXiv Detail & Related papers (2021-10-29T08:25:44Z)
- Unsupervised Domain Adaptive Learning via Synthetic Data for Person Re-identification [101.1886788396803]
Person re-identification (re-ID) has gained more and more attention due to its widespread applications in video surveillance.
Unfortunately, the mainstream deep learning methods still need a large quantity of labeled data to train models.
In this paper, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them.
arXiv Detail & Related papers (2021-09-12T15:51:41Z)
- Contextual-Bandit Anomaly Detection for IoT Data in Distributed Hierarchical Edge Computing [65.78881372074983]
IoT devices can hardly afford complex deep neural network (DNN) models, and offloading anomaly detection tasks to the cloud incurs long delays.
We propose and build a demo for an adaptive anomaly detection approach for distributed hierarchical edge computing (HEC) systems.
We show that our proposed approach significantly reduces detection delay without sacrificing accuracy, as compared to offloading detection tasks to the cloud.
arXiv Detail & Related papers (2020-04-15T06:13:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.