Low-Quality Training Data Only? A Robust Framework for Detecting Encrypted Malicious Network Traffic
- URL: http://arxiv.org/abs/2309.04798v1
- Date: Sat, 9 Sep 2023 13:49:30 GMT
- Title: Low-Quality Training Data Only? A Robust Framework for Detecting Encrypted Malicious Network Traffic
- Authors: Yuqi Qing, Qilei Yin, Xinhao Deng, Yihao Chen, Zhuotao Liu, Kun Sun, Ke Xu, Jia Zhang, Qi Li
- Abstract summary: When machine learning models are trained with low-quality training data, they suffer degraded performance.
We develop RAPIER, which fully exploits the different distributions of normal and malicious traffic data in the feature space.
RAPIER effectively achieves encrypted malicious traffic detection with the best F1 score of 0.773 and improves the F1 score of existing methods by an average of 272.5%.
- Score: 19.636282208765547
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine learning (ML) is promising in accurately detecting malicious flows in encrypted network traffic; however, it is challenging to collect a training dataset that contains a sufficient amount of encrypted malicious data with correct labels. When ML models are trained with low-quality training data, they suffer degraded performance. In this paper, we aim to address a real-world low-quality training dataset problem, namely, detecting encrypted malicious traffic generated by continuously evolving malware. We develop RAPIER, which fully utilizes the different distributions of normal and malicious traffic data in the feature space, where normal data is tightly distributed in a certain area and malicious data is scattered over the entire feature space, to augment training data for model training. RAPIER includes two pre-processing modules to convert traffic into feature vectors and correct label noise. We evaluate our system on two public datasets and one combined dataset. With 1000 samples and 45% label noise from each dataset, our system achieves F1 scores of 0.770, 0.776, and 0.855, respectively, achieving average improvements of 352.6%, 284.3%, and 214.9% over the existing methods, respectively. Furthermore, we evaluate RAPIER with a real-world dataset obtained from a security enterprise. RAPIER effectively achieves encrypted malicious traffic detection with the best F1 score of 0.773 and improves the F1 score of existing methods by an average of 272.5%.
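The evaluation figures in the abstract (F1 score, a percentage of label noise, and an average percentage improvement over existing methods) can be illustrated with a minimal Python sketch. This is not the authors' evaluation code. It assumes that label noise means flipping a fixed fraction of binary labels uniformly at random, and that the average improvement is the mean of per-baseline relative F1 gains; the abstract does not specify either. All function names are hypothetical.

```python
import random


def f1_score(y_true, y_pred):
    """F1 score for the positive class (1 = malicious, 0 = normal)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def relative_improvement_pct(ours, baseline):
    """Relative gain of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


def mean_relative_improvement_pct(ours, baselines):
    """Assumed definition: mean of the per-baseline relative gains."""
    gains = [relative_improvement_pct(ours, b) for b in baselines]
    return sum(gains) / len(gains)


def flip_labels(labels, noise_rate, seed=0):
    """Assumed noise model: flip a fixed fraction of binary labels at random."""
    rng = random.Random(seed)
    n_flip = round(noise_rate * len(labels))
    flipped = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flipped else y for i, y in enumerate(labels)]
```

Under these assumptions, an improvement above 100% simply means the new F1 score is more than double the baseline F1 score, which is possible when baselines trained on heavily mislabeled data score very low.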
Related papers
- Decorrelating Structure via Adapters Makes Ensemble Learning Practical for Semi-supervised Learning [50.868594148443215]
In computer vision, traditional ensemble learning methods exhibit either low training efficiency or limited performance.
We propose a lightweight, loss-function-free, and architecture-agnostic ensemble learning method, Decorrelating Structure via Adapters (DSA), for various visual tasks.
arXiv Detail & Related papers (2024-08-08T01:31:38Z) - How to Train Data-Efficient LLMs [56.41105687693619]
We study data-efficient approaches for pre-training large language models (LLMs).
In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density sampling are the best methods in their respective categories.
arXiv Detail & Related papers (2024-02-15T02:27:57Z) - FedCSD: A Federated Learning Based Approach for Code-Smell Detection [7.026278088747708]
This paper proposes a Federated Learning Code Smell Detection approach that allows organizations to collaboratively train ML models.
Three experiments have leveraged three manually validated datasets aimed at detecting and examining different code smell scenarios.
An accuracy of 98.34% was achieved by the global model that has been trained using 10 companies for 100 training rounds.
arXiv Detail & Related papers (2023-05-31T09:51:45Z) - ET-BERT: A Contextualized Datagram Representation with Pre-training Transformers for Encrypted Traffic Classification [9.180725486824118]
We propose a new traffic representation model called Encrypted Traffic Bidirectional Representations from Transformer (ET-BERT).
The pre-trained model can be fine-tuned on a small number of task-specific labeled data and achieves state-of-the-art performance across five encrypted traffic classification tasks.
arXiv Detail & Related papers (2022-02-13T14:54:48Z) - Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples for which model confidence exceeds that threshold.
arXiv Detail & Related papers (2022-01-11T23:01:12Z) - A data-centric weak supervised learning for highway traffic incident detection [1.0323063834827415]
We focus on a data-centric approach to improve the accuracy and reduce the false alarm rate of traffic incident detection on highways.
We develop a weak supervised learning workflow to generate high-quality training labels for the incident data without the ground truth labels.
Overall, we show that our proposed weak supervised learning workflow achieves a high incident detection rate (0.90) and low false alarm rate (0.08).
arXiv Detail & Related papers (2021-12-17T22:14:47Z) - Attentive Prototypes for Source-free Unsupervised Domain Adaptive 3D Object Detection [85.11649974840758]
3D object detection networks tend to be biased towards the data they are trained on.
We propose a single-frame approach for source-free, unsupervised domain adaptation of lidar-based 3D object detectors.
arXiv Detail & Related papers (2021-11-30T18:42:42Z) - What Stops Learning-based 3D Registration from Working in the Real World? [53.68326201131434]
This work identifies the sources of 3D point cloud registration failures, analyzes the reasons behind them, and proposes solutions.
Ultimately, this translates to a best-practice 3D registration network (BPNet), constituting the first learning-based method able to handle previously-unseen objects in real-world data.
Our model generalizes to real data without any fine-tuning, reaching an accuracy of up to 67% on point clouds of unseen objects obtained with a commercial sensor.
arXiv Detail & Related papers (2021-11-19T19:24:27Z) - Active Learning of Neural Collision Handler for Complex 3D Mesh Deformations [68.0524382279567]
We present a robust learning algorithm to detect and handle collisions in 3D deforming meshes.
Our approach outperforms supervised learning methods and achieves 93.8-98.1% accuracy.
arXiv Detail & Related papers (2021-10-08T04:08:31Z) - Modern Cybersecurity Solution using Supervised Machine Learning [0.456877715768796]
Traditional firewalls and intrusion detection systems fail to detect new attacks, zero-day attacks, and traffic patterns that do not match the configured rules.
We used Netflow datasets to extract features after applying data analysis.
Our experiments focus on how efficiently machine learning algorithms can detect bot traffic, malware traffic, and background traffic.
arXiv Detail & Related papers (2021-09-15T22:03:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.