Unsupervised Dataset Cleaning Framework for Encrypted Traffic Classification
- URL: http://arxiv.org/abs/2509.00701v1
- Date: Sun, 31 Aug 2025 05:01:04 GMT
- Title: Unsupervised Dataset Cleaning Framework for Encrypted Traffic Classification
- Authors: Kun Qiu, Ying Wang, Baoqian Li, Wenjun Zhu
- Abstract summary: We present an unsupervised framework that automatically cleans encrypted mobile traffic. Our framework incurs only a 2%~2.5% reduction in classification accuracy compared with manual cleaning.
- Score: 5.458928203044594
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traffic classification, a technique for assigning network flows to predefined categories, has been widely deployed in enterprise and carrier networks. With the massive adoption of mobile devices, encryption is increasingly used in mobile applications to address privacy concerns. Consequently, traditional methods such as Deep Packet Inspection (DPI) fail to distinguish encrypted traffic. To tackle this challenge, Artificial Intelligence (AI), in particular Machine Learning (ML), has emerged as a promising solution for encrypted traffic classification. A crucial prerequisite for any ML-based approach is traffic data cleaning, which removes flows that are not useful for training (e.g., irrelevant protocols, background activity, control-plane messages, and long-lived sessions). Existing cleaning solutions depend on manual inspection of every captured packet, making the process both costly and time-consuming. In this poster, we present an unsupervised framework that automatically cleans encrypted mobile traffic. Evaluation on real-world datasets shows that our framework incurs only a 2%~2.5% reduction in classification accuracy compared with manual cleaning. These results demonstrate that our method offers an efficient and effective preprocessing step for ML-based encrypted traffic classification.
Related papers
- What Does Normal Even Mean? Evaluating Benign Traffic in Intrusion Detection Datasets [0.0]
Supervised machine learning techniques rely on labeled data to achieve high task performance.
This paper evaluates the structure of benign traffic in several common intrusion detection datasets.
arXiv Detail & Related papers (2025-09-11T15:55:21Z)
- Language of Network: A Generative Pre-trained Model for Encrypted Traffic Comprehension [16.795038178588324]
Deep learning is currently the predominant approach for encrypted traffic classification through feature analysis.
We present GBC, a generative model based on pre-training for encrypted traffic comprehension.
It achieves superior results in both traffic classification and generation tasks, resulting in a 5% improvement in F1 score compared to state-of-the-art methods for classification tasks.
arXiv Detail & Related papers (2025-05-26T04:04:29Z)
- NetFlowGen: Leveraging Generative Pre-training for Network Traffic Dynamics [72.95483148058378]
We propose to pre-train a general-purpose machine learning model to capture traffic dynamics with only traffic data from NetFlow records.
We address challenges such as unifying network feature representations, learning from large unlabeled traffic data volume, and testing on real downstream tasks in DDoS attack detection.
arXiv Detail & Related papers (2024-12-30T00:47:49Z)
- MIETT: Multi-Instance Encrypted Traffic Transformer for Encrypted Traffic Classification [59.96233305733875]
Classifying traffic is essential for detecting security threats and optimizing network management.
We propose a Multi-Instance Encrypted Traffic Transformer (MIETT) to capture both token-level and packet-level relationships.
MIETT achieves results across five datasets, demonstrating its effectiveness in classifying encrypted traffic and understanding complex network behaviors.
arXiv Detail & Related papers (2024-12-19T12:52:53Z)
- Lens: A Foundation Model for Network Traffic [19.3652490585798]
Lens is a foundation model for network traffic that leverages the T5 architecture to learn the pre-trained representations from large-scale unlabeled data.
We design a novel loss that combines three distinct tasks: Masked Span Prediction (MSP), Packet Order Prediction (POP), and Homologous Traffic Prediction (HTP).
arXiv Detail & Related papers (2024-02-06T02:45:13Z)
- Machine Learning for Encrypted Malicious Traffic Detection: Approaches, Datasets and Comparative Study [6.267890584151111]
In the post-COVID-19 environment, malicious traffic encryption is growing rapidly.
We formulate a universal framework of machine learning based encrypted malicious traffic detection techniques.
We implement and compare 10 encrypted malicious traffic detection algorithms.
arXiv Detail & Related papers (2022-03-17T14:00:55Z)
- A Lightweight, Efficient and Explainable-by-Design Convolutional Neural Network for Internet Traffic Classification [9.365794791156972]
This paper introduces a new Lightweight, Efficient and eXplainable-by-design convolutional neural network (LEXNet) for Internet traffic classification.
LEXNet relies on a new residual block (for lightweight and efficiency purposes) and prototype layer (for explainability).
Based on a commercial-grade dataset, our evaluation shows that LEXNet succeeds in maintaining the same accuracy as the best performing state-of-the-art neural network.
arXiv Detail & Related papers (2022-02-11T10:21:34Z)
- Automated Machine Learning Techniques for Data Streams [91.3755431537592]
This paper surveys the state-of-the-art open-source AutoML tools, applies them to data collected from streams, and measures how their performance changes over time.
The results show that off-the-shelf AutoML tools can provide satisfactory results, but in the presence of concept drift, detection or adaptation techniques have to be applied to maintain the predictive accuracy over time.
arXiv Detail & Related papers (2021-06-14T11:42:46Z)
- DoS and DDoS Mitigation Using Variational Autoencoders [15.23225419183423]
We explore the potential of Variational Autoencoders to serve as a component within an intelligent security solution.
Two methods based on the ability of Variational Autoencoders to learn latent representations from network traffic flows are proposed.
arXiv Detail & Related papers (2021-05-14T15:38:40Z)
- Deep Learning and Traffic Classification: Lessons learned from a commercial-grade dataset with hundreds of encrypted and zero-day applications [72.02908263225919]
We share our experience on a commercial-grade DL traffic classification engine.
We identify known applications from encrypted traffic, as well as unknown zero-day applications.
We propose a novel technique, tailored for DL models, that is significantly more accurate and light-weight than the state of the art.
arXiv Detail & Related papers (2021-04-07T15:21:22Z)
- Privacy-preserving Traffic Flow Prediction: A Federated Learning Approach [61.64006416975458]
We propose a privacy-preserving machine learning technique named Federated Learning-based Gated Recurrent Unit neural network algorithm (FedGRU) for traffic flow prediction.
FedGRU differs from current centralized learning methods and updates universal learning models through a secure parameter aggregation mechanism.
It is shown that FedGRU's prediction accuracy is 90.96% higher than the advanced deep learning models.
arXiv Detail & Related papers (2020-03-19T13:07:49Z)
- Key Points Estimation and Point Instance Segmentation Approach for Lane Detection [65.37887088194022]
We propose a traffic line detection method called Point Instance Network (PINet).
The PINet includes several stacked hourglass networks that are trained simultaneously.
The PINet achieves competitive accuracy and false-positive rates on the TuSimple and Culane datasets.
arXiv Detail & Related papers (2020-02-16T15:51:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.