STAN: Synthetic Network Traffic Generation with Generative Neural Models
- URL: http://arxiv.org/abs/2009.12740v2
- Date: Tue, 3 Aug 2021 02:48:04 GMT
- Title: STAN: Synthetic Network Traffic Generation with Generative Neural Models
- Authors: Shengzhe Xu, Manish Marwah, Martin Arlitt, Naren Ramakrishnan
- Abstract summary: This paper presents STAN (Synthetic network Traffic generation with Autoregressive Neural models), a tool to generate realistic synthetic network traffic datasets.
Our novel neural architecture captures both temporal dependencies and dependence between attributes at any given time.
We evaluate the performance of STAN in terms of the quality of data generated, by training it on both a simulated dataset and a real network traffic data set.
- Score: 10.54843182184416
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning models have achieved great success in recent years but progress
in some domains like cybersecurity is stymied due to a paucity of realistic
datasets. Organizations are reluctant to share such data, even internally, due
to privacy reasons. An alternative is to use synthetically generated data but
existing methods are limited in their ability to capture complex dependency
structures, between attributes and across time. This paper presents STAN
(Synthetic network Traffic generation with Autoregressive Neural models), a
tool to generate realistic synthetic network traffic datasets for subsequent
downstream applications. Our novel neural architecture captures both temporal
dependencies and dependence between attributes at any given time. It integrates
convolutional neural layers with mixture density neural layers and softmax
layers, and models both continuous and discrete variables. We evaluate the
performance of STAN in terms of the quality of data generated, by training it
on both a simulated dataset and a real network traffic data set. Finally, to
answer the question - can real network traffic data be substituted with
synthetic data to train models of comparable accuracy? We train two anomaly
detection models based on self-supervision. The results show only a small
decline in the accuracy of models trained solely on synthetic data. While
current results are encouraging in terms of quality of data generated and
absence of any obvious data leakage from training data, in the future we plan
to further validate this fact by conducting privacy attacks on the generated
data. Other future work includes validating capture of long term dependencies
and making model training
Related papers
- How to Synthesize Text Data without Model Collapse? [37.219627817995054]
Model collapse in synthetic data indicates that iterative training on self-generated data leads to a gradual decline in performance.
We propose token editing on human-produced data to obtain semi-synthetic data.
arXiv Detail & Related papers (2024-12-19T09:43:39Z) - Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World [19.266191284270793]
generative machine learning models are pretrained on web-scale datasets containing data generated by earlier models.
Some prior work warns of "model collapse" as the web is overwhelmed by synthetic data.
We report experiments on three ways of using data (training-workflows) across three generative model task-settings.
arXiv Detail & Related papers (2024-10-22T05:49:24Z) - Self-Improving Diffusion Models with Synthetic Data [12.597035060380001]
Self-IM diffusion models with Synthetic data (SIMS) is a new training concept for diffusion models.
SIMS uses self-synthesized data to provide negative guidance during the generation process.
It is the first prophylactic generative AI algorithm that can be iteratively trained on self-generated synthetic data without going MAD.
arXiv Detail & Related papers (2024-08-29T08:12:18Z) - Online Data Augmentation for Forecasting with Deep Learning [0.33554367023486936]
This work introduces an online data augmentation framework that generates synthetic samples during the training of neural networks.
We maintain a balanced representation between real and synthetic data throughout the training process.
Experiments suggest that online data augmentation leads to better forecasting performance compared to offline data augmentation or no augmentation approaches.
arXiv Detail & Related papers (2024-04-25T17:16:13Z) - Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop.
We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models.
We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
arXiv Detail & Related papers (2024-02-19T02:08:09Z) - Learning Defect Prediction from Unrealistic Data [57.53586547895278]
Pretrained models of code have become popular choices for code understanding and generation tasks.
Such models tend to be large and require commensurate volumes of training data.
It has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs.
Models trained on such data tend to only perform well on similar data, while underperforming on real world programs.
arXiv Detail & Related papers (2023-11-02T01:51:43Z) - On the Stability of Iterative Retraining of Generative Models on their own Data [56.153542044045224]
We study the impact of training generative models on mixed datasets.
We first prove the stability of iterative training under the condition that the initial generative models approximate the data distribution well enough.
We empirically validate our theory on both synthetic and natural images by iteratively training normalizing flows and state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-09-30T16:41:04Z) - Exploring the Effectiveness of Dataset Synthesis: An application of
Apple Detection in Orchards [68.95806641664713]
We explore the usability of Stable Diffusion 2.1-base for generating synthetic datasets of apple trees for object detection.
We train a YOLOv5m object detection model to predict apples in a real-world apple detection dataset.
Results demonstrate that the model trained on generated data is slightly underperforming compared to a baseline model trained on real-world images.
arXiv Detail & Related papers (2023-06-20T09:46:01Z) - Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z) - Online Evolutionary Neural Architecture Search for Multivariate
Non-Stationary Time Series Forecasting [72.89994745876086]
This work presents the Online Neuro-Evolution-based Neural Architecture Search (ONE-NAS) algorithm.
ONE-NAS is a novel neural architecture search method capable of automatically designing and dynamically training recurrent neural networks (RNNs) for online forecasting tasks.
Results demonstrate that ONE-NAS outperforms traditional statistical time series forecasting methods.
arXiv Detail & Related papers (2023-02-20T22:25:47Z) - MLReal: Bridging the gap between training on synthetic data and real
data applications in machine learning [1.9852463786440129]
We describe a novel approach to enhance supervised training on synthetic data with real data features.
In the training stage, the input data are from the synthetic domain and the auto-correlated data are from the real domain.
In the inference/application stage, the input data are from the real subset domain and the mean of the autocorrelated sections are from the synthetic data subset domain.
arXiv Detail & Related papers (2021-09-11T14:43:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.