STAN: Synthetic Network Traffic Generation with Generative Neural Models
- URL: http://arxiv.org/abs/2009.12740v2
- Date: Tue, 3 Aug 2021 02:48:04 GMT
- Title: STAN: Synthetic Network Traffic Generation with Generative Neural Models
- Authors: Shengzhe Xu, Manish Marwah, Martin Arlitt, Naren Ramakrishnan
- Abstract summary: This paper presents STAN (Synthetic network Traffic generation with Autoregressive Neural models), a tool to generate realistic synthetic network traffic datasets.
Our novel neural architecture captures both temporal dependencies and dependence between attributes at any given time.
We evaluate the performance of STAN in terms of the quality of data generated, by training it on both a simulated dataset and a real network traffic data set.
- Score: 10.54843182184416
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning models have achieved great success in recent years but progress
in some domains like cybersecurity is stymied due to a paucity of realistic
datasets. Organizations are reluctant to share such data, even internally, due
to privacy reasons. An alternative is to use synthetically generated data but
existing methods are limited in their ability to capture complex dependency
structures, between attributes and across time. This paper presents STAN
(Synthetic network Traffic generation with Autoregressive Neural models), a
tool to generate realistic synthetic network traffic datasets for subsequent
downstream applications. Our novel neural architecture captures both temporal
dependencies and dependence between attributes at any given time. It integrates
convolutional neural layers with mixture density neural layers and softmax
layers, and models both continuous and discrete variables. We evaluate the
performance of STAN in terms of the quality of data generated, by training it
on both a simulated dataset and a real network traffic dataset. Finally, to
answer the question of whether real network traffic data can be substituted
with synthetic data to train models of comparable accuracy, we train two
anomaly detection models based on self-supervision. The results show only a small
decline in the accuracy of models trained solely on synthetic data. While
current results are encouraging in terms of quality of data generated and
absence of any obvious data leakage from training data, in the future we plan
to further validate this fact by conducting privacy attacks on the generated
data. Other future work includes validating capture of long term dependencies
and making model training ...
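The abstract describes an architecture that combines convolutional layers, mixture density layers, and softmax layers to jointly model continuous and discrete traffic attributes. Below is a minimal, hypothetical PyTorch sketch of that idea, not the authors' released code: it conditions only on a window of past records and emits all attributes of the next record at once, whereas the paper additionally models dependence between attributes within a record. All attribute counts, layer sizes, and names are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedAttributeAR(nn.Module):
    """Sketch: autoregressive model over a sliding window of past flow records."""
    def __init__(self, n_feats=8, n_cont=2, n_disc_classes=4, n_mix=5, hidden=64):
        super().__init__()
        # Convolutional context encoder over the window of previous records.
        self.encoder = nn.Sequential(
            nn.Conv1d(n_feats, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.n_cont, self.n_mix = n_cont, n_mix
        # Mixture-density head: mixture weight, mean, log-std per continuous attribute.
        self.mdn = nn.Linear(hidden, n_cont * n_mix * 3)
        # Softmax head for one discrete attribute (add one head per discrete attribute).
        self.disc = nn.Linear(hidden, n_disc_classes)

    def forward(self, history):                  # history: (batch, n_feats, window)
        h = self.encoder(history).squeeze(-1)    # (batch, hidden)
        params = self.mdn(h).view(-1, self.n_cont, self.n_mix, 3)
        log_pi = F.log_softmax(params[..., 0], dim=-1)
        mu, log_sigma = params[..., 1], params[..., 2]
        return log_pi, mu, log_sigma, self.disc(h)

def loss_fn(log_pi, mu, log_sigma, logits, x_cont, x_disc):
    # Gaussian-mixture negative log-likelihood for continuous attributes
    # plus cross-entropy for the discrete attribute.
    comp = torch.distributions.Normal(mu, log_sigma.exp())
    log_prob = torch.logsumexp(log_pi + comp.log_prob(x_cont.unsqueeze(-1)), dim=-1)
    return -log_prob.sum(dim=-1).mean() + F.cross_entropy(logits, x_disc)

# Usage with random data of the assumed shapes:
model = MixedAttributeAR()
history = torch.randn(32, 8, 16)                 # 32 windows of 16 past records, 8 features
log_pi, mu, log_sigma, logits = model(history)
loss = loss_fn(log_pi, mu, log_sigma, logits,
               torch.randn(32, 2), torch.randint(0, 4, (32,)))
```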
Related papers
- Self-Improving Diffusion Models with Synthetic Data [12.597035060380001]
Self-IMproving diffusion models with Synthetic data (SIMS) is a new training concept for diffusion models.
SIMS uses self-synthesized data to provide negative guidance during the generation process.
It is the first prophylactic generative AI algorithm that can be iteratively trained on self-generated synthetic data without going MAD.
arXiv Detail & Related papers (2024-08-29T08:12:18Z)
- Mind the Gap Between Synthetic and Real: Utilizing Transfer Learning to Probe the Boundaries of Stable Diffusion Generated Data [2.6016285265085526]
Student models show a significant drop in accuracy compared to models trained on real data.
By training individual layers on either real or synthetic data, we reveal that the drop mainly stems from the model's final layers.
Our results suggest an improved trade-off between the amount of real training data used and the model's accuracy.
arXiv Detail & Related papers (2024-05-06T07:51:13Z)
- Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop.
We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models.
We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
A toy simulation of such a self-consuming loop is sketched at the end of this list.
arXiv Detail & Related papers (2024-02-19T02:08:09Z)
- Learning Defect Prediction from Unrealistic Data [57.53586547895278]
Pretrained models of code have become popular choices for code understanding and generation tasks.
Such models tend to be large and require commensurate volumes of training data.
It has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs.
Models trained on such data tend to only perform well on similar data, while underperforming on real world programs.
arXiv Detail & Related papers (2023-11-02T01:51:43Z)
- On the Stability of Iterative Retraining of Generative Models on their own Data [56.153542044045224]
We study the impact of training generative models on mixed datasets.
We first prove the stability of iterative training under the condition that the initial generative models approximate the data distribution well enough.
We empirically validate our theory on both synthetic and natural images by iteratively training normalizing flows and state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-09-30T16:41:04Z)
- Exploring the Effectiveness of Dataset Synthesis: An application of Apple Detection in Orchards [68.95806641664713]
We explore the usability of Stable Diffusion 2.1-base for generating synthetic datasets of apple trees for object detection.
We train a YOLOv5m object detection model to predict apples in a real-world apple detection dataset.
Results demonstrate that the model trained on generated data slightly underperforms a baseline model trained on real-world images.
arXiv Detail & Related papers (2023-06-20T09:46:01Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- A New Benchmark: On the Utility of Synthetic Data with Blender for Bare Supervised Learning and Downstream Domain Adaptation [42.2398858786125]
Deep learning in computer vision has achieved great success at the price of large-scale labeled training data.
The uncontrollable data collection process produces non-IID training and test data, where undesired duplication may exist.
To circumvent them, an alternative is to generate synthetic data via 3D rendering with domain randomization.
arXiv Detail & Related papers (2023-03-16T09:03:52Z)
- Online Evolutionary Neural Architecture Search for Multivariate Non-Stationary Time Series Forecasting [72.89994745876086]
This work presents the Online Neuro-Evolution-based Neural Architecture Search (ONE-NAS) algorithm.
ONE-NAS is a novel neural architecture search method capable of automatically designing and dynamically training recurrent neural networks (RNNs) for online forecasting tasks.
Results demonstrate that ONE-NAS outperforms traditional statistical time series forecasting methods.
arXiv Detail & Related papers (2023-02-20T22:25:47Z)
- MLReal: Bridging the gap between training on synthetic data and real data applications in machine learning [1.9852463786440129]
We describe a novel approach to enhance supervised training on synthetic data with real data features.
In the training stage, the input data are from the synthetic domain and the auto-correlated data are from the real domain.
In the inference/application stage, the input data are from the real subset domain and the mean of the autocorrelated sections are from the synthetic data subset domain.
arXiv Detail & Related papers (2021-09-11T14:43:34Z)
- Synthesizing Irreproducibility in Deep Networks [2.28438857884398]
Modern-day deep networks suffer from irreproducibility (also referred to as nondeterminism or underspecification).
We show that even with a single nonlinearity and for very simple data and models, irreproducibility occurs.
Model complexity and the choice of nonlinearity also play significant roles in making deep models irreproducible.
arXiv Detail & Related papers (2021-02-21T21:51:28Z)
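As referenced in the entry on self-consuming generative models above, here is a toy, self-contained simulation of the loop that it and the iterative-retraining entry study: a kernel density estimator is repeatedly refit on a mixture of real data and samples drawn from its previous generation, and the drift of the estimated standard deviation away from the true value is tracked. This is an illustrative sketch assuming NumPy/SciPy and a one-dimensional Gaussian as the "real" distribution; it does not reproduce either paper's analysis.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, scale=1.0, size=2000)   # "real" data from a known distribution
model = gaussian_kde(real)                         # generation 0 is fit on real data only

for gen in range(1, 6):
    synthetic = model.resample(2000)[0]            # sample from the current model
    mix = np.concatenate([real, synthetic])        # mixed real + self-generated training set
    model = gaussian_kde(mix)                      # refit on the mixture
    print(f"generation {gen}: |std - 1| = {abs(mix.std() - 1.0):.4f}")
```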