Related papers: STAN: Synthetic Network Traffic Generation with Generative Neural Models

URL: http://arxiv.org/abs/2009.12740v2
Date: Tue, 3 Aug 2021 02:48:04 GMT
Title: STAN: Synthetic Network Traffic Generation with Generative Neural Models
Authors: Shengzhe Xu, Manish Marwah, Martin Arlitt, Naren Ramakrishnan
Abstract summary: This paper presents STAN (Synthetic network Traffic generation with Autoregressive Neural models), a tool to generate realistic synthetic network traffic datasets. Our novel neural architecture captures both temporal dependencies and dependence between attributes at any given time. We evaluate the performance of STAN in terms of the quality of data generated, by training it on both a simulated dataset and a real network traffic dataset.
Score: 10.54843182184416
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Deep learning models have achieved great success in recent years, but progress in some domains like cybersecurity is stymied by a paucity of realistic datasets. Organizations are reluctant to share such data, even internally, for privacy reasons. An alternative is to use synthetically generated data, but existing methods are limited in their ability to capture complex dependency structures between attributes and across time. This paper presents STAN (Synthetic network Traffic generation with Autoregressive Neural models), a tool to generate realistic synthetic network traffic datasets for subsequent downstream applications. Our novel neural architecture captures both temporal dependencies and dependence between attributes at any given time. It integrates convolutional neural layers with mixture density neural layers and softmax layers, and models both continuous and discrete variables. We evaluate the performance of STAN in terms of the quality of data generated, by training it on both a simulated dataset and a real network traffic dataset. Finally, to answer the question of whether real network traffic data can be substituted with synthetic data to train models of comparable accuracy, we train two anomaly detection models based on self-supervision. The results show only a small decline in the accuracy of models trained solely on synthetic data. While current results are encouraging in terms of quality of data generated and absence of any obvious data leakage from training data, in the future we plan to further validate this by conducting privacy attacks on the generated data. Other future work includes validating capture of long term dependencies and making model training

Related papers

Scaling Laws of Synthetic Data for Language Models [132.67350443447611]
We introduce SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm.
arXiv Detail & Related papers (2025-03-25T11:07:12Z)
How to Synthesize Text Data without Model Collapse? [37.219627817995054]
Model collapse in synthetic data indicates that iterative training on self-generated data leads to a gradual decline in performance. We propose token editing on human-produced data to obtain semi-synthetic data.
arXiv Detail & Related papers (2024-12-19T09:43:39Z)
Collapse or Thrive? Perils and Promises of Synthetic Data in a Self-Generating World [19.266191284270793]
Generative machine learning models are pretrained on web-scale datasets containing data generated by earlier models. Some prior work warns of "model collapse" as the web is overwhelmed by synthetic data. We report experiments on three ways of using data (training-workflows) across three generative model task-settings.
arXiv Detail & Related papers (2024-10-22T05:49:24Z)
Self-Improving Diffusion Models with Synthetic Data [12.597035060380001]
Self-IMproving diffusion models with Synthetic data (SIMS) is a new training concept for diffusion models. SIMS uses self-synthesized data to provide negative guidance during the generation process. It is the first prophylactic generative AI algorithm that can be iteratively trained on self-generated synthetic data without going MAD.
arXiv Detail & Related papers (2024-08-29T08:12:18Z)
Mind the Gap Between Synthetic and Real: Utilizing Transfer Learning to Probe the Boundaries of Stable Diffusion Generated Data [2.6016285265085526]
Student models show a significant drop in accuracy compared to models trained on real data. By training these layers using either real or synthetic data, we reveal that the drop mainly stems from the model's final layers. Our results suggest an improved trade-off between the amount of real training data used and the model's accuracy.
arXiv Detail & Related papers (2024-05-06T07:51:13Z)
Online Data Augmentation for Forecasting with Deep Learning [0.33554367023486936]
This work introduces an online data augmentation framework that generates synthetic samples during the training of neural networks. We maintain a balanced representation between real and synthetic data throughout the training process. Experiments suggest that online data augmentation leads to better forecasting performance compared to offline data augmentation or no augmentation approaches.
arXiv Detail & Related papers (2024-04-25T17:16:13Z)
Towards Theoretical Understandings of Self-Consuming Generative Models [56.84592466204185]
This paper tackles the emerging challenge of training generative models within a self-consuming loop. We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models. We present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
arXiv Detail & Related papers (2024-02-19T02:08:09Z)
Learning Defect Prediction from Unrealistic Data [57.53586547895278]
Pretrained models of code have become popular choices for code understanding and generation tasks. Such models tend to be large and require commensurate volumes of training data. It has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs. Models trained on such data tend to only perform well on similar data, while underperforming on real world programs.
arXiv Detail & Related papers (2023-11-02T01:51:43Z)
On the Stability of Iterative Retraining of Generative Models on their own Data [56.153542044045224]
We study the impact of training generative models on mixed datasets. We first prove the stability of iterative training under the condition that the initial generative models approximate the data distribution well enough. We empirically validate our theory on both synthetic and natural images by iteratively training normalizing flows and state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-09-30T16:41:04Z)
Exploring the Effectiveness of Dataset Synthesis: An application of Apple Detection in Orchards [68.95806641664713]
We explore the usability of Stable Diffusion 2.1-base for generating synthetic datasets of apple trees for object detection. We train a YOLOv5m object detection model to predict apples in a real-world apple detection dataset. Results demonstrate that the model trained on generated data is slightly underperforming compared to a baseline model trained on real-world images.
arXiv Detail & Related papers (2023-06-20T09:46:01Z)
Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task. We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
A New Benchmark: On the Utility of Synthetic Data with Blender for Bare Supervised Learning and Downstream Domain Adaptation [42.2398858786125]
Deep learning in computer vision has achieved great success at the price of large-scale labeled training data. The uncontrollable data collection process produces non-IID training and test data, where undesired duplication may exist. To circumvent these issues, an alternative is to generate synthetic data via 3D rendering with domain randomization.
arXiv Detail & Related papers (2023-03-16T09:03:52Z)
Online Evolutionary Neural Architecture Search for Multivariate Non-Stationary Time Series Forecasting [72.89994745876086]
This work presents the Online Neuro-Evolution-based Neural Architecture Search (ONE-NAS) algorithm. ONE-NAS is a novel neural architecture search method capable of automatically designing and dynamically training recurrent neural networks (RNNs) for online forecasting tasks. Results demonstrate that ONE-NAS outperforms traditional statistical time series forecasting methods.
arXiv Detail & Related papers (2023-02-20T22:25:47Z)
MLReal: Bridging the gap between training on synthetic data and real data applications in machine learning [1.9852463786440129]
We describe a novel approach to enhance supervised training on synthetic data with real data features. In the training stage, the input data are from the synthetic domain and the autocorrelated data are from the real domain. In the inference/application stage, the input data are from the real subset domain and the mean of the autocorrelated sections is from the synthetic data subset domain.
arXiv Detail & Related papers (2021-09-11T14:43:34Z)
Synthesizing Irreproducibility in Deep Networks [2.28438857884398]
Modern-day deep networks suffer from irreproducibility (also referred to as nondeterminism or underspecification). We show that even with a single nonlinearity and for very simple data and models, irreproducibility occurs. Model complexity and the choice of nonlinearity also play significant roles in making deep models irreproducible.
arXiv Detail & Related papers (2021-02-21T21:51:28Z)

This list is automatically generated from the titles and abstracts of the papers on this site.