Related papers: Hybrid Deep Learning Model using SPCAGAN Augmentation for Insider Threat Analysis

URL: http://arxiv.org/abs/2203.02855v1
Date: Sun, 6 Mar 2022 02:08:48 GMT
Title: Hybrid Deep Learning Model using SPCAGAN Augmentation for Insider Threat Analysis
Authors: R G Gayathri, Atul Sajjanhar, Yong Xiang
Abstract summary: Anomaly detection using deep learning requires comprehensive data, but insider threat data is not readily available due to confidentiality concerns. We propose a linear manifold learning-based generative adversarial network, SPCAGAN, that takes input from heterogeneous data sources. We show that our proposed approach has a lower error, is more accurate, and generates substantially superior synthetic insider threat data than previous models.
Score: 7.576808824987132
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Cyberattacks from within an organization's trusted entities are known as insider threats. Anomaly detection using deep learning requires comprehensive data, but insider threat data is not readily available due to confidentiality concerns of organizations. Therefore, there arises demand to generate synthetic data to explore enhanced approaches for threat analysis. We propose a linear manifold learning-based generative adversarial network, SPCAGAN, that takes input from heterogeneous data sources and adds a novel loss function to train the generator to produce high-quality data that closely resembles the original data distribution. Furthermore, we introduce a deep learning-based hybrid model for insider threat analysis. We provide extensive experiments for data synthesis, anomaly detection, adversarial robustness, and synthetic data quality analysis using benchmark datasets. In this context, empirical comparisons show that GAN-based oversampling is competitive with numerous typical oversampling regimes. For synthetic data generation, our SPCAGAN model overcame the problem of mode collapse and converged faster than previous GAN models. Results demonstrate that our proposed approach has a lower error, is more accurate, and generates substantially superior synthetic insider threat data than previous models.

Related papers

Less is More: Adaptive Coverage for Synthetic Training Data [20.136698279893857]
This study introduces a novel sampling algorithm, based on the maximum coverage problem, to select a representative subset from a synthetically generated dataset. Our results demonstrate that training a classifier on this contextually sampled subset achieves superior performance compared to training on the entire dataset.
arXiv Detail & Related papers (2025-04-20T06:45:16Z)
A Theoretical Perspective: How to Prevent Model Collapse in Self-consuming Training Loops [55.07063067759609]
High-quality data is essential for training large generative models, yet the vast reservoir of real data available online has become nearly depleted. Models increasingly generate their own data for further training, forming Self-consuming Training Loops (STLs). Some models degrade or even collapse, while others successfully avoid these failures, leaving a significant gap in theoretical understanding.
arXiv Detail & Related papers (2025-02-26T06:18:13Z)
Debiasing Synthetic Data Generated by Deep Generative Models [40.165159490379146]
Deep generative models (DGMs) for synthetic data generation induce bias and imprecision in synthetic data analyses. We propose a new strategy that targets synthetic data created by DGMs for specific data analyses. Our approach accounts for biases, enhances convergence rates, and facilitates the calculation of estimators with easily approximated large sample variances.
arXiv Detail & Related papers (2024-11-06T19:24:34Z)
zGAN: An Outlier-focused Generative Adversarial Network For Realistic Synthetic Data Generation [0.0]
"Black swans" have posed a challenge to the performance of classical machine learning models. This article provides an overview of the zGAN model architecture developed for the purpose of generating synthetic data with outlier characteristics. It shows promising results on realistic synthetic data generation, as well as uplift capabilities vis-a-vis model performance.
arXiv Detail & Related papers (2024-10-28T07:55:11Z)
Synthetic Data Generation in Cybersecurity: A Comparative Analysis [0.0]
GAN-based methods, particularly CTGAN and CopulaGAN, outperform non-AI and conventional AI approaches in terms of fidelity and utility. This research contributes to the field by offering the first comparative evaluation of these methods specifically for cybersecurity network traffic data.
arXiv Detail & Related papers (2024-10-18T14:19:25Z)
Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs). Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws. Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z)
Synthetic Oversampling: Theory and A Practical Approach Using LLMs to Address Data Imbalance [16.047084318753377]
Imbalanced data and spurious correlations are common challenges in machine learning and data science. Oversampling, which artificially increases the number of instances in the underrepresented classes, has been widely adopted to tackle these challenges. We introduce OPAL, a systematic oversampling approach that leverages the capabilities of large language models to generate high-quality synthetic data for minority groups.
arXiv Detail & Related papers (2024-06-05T21:24:26Z)
Let's Synthesize Step by Step: Iterative Dataset Synthesis with Large Language Models by Extrapolating Errors from Small Models [69.76066070227452]
*Data Synthesis* is a promising way to train a small model with very little labeled data. We propose *Synthesis Step by Step* (**S3**), a data synthesis framework that shrinks this distribution gap. Our approach improves the performance of a small model by reducing the gap between the synthetic dataset and the real data.
arXiv Detail & Related papers (2023-10-20T17:14:25Z)
Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task. We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
Deceive D: Adaptive Pseudo Augmentation for GAN Training with Limited Data [125.7135706352493]
Generative adversarial networks (GANs) typically require ample data for training in order to synthesize high-fidelity images. Recent studies have shown that training GANs with limited data remains formidable due to discriminator overfitting. This paper introduces a novel strategy called Adaptive Pseudo Augmentation (APA) to encourage healthy competition between the generator and the discriminator.
arXiv Detail & Related papers (2021-11-12T18:13:45Z)
Negative Data Augmentation [127.28042046152954]
We show that negative data augmentation samples provide information on the support of the data distribution. We introduce a new GAN training objective where we use NDA as an additional source of synthetic data for the discriminator. Empirically, models trained with our method achieve improved conditional/unconditional image generation along with improved anomaly detection capabilities.
arXiv Detail & Related papers (2021-02-09T20:28:35Z)
Differentially Private Synthetic Medical Data Generation using Convolutional GANs [7.2372051099165065]
We develop a differentially private framework for synthetic data generation using Rényi differential privacy. Our approach builds on convolutional autoencoders and convolutional generative adversarial networks to preserve some of the critical characteristics of the generated synthetic data. We demonstrate that our model outperforms existing state-of-the-art models under the same privacy budget.
arXiv Detail & Related papers (2020-12-22T01:03:49Z)
Firearm Detection via Convolutional Neural Networks: Comparing a Semantic Segmentation Model Against End-to-End Solutions [68.8204255655161]
Threat detection of weapons and aggressive behavior from live video can be used for rapid detection and prevention of potentially deadly incidents. One way for achieving this is through the use of artificial intelligence and, in particular, machine learning for image analysis. We compare a traditional monolithic end-to-end deep learning model and a previously proposed model based on an ensemble of simpler neural networks detecting fire-weapons via semantic segmentation.
arXiv Detail & Related papers (2020-12-17T15:19:29Z)

This list is automatically generated from the titles and abstracts of the papers on this site.