Dazzle: Using Optimized Generative Adversarial Networks to Address
Security Data Class Imbalance Issue
- URL: http://arxiv.org/abs/2203.11410v1
- Date: Tue, 22 Mar 2022 01:43:06 GMT
- Title: Dazzle: Using Optimized Generative Adversarial Networks to Address
Security Data Class Imbalance Issue
- Authors: Rui Shu, Tianpei Xia, Laurie Williams, Tim Menzies
- Abstract summary: We introduce an approach called Dazzle, which is an optimized version of conditional Wasserstein Generative Adversarial Networks with gradient penalty (cWGAN-GP).
We use Dazzle to generate minority class samples to resample the original imbalanced training dataset.
We show that Dazzle is practical to use and demonstrates promising improvement over existing state-of-the-art oversampling techniques.
- Score: 35.0689225703137
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Background: Machine learning techniques have been widely used and demonstrate
promising performance in many software security tasks such as software
vulnerability prediction. However, the class ratio within software
vulnerability datasets is often highly imbalanced (since the percentage of
observed vulnerabilities is usually very low). Goal: To help security
practitioners address class imbalance in software security data and
build better prediction models with resampled datasets. Method: We
introduce an approach called Dazzle which is an optimized version of
conditional Wasserstein Generative Adversarial Networks with gradient penalty
(cWGAN-GP). Dazzle explores the architecture hyperparameters of cWGAN-GP with a
novel optimizer called Bayesian Optimization. We use Dazzle to generate
minority class samples to resample the original imbalanced training dataset.
Results: We evaluate Dazzle with three software security datasets, i.e., Moodle
vulnerable files, Ambari bug reports, and JavaScript function code. We show
that Dazzle is practical to use and demonstrates promising improvement over
existing state-of-the-art oversampling techniques such as SMOTE (e.g., with an
average of about 60% improvement rate over SMOTE in recall among all datasets).
Conclusion: Based on this study, we suggest the use of optimized GANs as
an alternative method for addressing class imbalance in security vulnerability data.
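The resampling step described in the abstract (generate minority class samples, then add them to the imbalanced training set) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `oversample_minority`, `interpolate`, and `recall` are hypothetical, the task is assumed to be binary, and the generator is a simplified SMOTE-like interpolation standing in for a trained cWGAN-GP generator.

```python
import random


def oversample_minority(X, y, minority_label, generate, seed=0):
    """Return copies of X and y with synthetic minority rows appended
    until the two classes are the same size (binary labels assumed)."""
    rng = random.Random(seed)
    minority = [row for row, label in zip(X, y) if label == minority_label]
    n_majority = len(y) - len(minority)
    n_needed = max(0, n_majority - len(minority))
    synthetic = [generate(minority, rng) for _ in range(n_needed)]
    return X + synthetic, y + [minority_label] * n_needed


def interpolate(minority, rng):
    """Stand-in generator (assumption): a random point on the segment
    between two randomly chosen minority rows. Real SMOTE restricts the
    second row to a nearest neighbour; Dazzle would sample from a GAN."""
    a, b = rng.choice(minority), rng.choice(minority)
    t = rng.random()
    return [ai + t * (bi - ai) for ai, bi in zip(a, b)]


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


if __name__ == "__main__":
    # Toy training set: four majority rows (label 0), two minority rows (label 1).
    X_train = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [10.0, 10.0], [12.0, 14.0]]
    y_train = [0, 0, 0, 0, 1, 1]
    X_res, y_res = oversample_minority(X_train, y_train, 1, interpolate)
    print(len(X_res), y_res.count(0), y_res.count(1))
```

Only the training split is resampled here, matching the abstract; a metric such as `recall` would then be computed on an untouched test split. Any callable with the same `(minority, rng)` signature could replace `interpolate`.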
Related papers
- GShield: Mitigating Poisoning Attacks in Federated Learning [2.6260952524631787]
Federated Learning (FL) has recently emerged as a revolutionary approach to collaborative training Machine Learning models.
It enables decentralized model training while preserving data privacy, but its distributed nature makes it highly vulnerable to a severe attack known as Data Poisoning.
We present a novel defense mechanism called GShield, designed to detect and mitigate malicious and low-quality updates.
arXiv Detail & Related papers (2025-12-22T11:29:28Z) - Cyberattack Detection in Critical Infrastructure and Supply Chains [0.0]
Intrusion Detection Systems (IDS) are deployed to counter cyberattacks.
Although an IDS effectively detects attacks based on known signatures and patterns, zero-day attacks go undetected.
To overcome this drawback in IDS, the integration of a Dense Neural Network (DNN) with Data Augmentation is proposed.
It makes IDS intelligent and enables it to self-learn with high accuracy when a novel attack is encountered.
arXiv Detail & Related papers (2025-10-21T20:38:58Z) - Smart Cuts: Enhance Active Learning for Vulnerability Detection by Pruning Bad Seeds [15.490968013867562]
Vulnerability detection is crucial for identifying security weaknesses in software systems.
This paper proposes a novel dataset maps-empowered approach that identifies and mitigates hard-to-learn outliers.
Our approach can categorize training examples based on learning difficulty and integrate this information into an active learning framework.
arXiv Detail & Related papers (2025-06-25T13:50:21Z) - SCIZOR: A Self-Supervised Approach to Data Curation for Large-Scale Imitation Learning [30.34323856102674]
Imitation learning advances robot capabilities by enabling the acquisition of diverse behaviors from human demonstrations.
Existing robotic curation approaches rely on costly manual annotations and perform curation at a coarse granularity.
We introduce SCIZOR, a self-supervised data curation framework that filters out low-quality state-action pairs to improve the performance of imitation learning policies.
arXiv Detail & Related papers (2025-05-28T17:45:05Z) - Asymmetric Co-Training for Source-Free Few-Shot Domain Adaptation [5.611768906855499]
We propose an asymmetric co-training (ACT) method specifically designed for the SFFSDA scenario.
We use a two-step optimization process to train the target model.
Our findings suggest that adapting a source pre-trained model using only a small amount of labeled target data offers a practical and dependable solution.
arXiv Detail & Related papers (2025-02-20T02:58:45Z) - Malicious URL Detection using optimized Hist Gradient Boosting Classifier based on grid search method [0.0]
Trusting the accuracy of data inputted on online platforms can be difficult due to the possibility of malicious websites gathering information for unlawful reasons.
To detect the risk posed by malicious websites, it is proposed to utilize Machine Learning (ML)-based techniques.
The dataset used contains 1781 records of malicious and benign website data with 13 features.
arXiv Detail & Related papers (2024-06-12T11:16:30Z) - An Unbiased Transformer Source Code Learning with Semantic Vulnerability
Graph [3.3598755777055374]
Current vulnerability screening techniques are ineffective at identifying novel vulnerabilities or providing developers with code vulnerability and classification.
To address these issues, we propose a joint multitasked unbiased vulnerability classifier comprising a transformer "RoBERTa" and graph convolution neural network (GCN).
We present a training process utilizing a semantic vulnerability graph (SVG) representation from source code, created by integrating edges from a sequential flow, control flow, and data flow, as well as a novel flow dubbed Poacher Flow (PF).
arXiv Detail & Related papers (2023-04-17T20:54:14Z) - Uncertainty-Aware Source-Free Adaptive Image Super-Resolution with Wavelet Augmentation Transformer [60.31021888394358]
Unsupervised Domain Adaptation (UDA) can effectively address domain gap issues in real-world image Super-Resolution (SR).
We propose a SOurce-free Domain Adaptation framework for image SR (SODA-SR) to address this issue, i.e., adapt a source-trained model to a target domain with only unlabeled target data.
arXiv Detail & Related papers (2023-03-31T03:14:44Z) - MAPS: A Noise-Robust Progressive Learning Approach for Source-Free
Domain Adaptive Keypoint Detection [76.97324120775475]
Cross-domain keypoint detection methods always require accessing the source data during adaptation.
This paper considers source-free domain adaptive keypoint detection, where only the well-trained source model is provided to the target domain.
arXiv Detail & Related papers (2023-02-09T12:06:08Z) - SRoUDA: Meta Self-training for Robust Unsupervised Domain Adaptation [25.939292305808934]
Unsupervised domain adaptation (UDA) can transfer knowledge learned from rich-label dataset to unlabeled target dataset.
In this paper, we present a new meta self-training pipeline, named SRoUDA, for improving adversarial robustness of UDA models.
arXiv Detail & Related papers (2022-12-12T14:25:40Z) - Gradient-based Data Subversion Attack Against Binary Classifiers [9.414651358362391]
In this work, we focus on label contamination attack in which an attacker poisons the labels of data to compromise the functionality of the system.
We exploit the gradients of a differentiable convex loss function with respect to the predicted label as a warm-start and formulate different strategies to find a set of data instances to contaminate.
Our experiments show that the proposed approach outperforms the baselines and is computationally efficient.
arXiv Detail & Related papers (2021-05-31T09:04:32Z) - Negative Data Augmentation [127.28042046152954]
We show that negative data augmentation samples provide information on the support of the data distribution.
We introduce a new GAN training objective where we use NDA as an additional source of synthetic data for the discriminator.
Empirically, models trained with our method achieve improved conditional/unconditional image generation along with improved anomaly detection capabilities.
arXiv Detail & Related papers (2021-02-09T20:28:35Z) - How Robust are Randomized Smoothing based Defenses to Data Poisoning? [66.80663779176979]
We present a previously unrecognized threat to robust machine learning models that highlights the importance of training-data quality.
We propose a novel bilevel optimization-based data poisoning attack that degrades the robustness guarantees of certifiably robust classifiers.
Our attack is effective even when the victim trains the models from scratch using state-of-the-art robust training methods.
arXiv Detail & Related papers (2020-12-02T15:30:21Z) - Diversity inducing Information Bottleneck in Model Ensembles [73.80615604822435]
In this paper, we target the problem of generating effective ensembles of neural networks by encouraging diversity in prediction.
We explicitly optimize a diversity inducing adversarial loss for learning latent variables and thereby obtain diversity in the output predictions necessary for modeling multi-modal data.
Compared to the most competitive baselines, we show significant improvements in classification accuracy, under a shift in the data distribution.
arXiv Detail & Related papers (2020-03-10T03:10:41Z)