VGX: Large-Scale Sample Generation for Boosting Learning-Based Software
Vulnerability Analyses
- URL: http://arxiv.org/abs/2310.15436v2
- Date: Thu, 4 Jan 2024 20:01:55 GMT
- Title: VGX: Large-Scale Sample Generation for Boosting Learning-Based Software
Vulnerability Analyses
- Authors: Yu Nong, Richard Fang, Guangbei Yi, Kunsong Zhao, Xiapu Luo, Feng
Chen, and Haipeng Cai
- Abstract summary: This paper proposes VGX, a new technique aimed at large-scale generation of high-quality vulnerability datasets.
VGX materializes vulnerability-injection code editing in identified contexts using patterns of such edits.
For in-the-wild sample production, VGX generated 150,392 vulnerable samples, from which we randomly chose 10% to assess how much these samples help vulnerability detection, localization, and repair.
- Score: 30.65722096096949
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accompanying the successes of learning-based defensive software vulnerability
analyses is the lack of large, high-quality sets of labeled vulnerable program
samples, which impedes further advancement of those defenses. Existing
automated sample generation approaches have shown potential yet still fall
short of practical expectations due to the high noise in the generated samples.
This paper proposes VGX, a new technique aimed at large-scale generation of
high-quality vulnerability datasets. Given a normal program, VGX identifies the
code contexts in which vulnerabilities can be injected, using a customized
Transformer featuring a new value-flow-based position encoding and pre-trained
against new objectives specifically for learning code structure and context.
Then, VGX materializes vulnerability-injection code editing in the identified
contexts using patterns of such edits obtained from both historical fixes and
human knowledge about real-world vulnerabilities. Compared to four
state-of-the-art (SOTA) baselines (pattern-, Transformer-, GNN-, and
pattern+Transformer-based), VGX achieved 99.09%-890.06% higher F1 and
22.45%-328.47% higher label accuracy. For in-the-wild sample production, VGX
generated 150,392 vulnerable samples, from which we randomly chose 10% to
assess how much these samples help vulnerability detection, localization, and
repair. Our results show SOTA techniques for these three application tasks
achieved 19.15-330.80% higher F1, 12.86-19.31% higher top-10 accuracy, and
85.02-99.30% higher top-50 accuracy, respectively, by adding those samples to
their original training data. These samples also helped a SOTA vulnerability
detector discover 13 more real-world vulnerabilities (CVEs) in critical systems
(e.g., Linux kernel) that would be missed by the original model.
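To make the injection step concrete, here is a minimal, hypothetical sketch of pattern-based vulnerability injection: a single edit pattern (the inverse of an "add a NULL check" fix) is matched in a code context and removed to yield a labeled vulnerable variant. The regex pattern, function names, and example code are illustrative assumptions, not VGX's actual context model or edit patterns.

```python
import re

# Hypothetical edit pattern: the inverse of a common historical fix
# ("add a NULL check before dereferencing"). Deleting the guard injects
# a potential NULL-pointer dereference.
NULL_CHECK_GUARD = re.compile(
    r"if\s*\(\s*\w+\s*==\s*NULL\s*\)\s*\{\s*return[^;]*;\s*\}\s*"
)

def identify_context(func_src: str) -> bool:
    """Stand-in for VGX's Transformer-based context identification:
    here we only check whether the guard pattern occurs at all."""
    return NULL_CHECK_GUARD.search(func_src) is not None

def inject_vulnerability(func_src: str) -> str | None:
    """Materialize the edit: remove the guard so a later dereference
    becomes reachable with a NULL argument."""
    if not identify_context(func_src):
        return None
    return NULL_CHECK_GUARD.sub("", func_src, count=1)

normal = """\
int get_len(struct buf *b) {
    if (b == NULL) { return -1; }
    return b->len;
}
"""

vulnerable = inject_vulnerability(normal)
if vulnerable is not None:
    sample = {"code": vulnerable, "label": 1}  # labeled vulnerable sample
    print(sample["code"])
```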
Related papers
- CleanVul: Automatic Function-Level Vulnerability Detection in Code Commits Using LLM Heuristics [12.053158610054911]
This paper introduces the first methodology that uses a Large Language Model (LLM) with a heuristic enhancement to automatically identify vulnerability-fixing changes within vulnerability-fixing commits (VFCs).
VulSifter was applied to a large-scale study, where we conducted a crawl of 127,063 repositories on GitHub, resulting in the acquisition of 5,352,105 commits.
We then developed CleanVul, a high-quality dataset comprising 11,632 functions using our LLM enhancement approach.
arXiv Detail & Related papers (2024-11-26T09:51:55Z) - Exploring RAG-based Vulnerability Augmentation with LLMs [19.45598962972431]
Large language models (LLMs) are being used for solving various code generation and comprehension tasks.
In this study, we explore three different strategies to augment vulnerabilities with LLMs, namely Mutation, Injection, and Extension.
Our results show that our injection-based, clustering-enhanced RAG method beats the baseline setting (NoAug), VulGen and VGX (two SOTA methods), and Random Oversampling (ROS) by 30.80%, 27.48%, 27.93%, and 15.41% in F1-score with 5K generated vulnerable samples.
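As a rough illustration of the retrieve-then-prompt idea behind such augmentation (not the paper's exact pipeline), the sketch below clusters known vulnerable functions, retrieves examples from the cluster nearest to a clean function, and asks an LLM to inject a comparable flaw. The TF-IDF features, cluster count, prompt wording, and the call_llm stub are all placeholder assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def call_llm(prompt: str) -> str:
    # Placeholder for an actual LLM API call.
    raise NotImplementedError("plug in your LLM client here")

def build_prompt(clean_func: str, examples: list[str]) -> str:
    shots = "\n\n".join(examples)
    return (f"Vulnerable functions retrieved as references:\n{shots}\n\n"
            f"Inject a comparable vulnerability into this function:\n{clean_func}")

def augment(clean_func: str, vuln_corpus: list[str], k: int = 8) -> str:
    # Cluster the vulnerable corpus and retrieve examples from the
    # cluster closest to the clean function (clustering-enhanced RAG).
    vec = TfidfVectorizer().fit(vuln_corpus + [clean_func])
    km = KMeans(n_clusters=k, n_init=10).fit(vec.transform(vuln_corpus))
    target = km.predict(vec.transform([clean_func]))[0]
    examples = [f for f, c in zip(vuln_corpus, km.labels_) if c == target][:3]
    return call_llm(build_prompt(clean_func, examples))
```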
arXiv Detail & Related papers (2024-08-07T23:22:58Z) - Automated Software Vulnerability Static Code Analysis Using Generative Pre-Trained Transformer Models [0.8192907805418583]
Generative Pre-Trained Transformer models have been shown to be surprisingly effective at a variety of natural language processing tasks.
We evaluate the effectiveness of open source GPT models for the task of automatic identification of the presence of vulnerable code syntax.
arXiv Detail & Related papers (2024-07-31T23:33:26Z) - Vulnerability Detection with Code Language Models: How Far Are We? [40.455600722638906]
PrimeVul is a new dataset for training and evaluating code LMs for vulnerability detection.
It incorporates a novel set of data labeling techniques that achieve comparable label accuracy to human-verified benchmarks.
It also implements a rigorous data de-duplication and chronological data splitting strategy to mitigate data leakage issues.
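A minimal sketch of what de-duplication plus chronological splitting can look like in practice is shown below; the field names, normalization, and cutoff date are assumptions for illustration, and PrimeVul's actual procedures are more involved.

```python
import hashlib
from datetime import datetime

def normalize(code: str) -> str:
    # Crude normalization: collapse whitespace so formatting-only
    # clones hash to the same key.
    return " ".join(code.split())

def dedup_and_split(samples, cutoff="2023-01-01"):
    """samples: dicts with 'code' and an ISO-formatted 'commit_date'."""
    seen, unique = set(), []
    for s in samples:
        key = hashlib.sha256(normalize(s["code"]).encode()).hexdigest()
        if key not in seen:          # drop duplicate functions
            seen.add(key)
            unique.append(s)
    cut = datetime.fromisoformat(cutoff)
    train = [s for s in unique if datetime.fromisoformat(s["commit_date"]) < cut]
    test = [s for s in unique if datetime.fromisoformat(s["commit_date"]) >= cut]
    return train, test   # older commits train, newer commits evaluate
```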
arXiv Detail & Related papers (2024-03-27T14:34:29Z) - FaultGuard: A Generative Approach to Resilient Fault Prediction in Smart Electrical Grids [53.2306792009435]
FaultGuard is the first framework for fault type and zone classification resilient to adversarial attacks.
We propose a low-complexity fault prediction model and an online adversarial training technique to enhance robustness.
Our model outperforms the state of the art in resilient fault prediction benchmarks, with an accuracy of up to 0.958.
arXiv Detail & Related papers (2024-03-26T08:51:23Z) - Text generation for dataset augmentation in security classification
tasks [55.70844429868403]
This study evaluates the application of natural language text generators to fill the data gap in multiple security-related text classification tasks.
We find substantial benefits for GPT-3 data augmentation strategies in situations with severe limitations on known positive-class samples.
arXiv Detail & Related papers (2023-10-22T22:25:14Z) - An Unbiased Transformer Source Code Learning with Semantic Vulnerability
Graph [3.3598755777055374]
Current vulnerability screening techniques are ineffective at identifying novel vulnerabilities or providing developers with code vulnerability classification.
To address these issues, we propose a joint multitasked unbiased vulnerability classifier comprising a Transformer ("RoBERTa") and a graph convolutional network (GCN).
We present a training process utilizing a semantic vulnerability graph (SVG) representation of source code, created by integrating edges from sequential flow, control flow, and data flow, as well as a novel flow dubbed Poacher Flow (PF).
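For intuition, a toy construction of a statement-level graph with typed edges (sequential, control, and data flow) might look like the following; real SVG extraction, including Poacher Flow edges, requires proper program analysis of the source code and is not reproduced here.

```python
import networkx as nx

def build_semantic_graph(statements, control_edges, data_edges):
    """Toy statement-level graph with typed edges; indices into
    `statements` act as node IDs."""
    g = nx.MultiDiGraph()
    for i, stmt in enumerate(statements):
        g.add_node(i, text=stmt)
    for i in range(len(statements) - 1):   # sequential flow
        g.add_edge(i, i + 1, etype="seq")
    for src, dst in control_edges:         # control flow
        g.add_edge(src, dst, etype="control")
    for src, dst in data_edges:            # data flow (def-use)
        g.add_edge(src, dst, etype="data")
    return g

g = build_semantic_graph(
    ["int n = read_input();", "if (n > 0)", "buf[n] = 0;"],
    control_edges=[(1, 2)],
    data_edges=[(0, 1), (0, 2)],
)
print(g.number_of_edges())  # 2 sequential + 1 control + 2 data = 5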
arXiv Detail & Related papers (2023-04-17T20:54:14Z) - VELVET: a noVel Ensemble Learning approach to automatically locate
VulnErable sTatements [62.93814803258067]
This paper presents VELVET, a novel ensemble learning approach to locate vulnerable statements in source code.
Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph.
VELVET achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively.
arXiv Detail & Related papers (2021-12-20T22:45:27Z) - CC-Cert: A Probabilistic Approach to Certify General Robustness of
Neural Networks [58.29502185344086]
In safety-critical machine learning applications, it is crucial to defend models against adversarial attacks.
It is important to provide provable guarantees for deep learning models against semantically meaningful input transformations.
We propose a new universal probabilistic certification approach based on Chernoff-Cramer bounds.
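For reference, certificates of this kind build on the generic Chernoff (Cramér) bound below, where $X$ stands for a statistic of the model's behavior under randomly sampled input transformations and $a$ is the failure threshold; the paper's concrete instantiation differs in its details.

$$\Pr[X \ge a] \;\le\; \inf_{t>0} e^{-ta}\,\mathbb{E}\!\left[e^{tX}\right]$$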
arXiv Detail & Related papers (2021-09-22T12:46:04Z) - Semantic Perturbations with Normalizing Flows for Improved
Generalization [62.998818375912506]
We show that perturbations in the latent space can be used to define fully unsupervised data augmentations.
We find that latent adversarial perturbations that adapt to the classifier throughout its training are most effective.
arXiv Detail & Related papers (2021-08-18T03:20:00Z) - How Robust are Randomized Smoothing based Defenses to Data Poisoning? [66.80663779176979]
We present a previously unrecognized threat to robust machine learning models that highlights the importance of training-data quality.
We propose a novel bilevel optimization-based data poisoning attack that degrades the robustness guarantees of certifiably robust classifiers.
Our attack is effective even when the victim trains the models from scratch using state-of-the-art robust training methods.
arXiv Detail & Related papers (2020-12-02T15:30:21Z)