Automated Software Vulnerability Static Code Analysis Using Generative Pre-Trained Transformer Models
- URL: http://arxiv.org/abs/2408.00197v1
- Date: Wed, 31 Jul 2024 23:33:26 GMT
- Title: Automated Software Vulnerability Static Code Analysis Using Generative Pre-Trained Transformer Models
- Authors: Elijah Pelofske, Vincent Urias, Lorie M. Liebrock
- Abstract summary: Generative Pre-Trained Transformer models have been shown to be surprisingly effective at a variety of natural language processing tasks.
We evaluate the effectiveness of open source GPT models for the task of automatic identification of the presence of vulnerable code syntax.
- Score: 0.8192907805418583
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative Pre-Trained Transformer models have been shown to be surprisingly effective at a variety of natural language processing tasks -- including generating computer code. We evaluate the effectiveness of open source GPT models for the task of automatic identification of the presence of vulnerable code syntax (specifically targeting C and C++ source code). This task is evaluated on a selection of 36 source code examples from the NIST SARD dataset, which are specifically curated to not contain natural English that indicates the presence, or lack thereof, of a particular vulnerability. The NIST SARD source code dataset contains identified vulnerable lines of source code that are examples of one out of the 839 distinct Common Weakness Enumerations (CWE), allowing for exact quantification of the GPT output classification error rate. A total of 5 GPT models are evaluated, using 10 different inference temperatures and 100 repetitions at each setting, resulting in 5,000 GPT queries per vulnerable source code analyzed. Ultimately, we find that the GPT models that we evaluated are not suitable for fully automated vulnerability scanning because the false positive and false negative rates are too high to likely be useful in practice. However, we do find that the GPT models perform surprisingly well at automated vulnerability detection for some of the test cases; in particular, they surpass random sampling and can identify the exact lines of code that are vulnerable, albeit at a low success rate. The best-performing result was Llama-2-70b-chat-hf with an inference temperature of 0.1 applied to NIST SARD test case 149165 (an example of a buffer overflow vulnerability), which achieved a binary classification recall of 1.0 and a precision of 1.0 for correctly and uniquely identifying the vulnerable line of code and the correct CWE number.
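A minimal sketch of the scoring loop this protocol implies is given below. It is illustrative only, not the authors' code: query_model is a hypothetical stand-in for an open-source GPT inference call returning the set of line numbers the model flags as vulnerable, and the ground-truth line numbers are assumed to come from the NIST SARD labels.

```python
from typing import Callable, Dict, Set, Tuple

def line_precision_recall(flagged: Set[int], truth: Set[int]) -> Tuple[float, float]:
    """Binary precision/recall over line numbers flagged as vulnerable."""
    hits = len(flagged & truth)
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall

def evaluate_model(source_code: str,
                   truth: Set[int],
                   query_model: Callable[[str, float], Set[int]],
                   temperatures: Tuple[float, ...] = (0.1, 0.2, 0.3, 0.4, 0.5,
                                                      0.6, 0.7, 0.8, 0.9, 1.0),
                   repetitions: int = 100) -> Dict[Tuple[float, int], Tuple[float, float]]:
    """One model: 10 temperatures x 100 repetitions = 1,000 queries per test
    case; across the 5 evaluated models that is 5,000 queries per test case."""
    scores = {}
    for temp in temperatures:
        for rep in range(repetitions):
            flagged = query_model(source_code, temp)  # hypothetical LLM call
            scores[(temp, rep)] = line_precision_recall(flagged, truth)
    return scores
```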
Related papers
- Do Neutral Prompts Produce Insecure Code? FormAI-v2 Dataset: Labelling Vulnerabilities in Code Generated by Large Language Models [3.4887856546295333]
This study provides a comparative analysis of state-of-the-art large language models (LLMs).
We analyze how likely they are to generate vulnerabilities when writing simple C programs using a neutral zero-shot prompt.
arXiv Detail & Related papers (2024-04-29T01:24:14Z)
- Shifting the Lens: Detecting Malicious npm Packages using Large Language Models [4.479741014073169]
Existing malicious code detection techniques often suffer from high misclassification rates.
We present SecurityAI, a code review workflow that detects malicious code using ChatGPT.
Our baseline comparison demonstrates a 16% and 9% improvement over static analysis in precision and F1 scores.
arXiv Detail & Related papers (2024-03-18T19:10:12Z)
- VGX: Large-Scale Sample Generation for Boosting Learning-Based Software Vulnerability Analyses [30.65722096096949]
This paper proposes VGX, a new technique aimed at large-scale generation of high-quality vulnerability datasets.
VGX materializes vulnerability-injection code editing in identified contexts using patterns of such edits.
For in-the-wild sample production, VGX generated 150,392 vulnerable samples, from which we randomly chose 10% to assess how much these samples help vulnerability detection, localization, and repair.
arXiv Detail & Related papers (2023-10-24T01:05:00Z)
- Zero-Shot Detection of Machine-Generated Codes [83.0342513054389]
This work proposes a training-free approach for the detection of LLM-generated code.
We find that existing training-based or zero-shot text detectors are ineffective in detecting code.
Our method exhibits robustness against revision attacks and generalizes well to Java code.
arXiv Detail & Related papers (2023-10-08T10:08:21Z)
- Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
arXiv Detail & Related papers (2023-04-11T10:43:43Z)
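The generate-execute-refine loop described in the Self-Debugging entry above might look roughly like the following sketch; generate_code, run_tests, and explain_and_fix are hypothetical callables, not the paper's actual interface.

```python
from typing import Callable, Tuple

def self_debug(problem: str,
               generate_code: Callable[[str], str],
               run_tests: Callable[[str], Tuple[bool, str]],
               explain_and_fix: Callable[[str, str, str], str],
               max_rounds: int = 3) -> str:
    """Sketch of a Self-Debugging-style loop: generate, execute, refine."""
    program = generate_code(problem)          # initial LLM prediction
    for _ in range(max_rounds):
        ok, feedback = run_tests(program)     # run and collect error messages
        if ok:
            break
        # Ask the model to explain the failure and propose a fix, conditioned
        # on few-shot demonstrations of earlier (error -> fix) examples.
        program = explain_and_fix(problem, program, feedback)
    return program
```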
- Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense [56.077252790310176]
We present a paraphrase generation model (DIPPER) that can paraphrase paragraphs, condition on surrounding context, and control lexical diversity and content reordering.
Using DIPPER to paraphrase text generated by three large language models (including GPT3.5 text-davinci-003) successfully evades several detectors, including watermarking.
We introduce a simple defense that relies on retrieving semantically-similar generations and must be maintained by a language model API provider.
arXiv Detail & Related papers (2023-03-23T16:29:27Z)
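A bare-bones version of the retrieval defense described in the entry above (the API provider stores its generations and checks whether a candidate text is semantically close to any of them) could look like the following; embed is a hypothetical sentence-embedding function and the 0.9 threshold is illustrative.

```python
import numpy as np
from typing import Callable, List

def is_likely_ours(candidate: str,
                   stored_generations: List[str],
                   embed: Callable[[str], np.ndarray],
                   threshold: float = 0.9) -> bool:
    """Retrieval defense sketch: flag a text if it is semantically close to
    any generation the provider previously served, even after paraphrasing."""
    c = embed(candidate)
    c = c / np.linalg.norm(c)
    for text in stored_generations:
        g = embed(text)
        g = g / np.linalg.norm(g)
        if float(c @ g) >= threshold:  # cosine similarity of unit vectors
            return True
    return False
```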
- VMCDL: Vulnerability Mining Based on Cascaded Deep Learning Under Source Control Flow [2.561778620560749]
This paper mainly uses the C/C++ source code data of the SARD dataset, processing the source code of the CWE476, CWE469, CWE516, and CWE570 vulnerability types.
We propose VMCDL, a new cascaded deep learning model based on source code control flow, to effectively detect vulnerabilities.
arXiv Detail & Related papers (2023-03-13T13:58:39Z)
- DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature [143.5381108333212]
We show that text sampled from a large language model tends to occupy negative curvature regions of the model's log probability function.
We then define a new curvature-based criterion for judging if a passage is generated from a given LLM.
We find DetectGPT is more discriminative than existing zero-shot methods for model sample detection.
arXiv Detail & Related papers (2023-01-26T18:44:06Z)
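The curvature criterion summarized in the DetectGPT entry above amounts to comparing the log-probability of a passage with the mean log-probability of lightly perturbed rewrites of it. The sketch below assumes hypothetical log_prob and perturb callables (e.g., mask-and-fill rewrites); it follows the paper's perturbation-discrepancy idea but is not the reference implementation.

```python
from typing import Callable

def detectgpt_score(passage: str,
                    log_prob: Callable[[str], float],
                    perturb: Callable[[str], str],
                    num_perturbations: int = 100) -> float:
    """Perturbation discrepancy d(x) = log p(x) - mean_i log p(x_i~).
    Model-generated text tends to sit near a local maximum (negative
    curvature) of log p, so d(x) is large for model samples."""
    perturbed = [perturb(passage) for _ in range(num_perturbations)]
    mean_perturbed = sum(log_prob(p) for p in perturbed) / num_perturbations
    return log_prob(passage) - mean_perturbed  # large => likely model-generated
```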
- Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it.
Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
arXiv Detail & Related papers (2022-06-04T22:01:05Z)
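How a learned ranker can lift pass@1, as in the Fault-Aware Neural Code Rankers entry above, fits in a few lines: return the candidate the ranker scores as most likely correct rather than a random sample, then measure pass@1 on that single choice. rank_score and passes_tests are hypothetical stand-ins.

```python
from typing import Callable, List

def rerank_pass_at_1(candidates: List[str],
                     rank_score: Callable[[str], float],
                     passes_tests: Callable[[str], bool]) -> bool:
    """Pick the candidate the ranker judges most likely correct (without
    executing it), then check whether that single choice passes the tests."""
    best = max(candidates, key=rank_score)  # ranker predicts correctness
    return passes_tests(best)               # pass@1 for the reranked output
```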
- VELVET: a noVel Ensemble Learning approach to automatically locate VulnErable sTatements [62.93814803258067]
This paper presents VELVET, a novel ensemble learning approach to locate vulnerable statements in source code.
Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph.
VELVET achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively.
arXiv Detail & Related papers (2021-12-20T22:45:27Z)
- Security Vulnerability Detection Using Deep Learning Natural Language Processing [1.4591078795663772]
We model software vulnerability detection as a natural language processing (NLP) problem with source code treated as texts.
For training and testing, we have built a dataset of over 100,000 files in the C programming language with 123 types of vulnerabilities.
Experiments achieve a best performance of over 93% accuracy in detecting security vulnerabilities.
arXiv Detail & Related papers (2021-05-06T01:28:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.