Automated Software Vulnerability Static Code Analysis Using Generative Pre-Trained Transformer Models
- URL: http://arxiv.org/abs/2408.00197v1
- Date: Wed, 31 Jul 2024 23:33:26 GMT
- Title: Automated Software Vulnerability Static Code Analysis Using Generative Pre-Trained Transformer Models
- Authors: Elijah Pelofske, Vincent Urias, Lorie M. Liebrock
- Abstract summary: Generative Pre-Trained Transformer models have been shown to be surprisingly effective at a variety of natural language processing tasks.
We evaluate the effectiveness of open source GPT models for the task of automatic identification of the presence of vulnerable code syntax.
- Score: 0.8192907805418583
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generative Pre-Trained Transformer models have been shown to be surprisingly effective at a variety of natural language processing tasks -- including generating computer code. We evaluate the effectiveness of open source GPT models for the task of automatic identification of the presence of vulnerable code syntax (specifically targeting C and C++ source code). This task is evaluated on a selection of 36 source code examples from the NIST SARD dataset, which are specifically curated to not contain natural English that indicates the presence, or lack thereof, of a particular vulnerability. The NIST SARD source code dataset contains identified vulnerable lines of source code that are examples of one out of the 839 distinct Common Weakness Enumerations (CWE), allowing for exact quantification of the GPT output classification error rate. A total of 5 GPT models are evaluated, using 10 different inference temperatures and 100 repetitions at each setting, resulting in 5,000 GPT queries per vulnerable source code analyzed. Ultimately, we find that the GPT models that we evaluated are not suitable for fully automated vulnerability scanning because the false positive and false negative rates are too high to likely be useful in practice. However, we do find that the GPT models perform surprisingly well at automated vulnerability detection for some of the test cases; in particular, they surpass random sampling and can identify the exact lines of code that are vulnerable, albeit at a low success rate. The best-performing result was Llama-2-70b-chat-hf with an inference temperature of 0.1 applied to NIST SARD test case 149165 (an example of a buffer overflow vulnerability), which achieved a binary classification recall of 1.0 and a precision of 1.0 for correctly and uniquely identifying the vulnerable line of code and the correct CWE number.
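A minimal sketch of the scoring loop this protocol implies is given below. It is illustrative only, not the authors' code: query_model is a hypothetical stand-in for an open-source GPT inference call returning the set of line numbers the model flags as vulnerable, and the ground-truth line numbers are assumed to come from the NIST SARD labels.

```python
from typing import Callable, Dict, Set, Tuple

def line_precision_recall(flagged: Set[int], truth: Set[int]) -> Tuple[float, float]:
    """Binary precision/recall over line numbers flagged as vulnerable."""
    hits = len(flagged & truth)
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall

def evaluate_model(source_code: str,
                   truth: Set[int],
                   query_model: Callable[[str, float], Set[int]],
                   temperatures: Tuple[float, ...] = (0.1, 0.2, 0.3, 0.4, 0.5,
                                                      0.6, 0.7, 0.8, 0.9, 1.0),
                   repetitions: int = 100) -> Dict[Tuple[float, int], Tuple[float, float]]:
    """One model: 10 temperatures x 100 repetitions = 1,000 queries per test
    case; across the 5 evaluated models that is 5,000 queries per test case."""
    scores = {}
    for temp in temperatures:
        for rep in range(repetitions):
            flagged = query_model(source_code, temp)  # hypothetical LLM call
            scores[(temp, rep)] = line_precision_recall(flagged, truth)
    return scores
```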
Related papers
- Do Neutral Prompts Produce Insecure Code? FormAI-v2 Dataset: Labelling Vulnerabilities in Code Generated by Large Language Models [3.4887856546295333]
This study provides a comparative analysis of state-of-the-art large language models (LLMs).
We analyze how likely they are to generate vulnerabilities when writing simple C programs using a neutral zero-shot prompt.
arXiv Detail & Related papers (2024-04-29T01:24:14Z)
- Shifting the Lens: Detecting Malicious npm Packages using Large Language Models [4.479741014073169]
Existing malicious code detection techniques often suffer from high misclassification rates.
We present SecurityAI, a code review workflow that detects malicious code using ChatGPT.
Our baseline comparison demonstrates a 16% and 9% improvement over static analysis in precision and F1 scores.
arXiv Detail & Related papers (2024-03-18T19:10:12Z)
- VGX: Large-Scale Sample Generation for Boosting Learning-Based Software Vulnerability Analyses [30.65722096096949]
This paper proposes VGX, a new technique aimed at large-scale generation of high-quality vulnerability datasets.
VGX materializes vulnerability-injection code editing in identified contexts using patterns of such edits.
For in-the-wild sample production, VGX generated 150,392 vulnerable samples, from which we randomly chose 10% to assess how much these samples help vulnerability detection, localization, and repair.
arXiv Detail & Related papers (2023-10-24T01:05:00Z)
- Zero-Shot Detection of Machine-Generated Codes [83.0342513054389]
This work proposes a training-free approach for the detection of LLM-generated code.
We find that existing training-based or zero-shot text detectors are ineffective in detecting code.
Our method exhibits robustness against revision attacks and generalizes well to Java code.
arXiv Detail & Related papers (2023-10-08T10:08:21Z)
- Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
arXiv Detail & Related papers (2023-04-11T10:43:43Z)
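The generate-execute-refine loop described in the Self-Debugging entry above might look roughly like the following sketch; generate_code, run_tests, and explain_and_fix are hypothetical callables, not the paper's actual interface.

```python
from typing import Callable, Tuple

def self_debug(problem: str,
               generate_code: Callable[[str], str],
               run_tests: Callable[[str], Tuple[bool, str]],
               explain_and_fix: Callable[[str, str, str], str],
               max_rounds: int = 3) -> str:
    """Sketch of a Self-Debugging-style loop: generate, execute, refine."""
    program = generate_code(problem)          # initial LLM prediction
    for _ in range(max_rounds):
        ok, feedback = run_tests(program)     # run and collect error messages
        if ok:
            break
        # Ask the model to explain the failure and propose a fix, conditioned
        # on few-shot demonstrations of earlier (error -> fix) examples.
        program = explain_and_fix(problem, program, feedback)
    return program
```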
- Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense [56.077252790310176]
We present a paraphrase generation model (DIPPER) that can paraphrase paragraphs, condition on surrounding context, and control lexical diversity and content reordering.
Using DIPPER to paraphrase text generated by three large language models (including GPT3.5 text-davinci-003) successfully evades several detectors, including watermarking.
We introduce a simple defense that relies on retrieving semantically-similar generations and must be maintained by a language model API provider.
arXiv Detail & Related papers (2023-03-23T16:29:27Z)
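A bare-bones version of the retrieval defense described in the entry above (the API provider stores its generations and checks whether a candidate text is semantically close to any of them) could look like the following; embed is a hypothetical sentence-embedding function and the 0.9 threshold is illustrative.

```python
import numpy as np
from typing import Callable, List

def is_likely_ours(candidate: str,
                   stored_generations: List[str],
                   embed: Callable[[str], np.ndarray],
                   threshold: float = 0.9) -> bool:
    """Retrieval defense sketch: flag a text if it is semantically close to
    any generation the provider previously served, even after paraphrasing."""
    c = embed(candidate)
    c = c / np.linalg.norm(c)
    for text in stored_generations:
        g = embed(text)
        g = g / np.linalg.norm(g)
        if float(c @ g) >= threshold:  # cosine similarity of unit vectors
            return True
    return False
```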
- VMCDL: Vulnerability Mining Based on Cascaded Deep Learning Under Source Control Flow [2.561778620560749]
This paper mainly uses the C/C++ source code data of the SARD dataset, processing the source code of the CWE476, CWE469, CWE516, and CWE570 vulnerability types.
We propose VMCDL, a new cascaded deep learning model based on source code control flow, to effectively detect vulnerabilities.
arXiv Detail & Related papers (2023-03-13T13:58:39Z)
- DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature [143.5381108333212]
We show that text sampled from a large language model tends to occupy negative curvature regions of the model's log probability function.
We then define a new curvature-based criterion for judging if a passage is generated from a given LLM.
We find DetectGPT is more discriminative than existing zero-shot methods for model sample detection.
arXiv Detail & Related papers (2023-01-26T18:44:06Z)
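The curvature criterion summarized in the DetectGPT entry above amounts to comparing the log-probability of a passage with the mean log-probability of lightly perturbed rewrites of it. The sketch below assumes hypothetical log_prob and perturb callables (e.g., mask-and-fill rewrites); it follows the paper's perturbation-discrepancy idea but is not the reference implementation.

```python
from typing import Callable

def detectgpt_score(passage: str,
                    log_prob: Callable[[str], float],
                    perturb: Callable[[str], str],
                    num_perturbations: int = 100) -> float:
    """Perturbation discrepancy d(x) = log p(x) - mean_i log p(x_i~).
    Model-generated text tends to sit near a local maximum (negative
    curvature) of log p, so d(x) is large for model samples."""
    perturbed = [perturb(passage) for _ in range(num_perturbations)]
    mean_perturbed = sum(log_prob(p) for p in perturbed) / num_perturbations
    return log_prob(passage) - mean_perturbed  # large => likely model-generated
```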
- Fault-Aware Neural Code Rankers [64.41888054066861]
We propose fault-aware neural code rankers that can predict the correctness of a sampled program without executing it.
Our fault-aware rankers can significantly increase the pass@1 accuracy of various code generation models.
arXiv Detail & Related papers (2022-06-04T22:01:05Z)
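How a learned ranker can lift pass@1, as in the Fault-Aware Neural Code Rankers entry above, fits in a few lines: return the candidate the ranker scores as most likely correct rather than a random sample, then measure pass@1 on that single choice. rank_score and passes_tests are hypothetical stand-ins.

```python
from typing import Callable, List

def rerank_pass_at_1(candidates: List[str],
                     rank_score: Callable[[str], float],
                     passes_tests: Callable[[str], bool]) -> bool:
    """Pick the candidate the ranker judges most likely correct (without
    executing it), then check whether that single choice passes the tests."""
    best = max(candidates, key=rank_score)  # ranker predicts correctness
    return passes_tests(best)               # pass@1 for the reranked output
```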
- VELVET: a noVel Ensemble Learning approach to automatically locate VulnErable sTatements [62.93814803258067]
This paper presents VELVET, a novel ensemble learning approach to locate vulnerable statements in source code.
Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph.
VELVET achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively.
arXiv Detail & Related papers (2021-12-20T22:45:27Z)
- Security Vulnerability Detection Using Deep Learning Natural Language Processing [1.4591078795663772]
We model software vulnerability detection as a natural language processing (NLP) problem with source code treated as texts.
For training and testing, we have built a dataset of over 100,000 files in the C programming language with 123 types of vulnerabilities.
Experiments achieve a best performance of over 93% accuracy in detecting security vulnerabilities.
arXiv Detail & Related papers (2021-05-06T01:28:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.