Security Vulnerability Detection Using Deep Learning Natural Language
Processing
- URL: http://arxiv.org/abs/2105.02388v1
- Date: Thu, 6 May 2021 01:28:21 GMT
- Title: Security Vulnerability Detection Using Deep Learning Natural Language
Processing
- Authors: Noah Ziems, Shaoen Wu
- Abstract summary: We model software vulnerability detection as a natural language processing (NLP) problem with source code treated as text.
For training and testing, we have built a dataset of over 100,000 files in the C programming language with 123 types of vulnerabilities.
In experiments, the best performance is over 93% accuracy in detecting security vulnerabilities.
- Score: 1.4591078795663772
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Detecting security vulnerabilities in software before they are exploited has
been a challenging problem for decades. Traditional code analysis methods have
been proposed, but are often ineffective and inefficient. In this work, we
model software vulnerability detection as a natural language processing (NLP)
problem with source code treated as text, and address automated software
vulnerability detection with recent advanced deep learning NLP models assisted
by transfer learning on written English. For training and testing, we have
preprocessed the NIST NVD/SARD databases and built a dataset of over 100,000
files in the C programming language with 123 types of vulnerabilities. The
extensive experiments generate the best performance of over 93% accuracy in
detecting security vulnerabilities.
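As a rough illustration of what "source code treated as text" can mean in practice, the sketch below splits a C snippet into a flat token sequence, maps tokens to integer ids as a sequence model would expect, and computes plain classification accuracy. It is not the authors' pipeline: the token pattern, the function names (tokenize_c, build_vocab, encode, accuracy) and the sample snippet are all assumptions made for illustration.

```python
import re
from typing import Dict, List

# Assumed token pattern (not from the paper): identifiers, integers,
# double-quoted string literals, then any single non-space character.
TOKEN_RE = re.compile(r'[A-Za-z_]\w*|\d+|"(?:\\.|[^"\\])*"|\S')


def tokenize_c(source: str) -> List[str]:
    """Split C source into a flat token sequence, as if it were a sentence."""
    return TOKEN_RE.findall(source)


def build_vocab(token_lists: List[List[str]]) -> Dict[str, int]:
    """Assign each distinct token an integer id; id 0 is reserved for unknowns."""
    vocab = {"<unk>": 0}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map tokens to ids, falling back to the unknown id for unseen tokens."""
    return [vocab.get(tok, 0) for tok in tokens]


def accuracy(predicted: List[int], actual: List[int]) -> float:
    """Fraction of files whose predicted label matches the true label."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical sample: an unchecked copy into a fixed-size buffer.
sample = 'char buf[8]; strcpy(buf, input);'
tokens = tokenize_c(sample)
vocab = build_vocab([tokens])
ids = encode(tokens, vocab)
```

The resulting id sequences are the kind of input a text classifier could be trained on, with one label per file (vulnerable or not, or a vulnerability type).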
Related papers
- Secret Breach Prevention in Software Issue Reports [2.8747015994080285]
This paper presents a novel technique for secret breach detection in software issue reports.
We highlight the challenges posed by noise, such as log files, URLs, commit IDs, stack traces, and dummy passwords.
We propose an approach combining the strengths of state-of-the-art methods with the contextual understanding of language models.
arXiv Detail & Related papers (2024-10-31T06:14:17Z)
- RealVul: Can We Detect Vulnerabilities in Web Applications with LLM? [4.467475584754677]
We present RealVul, the first LLM-based framework designed for PHP vulnerability detection.
We can isolate potential vulnerability triggers while streamlining the code and eliminating unnecessary semantic information.
We also address the issue of insufficient PHP vulnerability samples by improving data synthesis methods.
arXiv Detail & Related papers (2024-10-10T03:16:34Z)
- Automated Software Vulnerability Static Code Analysis Using Generative Pre-Trained Transformer Models [0.8192907805418583]
Generative Pre-Trained Transformer models have been shown to be surprisingly effective at a variety of natural language processing tasks.
We evaluate the effectiveness of open source GPT models for the task of automatic identification of the presence of vulnerable code syntax.
arXiv Detail & Related papers (2024-07-31T23:33:26Z)
- CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion [117.178835165855]
This paper introduces CodeAttack, a framework that transforms natural language inputs into code inputs.
Our studies reveal a new and universal safety vulnerability of these models against code input.
We find that a larger distribution gap between CodeAttack and natural language leads to weaker safety generalization.
arXiv Detail & Related papers (2024-03-12T17:55:38Z)
- Zero-Shot Detection of Machine-Generated Codes [83.0342513054389]
This work proposes a training-free approach for the detection of LLM-generated code.
We find that existing training-based or zero-shot text detectors are ineffective in detecting code.
Our method is robust against revision attacks and generalizes well to Java code.
arXiv Detail & Related papers (2023-10-08T10:08:21Z)
- The FormAI Dataset: Generative AI in Software Security Through the Lens of Formal Verification [3.2925005312612323]
This paper presents the FormAI dataset, a large collection of 112,000 AI-generated C programs with vulnerability classification.
Every program is labeled with the vulnerabilities found within the source code, indicating the type, line number, and vulnerable function name.
We make the source code available for the 112,000 programs, accompanied by a separate file containing the vulnerabilities detected in each program.
arXiv Detail & Related papers (2023-07-05T10:39:58Z)
- CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models [58.27254444280376]
Large language models (LLMs) for automatic code generation have achieved breakthroughs in several programming tasks.
Training data for these models is usually collected from the Internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities.
This unsanitized training data can cause the language models to learn these vulnerabilities and propagate them during the code generation procedure.
arXiv Detail & Related papers (2023-02-08T11:54:07Z)
- On the Security Vulnerabilities of Text-to-SQL Models [34.749129843281196]
We show that modules within six commercial applications can be manipulated to produce malicious code.
This is the first demonstration that NLP models can be exploited as attack vectors in the wild.
The aim of this work is to draw the community's attention to potential software security issues associated with NLP algorithms.
arXiv Detail & Related papers (2022-11-28T14:38:45Z)
- VELVET: a noVel Ensemble Learning approach to automatically locate VulnErable sTatements [62.93814803258067]
This paper presents VELVET, a novel ensemble learning approach to locate vulnerable statements in source code.
Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph.
VELVET achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively.
arXiv Detail & Related papers (2021-12-20T22:45:27Z)
- Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z)
- Exploring Software Naturalness through Neural Language Models [56.1315223210742]
The Software Naturalness hypothesis argues that programming languages can be understood through the same techniques used in natural language processing.
We explore this hypothesis through the use of a pre-trained transformer-based language model to perform code analysis tasks.
arXiv Detail & Related papers (2020-06-22T21:56:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.