VUDENC: Vulnerability Detection with Deep Learning on a Natural Codebase
for Python
- URL: http://arxiv.org/abs/2201.08441v1
- Date: Thu, 20 Jan 2022 20:29:22 GMT
- Authors: Laura Wartschinski, Yannic Noller, Thomas Vogel, Timo Kehrer, Lars
Grunske
- Abstract summary: VUDENC is a deep learning-based vulnerability detection tool.
It learns features of vulnerable code from a large and real-world Python corpus.
VUDENC achieves a recall of 78%-87%, a precision of 82%-96%, and an F1 score of 80%-90%.
- Score: 8.810543294798485
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Context: Identifying potentially vulnerable code is important to
improve the security of our software systems. However, the manual detection of
software vulnerabilities requires expert knowledge, is time-consuming, and must
therefore be supported by automated techniques. Objective: Such automated
vulnerability
detection techniques should achieve a high accuracy, point developers directly
to the vulnerable code fragments, scale to real-world software, generalize
across the boundaries of a specific software project, and require no or only
moderate setup or configuration effort. Method: In this article, we present
VUDENC (Vulnerability Detection with Deep Learning on a Natural Codebase), a
deep learning-based vulnerability detection tool that automatically learns
features of vulnerable code from a large and real-world Python codebase. VUDENC
applies a word2vec model to identify semantically similar code tokens and to
provide a vector representation. A long short-term memory (LSTM) network is
then used to classify vulnerable code token sequences at a
fine-grained level, highlight the specific areas in the source code that are
likely to contain vulnerabilities, and provide confidence levels for its
predictions. Results: To evaluate VUDENC, we used 1,009 vulnerability-fixing
commits from different GitHub repositories that contain seven different types
of vulnerabilities (SQL injection, XSS, Command injection, XSRF, Remote code
execution, Path disclosure, Open redirect) for training. In the experimental
evaluation, VUDENC achieves a recall of 78%-87%, a precision of 82%-96%, and an
F1 score of 80%-90%. VUDENC's code, the datasets for the vulnerabilities, and
the Python corpus for the word2vec model are available for reproduction.
Conclusions: Our experimental results suggest...
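As a rough illustration of VUDENC's fine-grained setup, the sketch below slides a fixed-size window over a code token stream and labels each window by overlap with a known-vulnerable span. The tokenizer, window size, and labeling rule are simplified assumptions for illustration, not VUDENC's exact settings; the tool itself feeds word2vec embeddings of such token windows into an LSTM classifier.

```python
import re

# Illustrative sketch of fine-grained sample creation: slide a window
# over a token stream so each window can be classified on its own.

def tokenize(code):
    """Very rough token split: identifiers or single non-space characters."""
    return re.findall(r"[A-Za-z_]\w*|\S", code)

def windows(tokens, size=5, step=1):
    """Yield (start_index, window) pairs of overlapping token windows."""
    for i in range(0, max(1, len(tokens) - size + 1), step):
        yield i, tokens[i:i + size]

def label_windows(tokens, vulnerable_span, size=5):
    """Label a window 1 if it overlaps the known-vulnerable token span."""
    lo, hi = vulnerable_span
    samples = []
    for i, win in windows(tokens, size):
        overlaps = i < hi and i + size > lo
        samples.append((win, int(overlaps)))
    return samples

code = 'cursor.execute("SELECT * FROM users WHERE name = " + name)'
toks = tokenize(code)
# Suppose the fix commit marked tokens 0-3 (the execute call) as vulnerable.
samples = label_windows(toks, vulnerable_span=(0, 4), size=5)
```

Each (window, label) pair would then be embedded and classified, which is what lets the tool highlight a specific token region rather than flag a whole file.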
Related papers
- Vulnerability Detection in C/C++ Code with Deep Learning [3.105656247358225]
We train neural networks with program slices extracted from the source code of C/C++ programs to detect software vulnerabilities.
Our results show that combining different types of source-code characteristics and using a balanced number of vulnerable and non-vulnerable program slices produces a balanced accuracy.
arXiv Detail & Related papers (2024-05-20T21:39:19Z)
- The Vulnerability Is in the Details: Locating Fine-grained Information of Vulnerable Code Identified by Graph-based Detectors [33.395068754566935]
VULEXPLAINER is a tool for locating vulnerability-critical code lines from coarse-level vulnerable code snippets.
It can flag the vulnerability-triggering code statements with an accuracy of around 90% against eight common C/C++ vulnerabilities.
arXiv Detail & Related papers (2024-01-05T10:15:04Z)
- Vulnerability Detection Using Two-Stage Deep Learning Models [0.0]
Two deep learning models were proposed for vulnerability detection in C/C++ source codes.
The first stage is a CNN that detects whether the source code contains any vulnerability.
The second stage is a CNN-LSTM that classifies this vulnerability into one of 50 vulnerability types.
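The two-stage scheme above can be sketched as a simple gating pipeline, where the type classifier only runs on code the first stage flags. The stand-in "models" below are trivial string checks used purely for illustration; the paper uses a trained CNN and CNN-LSTM.

```python
# Hedged sketch of two-stage detection: a binary detector gates a
# multi-class classifier, so type prediction only runs on flagged code.

def two_stage_classify(sample, detect, classify):
    """Return None if stage 1 sees no vulnerability, else the predicted type."""
    if not detect(sample):        # stage 1: binary screening
        return None
    return classify(sample)       # stage 2: vulnerability-type prediction

# Toy stand-ins for the two trained networks (assumptions for illustration):
detect = lambda s: "os.system" in s or "eval" in s
classify = lambda s: "command_injection" if "os.system" in s else "other"
```

The gating design means the cheaper binary check filters most inputs before the finer multi-class model is invoked.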
arXiv Detail & Related papers (2023-05-08T22:12:34Z)
- CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models [58.27254444280376]
Large language models (LLMs) for automatic code generation have achieved breakthroughs in several programming tasks.
Training data for these models is usually collected from the Internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities.
This unsanitized training data can cause the language models to learn these vulnerabilities and propagate them during the code generation procedure.
arXiv Detail & Related papers (2023-02-08T11:54:07Z)
- Pre-trained Encoders in Self-Supervised Learning Improve Secure and Privacy-preserving Supervised Learning [63.45532264721498]
Self-supervised learning is an emerging technique to pre-train encoders using unlabeled data.
We perform the first systematic, principled measurement study to understand whether and when a pre-trained encoder can address the limitations of secure or privacy-preserving supervised learning algorithms.
arXiv Detail & Related papers (2022-12-06T21:35:35Z)
- Deep-Learning-based Vulnerability Detection in Binary Executables [0.0]
We present a supervised deep learning approach using recurrent neural networks for the application of vulnerability detection based on binary executables.
A dataset with 50,651 samples of vulnerable code in the form of a standardized LLVM Intermediate Representation is used.
A binary classification was established for detecting the presence of an arbitrary vulnerability, and a multi-class model was trained for the identification of the exact vulnerability.
arXiv Detail & Related papers (2022-11-25T10:33:33Z)
- Statement-Level Vulnerability Detection: Learning Vulnerability Patterns Through Information Theory and Contrastive Learning [31.15123852246431]
We propose a novel end-to-end deep learning-based approach to identify the vulnerability-relevant code statements of a specific function.
Inspired by the structures observed in real-world vulnerable code, we first leverage mutual information for learning a set of latent variables.
We then propose novel clustered spatial contrastive learning in order to further improve the representation learning.
arXiv Detail & Related papers (2022-09-20T00:46:20Z)
- Revisiting Code Search in a Two-Stage Paradigm [67.02322603435628]
TOSS is a two-stage fusion code search framework.
It first uses IR-based and bi-encoder models to efficiently recall a small number of top-k code candidates.
It then uses fine-grained cross-encoders for finer ranking.
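A minimal sketch of this two-stage retrieval pattern: a cheap recall scorer shortlists top-k candidates, then a finer scorer reranks only that shortlist. Both scorers below are simple token-overlap stand-ins, not TOSS's actual IR/bi-encoder and cross-encoder models.

```python
# Two-stage search sketch: fast recall over the whole corpus, then a
# more careful rerank over just the shortlisted candidates.

def recall_stage(query, corpus, k=3):
    """Stage 1: fast lexical overlap to shortlist top-k candidates."""
    q = set(query.split())
    ranked = sorted(corpus, key=lambda c: -len(q & set(c.split())))
    return ranked[:k]

def rerank_stage(query, candidates):
    """Stage 2: finer, length-normalized (Jaccard) scoring of the shortlist."""
    q = set(query.split())
    def fine(c):
        toks = set(c.split())
        return len(q & toks) / max(1, len(q | toks))
    return sorted(candidates, key=fine, reverse=True)

corpus = [
    "def read file path",
    "def write file path data",
    "def sort list items",
    "def read config file path",
]
hits = rerank_stage("read file", recall_stage("read file", corpus, k=2))
```

The point of the split is cost: the expensive scorer only sees k candidates instead of the full corpus.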
arXiv Detail & Related papers (2022-08-24T02:34:27Z)
- VELVET: a noVel Ensemble Learning approach to automatically locate VulnErable sTatements [62.93814803258067]
This paper presents VELVET, a novel ensemble learning approach to locate vulnerable statements in source code.
Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph.
VELVET achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively.
arXiv Detail & Related papers (2021-12-20T22:45:27Z)
- Software Vulnerability Detection via Deep Learning over Disaggregated Code Graph Representation [57.92972327649165]
This work explores a deep learning approach to automatically learn the insecure patterns from code corpora.
Because parsed code naturally admits graph structures, we develop a novel graph neural network (GNN) to exploit both the semantic context and structural regularity of a program.
arXiv Detail & Related papers (2021-09-07T21:24:36Z)
- COSEA: Convolutional Code Search with Layer-wise Attention [90.35777733464354]
We propose a new deep learning architecture, COSEA, which leverages convolutional neural networks with layer-wise attention to capture the code's intrinsic structural logic.
COSEA can achieve significant improvements over state-of-the-art methods on code search tasks.
arXiv Detail & Related papers (2020-10-19T13:53:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.