DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based
Vulnerability Detection
- URL: http://arxiv.org/abs/2304.00409v2
- Date: Wed, 9 Aug 2023 01:21:50 GMT
- Title: DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based
Vulnerability Detection
- Authors: Yizheng Chen, Zhoujie Ding, Lamya Alowain, Xinyun Chen, David Wagner
- Abstract summary: This dataset contains 18,945 vulnerable functions spanning 150 CWEs and 330,492 non-vulnerable functions extracted from 7,514 commits.
Our results show that deep learning is still not ready for vulnerability detection, due to high false positive rates, low F1 scores, and difficulty detecting hard CWEs.
We demonstrate that large language models (LLMs) are a promising research direction for ML-based vulnerability detection, outperforming Graph Neural Networks (GNNs) with code-structure features.
- Score: 29.52887618905746
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose and release a new vulnerable source code dataset. We curate
the dataset by crawling security issue websites and extracting vulnerability-fixing
commits and source code from the corresponding projects. Our new dataset
contains 18,945 vulnerable functions spanning 150 CWEs and 330,492
non-vulnerable functions extracted from 7,514 commits. Our dataset covers 295
more projects than all previous datasets combined.
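To make the commit-based labeling concrete, here is a minimal sketch of the idea: functions changed by a vulnerability-fixing commit are labeled vulnerable (in their pre-fix form), and untouched functions in the same files are labeled non-vulnerable. The function name and dict-based file representation are illustrative assumptions, not the authors' actual crawler.

```python
# Minimal sketch of commit-based labeling (illustrative, not the paper's pipeline).
# pre_fix / post_fix map function name -> source text for a file touched by a
# vulnerability-fixing commit.

def label_functions(pre_fix: dict, post_fix: dict) -> list:
    samples = []
    for name, body in pre_fix.items():
        # A function modified or removed by the fix is labeled vulnerable
        # in its pre-fix form; everything else is labeled non-vulnerable.
        changed = post_fix.get(name) != body
        samples.append({"function": name, "code": body, "label": int(changed)})
    return samples

pre = {"copy": "void copy(char *d, char *s) { strcpy(d, s); }",
       "size": "int size(void) { return 64; }"}
post = {"copy": "void copy(char *d, char *s, size_t n) { strncpy(d, s, n); }",
        "size": "int size(void) { return 64; }"}
print(label_functions(pre, post))
# [{'function': 'copy', ..., 'label': 1}, {'function': 'size', ..., 'label': 0}]
```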
Combining our new dataset with previous datasets, we present an analysis of
the challenges and promising research directions of using deep learning for
detecting software vulnerabilities. We study 11 model architectures belonging
to 4 families. Our results show that deep learning is still not ready for
vulnerability detection, due to high false positive rates, low F1 scores, and
difficulty detecting hard CWEs. In particular, we demonstrate an important
generalization challenge for the deployment of deep learning-based models. We
show that increasing the volume of training data may not further improve the
performance of deep learning models for vulnerability detection, but may help
improve their generalization to unseen projects.
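The two headline metrics are worth pinning down, because the dataset's class imbalance (roughly 17 non-vulnerable functions for every vulnerable one) means even a modest false positive rate buries true detections and drags F1 down. A minimal sketch, assuming binary labels with 1 = vulnerable:

```python
# False positive rate and F1 for a binary vulnerability classifier
# (standard definitions; the paper's exact evaluation protocol may differ).

def fpr_and_f1(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return fpr, f1
```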
We also identify hopeful future research directions. We demonstrate that
large language models (LLMs) are a promising research direction for ML-based
vulnerability detection, outperforming Graph Neural Networks (GNNs) with
code-structure features in our experiments. Moreover, developing source code
specific pre-training objectives is a promising research direction to improve
the vulnerability detection performance.
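As one concrete instance of the LLM-as-detector setup, a pre-trained code model can be given a binary classification head and fine-tuned on labeled functions. The sketch below uses Hugging Face Transformers with CodeBERT as an illustrative choice; the model and settings are assumptions, not the paper's exact configuration:

```python
# Sketch: a pre-trained code model with a binary classification head
# (the head is freshly initialized, so scores are meaningless until fine-tuned).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2)  # 0 = non-vulnerable, 1 = vulnerable

code = "void copy(char *d, char *s) { strcpy(d, s); }"
inputs = tok(code, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)
print(probs)  # fine-tune on labeled functions before trusting these scores
```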
Related papers
- DFEPT: Data Flow Embedding for Enhancing Pre-Trained Model Based Vulnerability Detection [7.802093464108404]
We propose a data flow embedding technique to enhance the performance of pre-trained models in vulnerability detection tasks.
Specifically, we parse data flow graphs from function-level source code and use each variable's data type as the node features of the DFG.
Our research shows that DFEPT can provide effective vulnerability semantic information to pre-trained models, achieving an accuracy of 64.97% on the Devign dataset and an F1-Score of 47.9% on the Reveal dataset.
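A toy illustration of the DFG representation described above, with each variable's declared type as its node feature (hand-built here with networkx; DFEPT's actual parser and embedding are more involved):

```python
# Hand-built data flow graph: nodes are variables, node features are their
# data types, edges follow the flow of data (illustrative only).
import networkx as nx

dfg = nx.DiGraph()
dfg.add_node("src", dtype="char*")
dfg.add_node("n", dtype="size_t")
dfg.add_node("buf", dtype="char[64]")
dfg.add_edge("src", "buf")  # e.g. memcpy(buf, src, n): data flows src -> buf
dfg.add_edge("n", "buf")    # the length argument also shapes buf's contents

for node, attrs in dfg.nodes(data=True):
    print(node, attrs["dtype"], "->", list(dfg.successors(node)))
```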
arXiv Detail & Related papers (2024-10-24T07:05:07Z) - RealVul: Can We Detect Vulnerabilities in Web Applications with LLM? [4.467475584754677]
We present RealVul, the first LLM-based framework designed for PHP vulnerability detection.
RealVul isolates potential vulnerability triggers while streamlining the code and removing irrelevant semantic information.
We also address the issue of insufficient PHP vulnerability samples by improving data synthesis methods.
arXiv Detail & Related papers (2024-10-10T03:16:34Z) - Enhancing Pre-Trained Language Models for Vulnerability Detection via Semantic-Preserving Data Augmentation [4.374800396968465]
We propose a data augmentation technique aimed at enhancing the performance of pre-trained language models for vulnerability detection.
By incorporating our augmented dataset when fine-tuning a series of representative pre-trained code models, up to a 10.1% increase in accuracy and a 23.6% increase in F1 can be achieved.
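One classic semantic-preserving transform is identifier renaming: the program's behavior is unchanged while its surface tokens differ. A minimal regex-based sketch (the paper's transformations are richer; this helper is an illustrative assumption):

```python
import re

def rename_identifier(code: str, old: str, new: str) -> str:
    # Word boundaries keep us from rewriting substrings of longer identifiers.
    return re.sub(rf"\b{re.escape(old)}\b", new, code)

original = "int add(int a, int b) { return a + b; }"
print(rename_identifier(original, "a", "lhs"))
# int add(int lhs, int b) { return lhs + b; }
```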
arXiv Detail & Related papers (2024-09-30T21:44:05Z) - Code Structure-Aware through Line-level Semantic Learning for Code Vulnerability Detection [44.29771620061153]
We propose a novel network architecture based on pre-trained code models that incorporates structural information awareness.
The Code Structure-Aware Network through Line-level Semantic Learning (CSLS) integrates three key components: global vulnerability awareness, line-structural awareness, and sensitive-line awareness.
arXiv Detail & Related papers (2024-07-26T17:15:58Z) - Security Vulnerability Detection with Multitask Self-Instructed Fine-Tuning of Large Language Models [8.167614500821223]
We introduce MSIVD, multitask self-instructed fine-tuning for vulnerability detection, inspired by chain-of-thought prompting and LLM self-instruction.
Our experiments demonstrate that MSIVD achieves superior performance, outperforming the strongest LLM-based vulnerability detector baseline (LineVul) with an F1 score of 0.92 on the BigVul dataset and 0.48 on the PreciseBugs dataset.
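The multitask idea can be pictured as training pairs that carry both a label and an explanation, so the model learns to detect and to reason about the detection jointly. A sketch of such a pair (field names and wording are illustrative assumptions, not MSIVD's exact format):

```python
# One multitask training pair: code + label + chain-of-thought-style explanation.
sample = {
    "code": "void copy(char *d, char *s) { strcpy(d, s); }",
    "label": "vulnerable",
    "explanation": "strcpy copies without a bound; a source longer than the "
                   "destination buffer overflows it (out-of-bounds write).",
}
prompt = f"Code:\n{sample['code']}\nIs this code vulnerable? Explain."
target = f"{sample['label']}. {sample['explanation']}"
print(prompt, target, sep="\n---\n")
```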
arXiv Detail & Related papers (2024-06-09T19:18:05Z) - Enhancing Code Vulnerability Detection via Vulnerability-Preserving Data Augmentation [29.72520866016839]
Source code vulnerability detection aims to identify inherent vulnerabilities to safeguard software systems from potential attacks.
Many prior studies overlook diverse vulnerability characteristics, simplifying the problem into a binary (0-1) classification task.
FGVulDet employs multiple classifiers to discern characteristics of various vulnerability types and combines their outputs to identify the specific type of vulnerability.
FGVulDet is trained on a large-scale dataset from GitHub, encompassing five different types of vulnerabilities.
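The combination step can be sketched as one binary scorer per vulnerability type whose outputs are merged into a single fine-grained prediction. The type names, stub scorers, and max-score rule below are illustrative assumptions, not FGVulDet's exact design:

```python
TYPES = ["buffer-overflow", "use-after-free", "integer-overflow",
         "pointer-misuse", "array-misuse"]

def detect(code, classifiers, threshold=0.5):
    # Each per-type classifier returns P(code has a vulnerability of that type).
    scores = {t: classifiers[t](code) for t in TYPES}
    best = max(scores, key=scores.get)
    return (best if scores[best] >= threshold else None, scores[best])

# Toy usage with stub classifiers:
stubs = {t: (lambda c, t=t: 0.9 if t == "buffer-overflow" and "strcpy" in c else 0.1)
         for t in TYPES}
print(detect("strcpy(dst, src);", stubs))  # ('buffer-overflow', 0.9)
```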
arXiv Detail & Related papers (2024-04-15T09:10:52Z) - Zero-shot Retrieval: Augmenting Pre-trained Models with Search Engines [83.65380507372483]
Large pre-trained models can dramatically reduce the amount of task-specific data required to solve a problem, but they often fail to capture domain-specific nuances out of the box.
This paper shows how to leverage recent advances in NLP and multi-modal learning to augment a pre-trained model with search engine retrieval.
arXiv Detail & Related papers (2023-11-29T05:33:28Z) - Deep networks for system identification: a Survey [56.34005280792013]
System identification learns mathematical descriptions of dynamic systems from input-output data.
The main aim of the identified model is to predict new data from previous observations.
We discuss architectures commonly adopted in the literature, like feedforward, convolutional, and recurrent networks.
arXiv Detail & Related papers (2023-01-30T12:38:31Z) - Weakly Supervised Change Detection Using Guided Anisotropic Diffusion [97.43170678509478]
We propose original ideas that help us leverage such datasets in the context of change detection.
First, we propose the guided anisotropic diffusion (GAD) algorithm, which improves semantic segmentation results.
We then show its potential in two weakly-supervised learning strategies tailored for change detection.
arXiv Detail & Related papers (2021-12-31T10:03:47Z) - VELVET: a noVel Ensemble Learning approach to automatically locate VulnErable sTatements [62.93814803258067]
This paper presents VELVET, a novel ensemble learning approach to locate vulnerable statements in source code.
Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph.
VELVET achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively.
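The ensemble's statement-level output can be pictured as merging per-statement scores from the graph-based and sequence-based models and ranking by the combined score; the simple averaging rule here is an illustrative assumption, not VELVET's exact combination:

```python
def rank_statements(graph_scores, seq_scores):
    # Average the two models' per-statement scores, then sort so that the
    # top-1 statement is the most suspicious one.
    combined = [(i, (g + s) / 2)
                for i, (g, s) in enumerate(zip(graph_scores, seq_scores))]
    return sorted(combined, key=lambda x: x[1], reverse=True)

print(rank_statements([0.2, 0.9, 0.1], [0.3, 0.7, 0.2]))
# [(1, 0.8), (0, 0.25), (2, 0.15)]
```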
arXiv Detail & Related papers (2021-12-20T22:45:27Z) - Information Obfuscation of Graph Neural Networks [96.8421624921384]
We study the problem of protecting sensitive attributes by information obfuscation when learning with graph structured data.
We propose a framework to locally filter out pre-determined sensitive attributes via adversarial training with the total variation and the Wasserstein distance.
arXiv Detail & Related papers (2020-09-28T17:55:04Z)