MegaVul: A C/C++ Vulnerability Dataset with Comprehensive Code Representation
- URL: http://arxiv.org/abs/2406.12415v1
- Date: Tue, 18 Jun 2024 09:03:18 GMT
- Title: MegaVul: A C/C++ Vulnerability Dataset with Comprehensive Code Representation
- Authors: Chao Ni, Liyu Shen, Xiaohu Yang, Yan Zhu, Shaohua Wang,
- Abstract summary: MegaVul is a newly large-scale and comprehensive C/C++ vulnerability dataset named MegaVul.
We collected all crawlable descriptive information of the vulnerabilities from the CVE database and extracted all vulnerability-related code changes from 28 Git-based websites.
In total, MegaVul contains 17,380 vulnerabilities collected from 992 open-source repositories spanning 169 different vulnerability types from January 2006 to October 2023.
- Score: 5.821166713605872
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We constructed a newly large-scale and comprehensive C/C++ vulnerability dataset named MegaVul by crawling the Common Vulnerabilities and Exposures (CVE) database and CVE-related open-source projects. Specifically, we collected all crawlable descriptive information of the vulnerabilities from the CVE database and extracted all vulnerability-related code changes from 28 Git-based websites. We adopt advanced tools to ensure the extracted code integrality and enrich the code with four different transformed representations. In total, MegaVul contains 17,380 vulnerabilities collected from 992 open-source repositories spanning 169 different vulnerability types disclosed from January 2006 to October 2023. Thus, MegaVul can be used for a variety of software security-related tasks including detecting vulnerabilities and assessing vulnerability severity. All information is stored in the JSON format for easy usage. MegaVul is publicly available on GitHub and will be continuously updated. It can be easily extended to other programming languages.
Related papers
- eyeballvul: a future-proof benchmark for vulnerability detection in the wild [0.0]
eyeballvul is a benchmark designed to test the vulnerability detection capabilities of language models at scale.
It is sourced and updated weekly from the stream of published vulnerabilities in open-source repositories.
eyeballvul contains 24,000+ vulnerabilities across 6,000+ revisions and 5,000+ repositories, and is around 55GB in size.
arXiv Detail & Related papers (2024-07-11T17:46:21Z) - VulZoo: A Comprehensive Vulnerability Intelligence Dataset [12.229092589037808]
VulZoo is a comprehensive vulnerability intelligence dataset that covers 17 popular vulnerability information sources.
We make VulZoo publicly available and maintain it with incremental updates to facilitate future research.
arXiv Detail & Related papers (2024-06-24T06:39:07Z) - FV8: A Forced Execution JavaScript Engine for Detecting Evasive Techniques [53.288368877654705]
FV8 is a modified V8 JavaScript engine designed to identify evasion techniques in JavaScript code.
It selectively enforces code execution on APIs that conditionally inject dynamic code.
It identifies 1,443 npm packages and 164 (82%) extensions containing at least one type of evasion.
arXiv Detail & Related papers (2024-05-21T19:54:19Z) - Just another copy and paste? Comparing the security vulnerabilities of ChatGPT generated code and StackOverflow answers [4.320393382724067]
This study empirically compares the vulnerabilities of ChatGPT and StackOverflow snippets.
ChatGPT contained 248 vulnerabilities compared to the 302 vulnerabilities found in SO snippets, producing 20% fewer vulnerabilities with a statistically significant difference.
Our findings suggest developers are under-educated on insecure code propagation from both platforms.
arXiv Detail & Related papers (2024-03-22T20:06:41Z) - CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion [117.178835165855]
This paper introduces CodeAttack, a framework that transforms natural language inputs into code inputs.
Our studies reveal a new and universal safety vulnerability of these models against code input.
We find that a larger distribution gap between CodeAttack and natural language leads to weaker safety generalization.
arXiv Detail & Related papers (2024-03-12T17:55:38Z) - On the Security Blind Spots of Software Composition Analysis [46.1389163921338]
We present a novel approach to detect vulnerable clones in the Maven repository.
We retrieve over 53k potential vulnerable clones from Maven Central.
We detect 727 confirmed vulnerable clones and synthesize a testable proof-of-vulnerability project for each of those.
arXiv Detail & Related papers (2023-06-08T20:14:46Z) - CodeLMSec Benchmark: Systematically Evaluating and Finding Security
Vulnerabilities in Black-Box Code Language Models [58.27254444280376]
Large language models (LLMs) for automatic code generation have achieved breakthroughs in several programming tasks.
Training data for these models is usually collected from the Internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities.
This unsanitized training data can cause the language models to learn these vulnerabilities and propagate them during the code generation procedure.
arXiv Detail & Related papers (2023-02-08T11:54:07Z) - Detecting Security Patches via Behavioral Data in Code Repositories [11.052678122289871]
We show a system to automatically identify security patches using only the developer behavior in the Git repository.
We showed we can reveal concealed security patches with an accuracy of 88.3% and F1 Score of 89.8%.
arXiv Detail & Related papers (2023-02-04T06:43:07Z) - DeepKE: A Deep Learning Based Knowledge Extraction Toolkit for Knowledge
Base Population [95.0099875111663]
DeepKE implements various information extraction tasks, including named entity recognition, relation extraction and attribute extraction.
DeepKE allows developers and researchers to customize datasets and models to extract information from unstructured data according to their requirements.
arXiv Detail & Related papers (2022-01-10T13:29:05Z) - VELVET: a noVel Ensemble Learning approach to automatically locate
VulnErable sTatements [62.93814803258067]
This paper presents VELVET, a novel ensemble learning approach to locate vulnerable statements in source code.
Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph.
VELVET achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively.
arXiv Detail & Related papers (2021-12-20T22:45:27Z) - CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from
Open-Source Software [0.0]
We implement a fully automated dataset collection tool and share an initial release of the resulting vulnerability dataset named CVEfixes.
The dataset is enriched with meta-data such as programming language, and detailed code and security metrics at five levels of abstraction.
CVEfixes supports various types of data-driven software security research, such as vulnerability prediction, vulnerability classification, vulnerability severity prediction, analysis of vulnerability-related code changes, and automated vulnerability repair.
arXiv Detail & Related papers (2021-07-19T11:34:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.