CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from
Open-Source Software
- URL: http://arxiv.org/abs/2107.08760v1
- Date: Mon, 19 Jul 2021 11:34:09 GMT
- Authors: Guru Prasad Bhandari, Amara Naseer and Leon Moonen (Simula Research
Laboratory, Norway)
- Abstract summary: We implement a fully automated dataset collection tool and share an initial release of the resulting vulnerability dataset named CVEfixes.
The dataset is enriched with meta-data such as programming language, and detailed code and security metrics at five levels of abstraction.
CVEfixes supports various types of data-driven software security research, such as vulnerability prediction, vulnerability classification, vulnerability severity prediction, analysis of vulnerability-related code changes, and automated vulnerability repair.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data-driven research on the automated discovery and repair of security
vulnerabilities in source code requires comprehensive datasets of real-life
vulnerable code and their fixes. To assist in such research, we propose a
method to automatically collect and curate a comprehensive vulnerability
dataset from Common Vulnerabilities and Exposures (CVE) records in the public
National Vulnerability Database (NVD). We implement our approach in a fully
automated dataset collection tool and share an initial release of the resulting
vulnerability dataset named CVEfixes.
The CVEfixes collection tool automatically fetches all available CVE records
from the NVD, gathers the vulnerable code and corresponding fixes from
associated open-source repositories, and organizes the collected information in
a relational database. Moreover, the dataset is enriched with meta-data such as
programming language, and detailed code and security metrics at five levels of
abstraction. The collection can easily be repeated to keep up-to-date with
newly discovered or patched vulnerabilities. The initial release of CVEfixes
spans all published CVEs up to 9 June 2021, covering 5365 CVE records for 1754
open-source projects that were addressed in a total of 5495 vulnerability
fixing commits.
CVEfixes supports various types of data-driven software security research,
such as vulnerability prediction, vulnerability classification, vulnerability
severity prediction, analysis of vulnerability-related code changes, and
automated vulnerability repair.
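The collection steps described in the abstract (take a CVE record, find the fix commits among its references, store the result relationally) can be sketched as follows. This is a minimal illustration and not the CVEfixes tool itself: the table layout, the `COMMIT_URL` pattern, the helper names, and the sample record are all assumptions made for this example, and the real tool's schema and matching rules may differ.

```python
import re
import sqlite3

# Hypothetical pattern for a reference URL that points at one commit on a
# Git hosting site; the actual tool may recognise other hosts and URL shapes.
COMMIT_URL = re.compile(
    r"^(https?://(?:github|gitlab)\.com/[^/]+/[^/]+)/(?:-/)?commit/([0-9a-f]{7,40})"
)

# Hypothetical two-table layout: one row per CVE, one row per fixing commit.
SCHEMA = """
CREATE TABLE cve (cve_id TEXT PRIMARY KEY, description TEXT);
CREATE TABLE fixes (
    cve_id TEXT REFERENCES cve(cve_id),
    repo_url TEXT,
    commit_hash TEXT,
    PRIMARY KEY (cve_id, repo_url, commit_hash)
);
"""


def extract_fix(url):
    """Return (repo_url, commit_hash) if the URL looks like a commit link, else None."""
    match = COMMIT_URL.match(url)
    return (match.group(1), match.group(2)) if match else None


def store_record(conn, cve_id, description, reference_urls):
    """Insert one CVE and every fix commit found among its reference URLs."""
    conn.execute("INSERT OR IGNORE INTO cve VALUES (?, ?)", (cve_id, description))
    for url in reference_urls:
        fix = extract_fix(url)
        if fix:
            conn.execute("INSERT OR IGNORE INTO fixes VALUES (?, ?, ?)", (cve_id, *fix))


def build_demo_db():
    """Build an in-memory database holding a single made-up CVE record."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    store_record(
        conn,
        "CVE-0000-0001",  # placeholder identifier, not a real CVE
        "made-up record for illustration",
        [
            "https://github.com/example/project/commit/0123456789abcdef0123456789abcdef01234567",
            "https://example.org/advisory/42",  # not a commit link, so it is skipped
        ],
    )
    return conn
```

Keeping the CVE and its fixes in separate tables lets one CVE map to several commits and one commit address several CVEs, which would be consistent with the reported counts (5365 CVE records, 5495 fixing commits).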
Related papers
- Discovery of Timeline and Crowd Reaction of Software Vulnerability Disclosures [47.435076500269545]
Apache Log4J was found to be vulnerable to remote code execution attacks.
More than 35,000 packages were forced to update their Log4J libraries to the latest version.
It is reasonable in practice for software developers to update their third-party libraries whenever the software vendors release a vulnerability-free version.
arXiv Detail & Related papers (2024-11-12T01:55:51Z)
- ARVO: Atlas of Reproducible Vulnerabilities for Open Source Software [20.927909014593318]
We introduce ARVO: an Atlas of Reproducible Vulnerabilities in Open-source software.
We reproduce more than 5,000 memory vulnerabilities across over 250 projects.
Our dataset can be automatically updated as OSS-Fuzz finds new vulnerabilities.
arXiv Detail & Related papers (2024-08-04T22:13:14Z)
- VulZoo: A Comprehensive Vulnerability Intelligence Dataset [12.229092589037808]
VulZoo is a comprehensive vulnerability intelligence dataset that covers 17 popular vulnerability information sources.
We make VulZoo publicly available and maintain it with incremental updates to facilitate future research.
arXiv Detail & Related papers (2024-06-24T06:39:07Z)
- MegaVul: A C/C++ Vulnerability Dataset with Comprehensive Code Representation [5.821166713605872]
MegaVul is a new large-scale and comprehensive C/C++ vulnerability dataset.
We collected all crawlable descriptive information of the vulnerabilities from the CVE database and extracted all vulnerability-related code changes from 28 Git-based websites.
In total, MegaVul contains 17,380 vulnerabilities collected from 992 open-source repositories spanning 169 different vulnerability types from January 2006 to October 2023.
arXiv Detail & Related papers (2024-06-18T09:03:18Z)
- Unveiling Hidden Links Between Unseen Security Entities [3.7138962865789353]
VulnScopper is an innovative approach that utilizes multi-modal representation learning, combining Knowledge Graphs (KG) and Natural Language Processing (NLP).
We evaluate VulnScopper on two major security datasets, the National Vulnerability Database (NVD) and the Red Hat CVE database.
Our results show that VulnScopper outperforms existing methods, achieving up to 78% Hits@10 accuracy in linking CVEs to Common Weakness Enumerations (CWEs) and Common Platform Enumerations (CPEs).
arXiv Detail & Related papers (2024-03-04T13:14:39Z)
- REEF: A Framework for Collecting Real-World Vulnerabilities and Fixes [40.401211102969356]
We propose an automated collection framework, REEF, to collect REal-world vulnErabilities and Fixes from open-source repositories.
We develop a multi-language crawler to collect vulnerabilities and their fixes, and design metrics to filter for high-quality vulnerability-fix pairs.
Through extensive experiments, we demonstrate that our approach can collect high-quality vulnerability-fix pairs and generate strong explanations.
arXiv Detail & Related papers (2023-09-15T02:50:08Z)
- CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models [58.27254444280376]
Large language models (LLMs) for automatic code generation have achieved breakthroughs in several programming tasks.
Training data for these models is usually collected from the Internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities.
This unsanitized training data can cause the language models to learn these vulnerabilities and propagate them during the code generation procedure.
arXiv Detail & Related papers (2023-02-08T11:54:07Z)
- VELVET: a noVel Ensemble Learning approach to automatically locate VulnErable sTatements [62.93814803258067]
This paper presents VELVET, a novel ensemble learning approach to locate vulnerable statements in source code.
Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph.
VELVET achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively.
arXiv Detail & Related papers (2021-12-20T22:45:27Z)
- FRUIT: Faithfully Reflecting Updated Information in Text [106.40177769765512]
We introduce the novel generation task of *faithfully reflecting updated information in text* (FRUIT).
Our analysis shows that developing models that can update articles faithfully requires new capabilities for neural generation models.
arXiv Detail & Related papers (2021-12-16T05:21:24Z)
- Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses [150.64470864162556]
This work systematically categorizes and discusses a wide range of dataset vulnerabilities and exploits.
In addition to describing various poisoning and backdoor threat models and the relationships among them, we develop their unified taxonomy.
arXiv Detail & Related papers (2020-12-18T22:38:47Z)