Related papers: CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software

CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software

URL: http://arxiv.org/abs/2107.08760v1
Date: Mon, 19 Jul 2021 11:34:09 GMT
Title: CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software
Authors: Guru Prasad Bhandari, Amara Naseer and Leon Moonen (Simula Research Laboratory, Norway)
Abstract summary: We implement a fully automated dataset collection tool and share an initial release of the resulting vulnerability dataset named CVEfixes. The dataset is enriched with meta-data such as programming language, and detailed code and security metrics at five levels of abstraction. CVEfixes supports various types of data-driven software security research, such as vulnerability prediction, vulnerability classification, vulnerability severity prediction, analysis of vulnerability-related code changes, and automated vulnerability repair.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Data-driven research on the automated discovery and repair of security vulnerabilities in source code requires comprehensive datasets of real-life vulnerable code and their fixes. To assist in such research, we propose a method to automatically collect and curate a comprehensive vulnerability dataset from Common Vulnerabilities and Exposures (CVE) records in the public National Vulnerability Database (NVD). We implement our approach in a fully automated dataset collection tool and share an initial release of the resulting vulnerability dataset named CVEfixes. The CVEfixes collection tool automatically fetches all available CVE records from the NVD, gathers the vulnerable code and corresponding fixes from associated open-source repositories, and organizes the collected information in a relational database. Moreover, the dataset is enriched with meta-data such as programming language, and detailed code and security metrics at five levels of abstraction. The collection can easily be repeated to keep up-to-date with newly discovered or patched vulnerabilities. The initial release of CVEfixes spans all published CVEs up to 9 June 2021, covering 5365 CVE records for 1754 open-source projects that were addressed in a total of 5495 vulnerability fixing commits. CVEfixes supports various types of data-driven software security research, such as vulnerability prediction, vulnerability classification, vulnerability severity prediction, analysis of vulnerability-related code changes, and automated vulnerability repair.

Related papers

CommitShield: Tracking Vulnerability Introduction and Fix in Version Control Systems [15.037460085046806]
CommitShield is a tool for detecting vulnerabilities in code commits. It combines the code analysis capabilities of static analysis tools with the natural language and code understanding capabilities of large language models. We show that CommitShield improves recall by 76%-87% over state-of-the-art methods in the vulnerability fix detection task.
arXiv Detail & Related papers (2025-01-07T08:52:55Z)
Discovery of Timeline and Crowd Reaction of Software Vulnerability Disclosures [47.435076500269545]
Apache Log4J was found to be vulnerable to remote code execution attacks. More than 35,000 packages were forced to update their Log4J libraries with the latest version. It is practically reasonable for software developers to update their third-party libraries whenever the software vendors have released a vulnerable-free version.
arXiv Detail & Related papers (2024-11-12T01:55:51Z)
ARVO: Atlas of Reproducible Vulnerabilities for Open Source Software [20.927909014593318]
We introduce ARVO: an Atlas of Reproducible Vulnerabilities in Open-source software. We reproduce more than 5,000 memory vulnerabilities across over 250 projects. Our dataset can be automatically updated as OSS-Fuzz finds new vulnerabilities.
arXiv Detail & Related papers (2024-08-04T22:13:14Z)
VulZoo: A Comprehensive Vulnerability Intelligence Dataset [12.229092589037808]
VulZoo is a comprehensive vulnerability intelligence dataset that covers 17 popular vulnerability information sources. We make VulZoo publicly available and maintain it with incremental updates to facilitate future research.
arXiv Detail & Related papers (2024-06-24T06:39:07Z)
MegaVul: A C/C++ Vulnerability Dataset with Comprehensive Code Representation [5.821166713605872]
MegaVul is a newly large-scale and comprehensive C/C++ vulnerability dataset named MegaVul. We collected all crawlable descriptive information of the vulnerabilities from the CVE database and extracted all vulnerability-related code changes from 28 Git-based websites. In total, MegaVul contains 17,380 vulnerabilities collected from 992 open-source repositories spanning 169 different vulnerability types from January 2006 to October 2023.
arXiv Detail & Related papers (2024-06-18T09:03:18Z)
Vulnerability Detection with Code Language Models: How Far Are We? [40.455600722638906]
PrimeVul is a new dataset for training and evaluating code LMs for vulnerability detection. It incorporates a novel set of data labeling techniques that achieve comparable label accuracy to human-verified benchmarks. It also implements a rigorous data de-duplication and chronological data splitting strategy to mitigate data leakage issues.
arXiv Detail & Related papers (2024-03-27T14:34:29Z)
Profile of Vulnerability Remediations in Dependencies Using Graph Analysis [40.35284812745255]
This research introduces graph analysis methods and a modified Graph Attention Convolutional Neural Network (GAT) model. We analyze control flow graphs to profile breaking changes in applications occurring from dependency upgrades intended to remediate vulnerabilities. Results demonstrate the effectiveness of the enhanced GAT model in offering nuanced insights into the relational dynamics of code vulnerabilities.
arXiv Detail & Related papers (2024-03-08T02:01:47Z)
Unveiling Hidden Links Between Unseen Security Entities [3.7138962865789353]
VulnScopper is an innovative approach that utilizes multi-modal representation learning, combining Knowledge Graphs (KG) and Natural Processing (NLP) We evaluate VulnScopper on two major security datasets, the National Vulnerability Database (NVD) and the Red Hat CVE database. Our results show that VulnScopper outperforms existing methods, achieving up to 78% Hits@10 accuracy in linking CVEs to Common Vulnerabilities and Exposures (CWEs), and Common Platform Languageions (CPEs)
arXiv Detail & Related papers (2024-03-04T13:14:39Z)
REEF: A Framework for Collecting Real-World Vulnerabilities and Fixes [40.401211102969356]
We propose an automated collecting framework REEF to collect REal-world vulnErabilities and Fixes from open-source repositories. We develop a multi-language crawler to collect vulnerabilities and their fixes, and design metrics to filter for high-quality vulnerability-fix pairs. Through extensive experiments, we demonstrate that our approach can collect high-quality vulnerability-fix pairs and generate strong explanations.
arXiv Detail & Related papers (2023-09-15T02:50:08Z)
On the Security Blind Spots of Software Composition Analysis [46.1389163921338]
We present a novel approach to detect vulnerable clones in the Maven repository. We retrieve over 53k potential vulnerable clones from Maven Central. We detect 727 confirmed vulnerable clones and synthesize a testable proof-of-vulnerability project for each of those.
arXiv Detail & Related papers (2023-06-08T20:14:46Z)
CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models [58.27254444280376]
Large language models (LLMs) for automatic code generation have achieved breakthroughs in several programming tasks. Training data for these models is usually collected from the Internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities. This unsanitized training data can cause the language models to learn these vulnerabilities and propagate them during the code generation procedure.
arXiv Detail & Related papers (2023-02-08T11:54:07Z)
VELVET: a noVel Ensemble Learning approach to automatically locate VulnErable sTatements [62.93814803258067]
This paper presents VELVET, a novel ensemble learning approach to locate vulnerable statements in source code. Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph. VELVET achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively.
arXiv Detail & Related papers (2021-12-20T22:45:27Z)
FRUIT: Faithfully Reflecting Updated Information in Text [106.40177769765512]
We introduce the novel generation task of *faithfully reflecting updated information in text*(FRUIT) Our analysis shows that developing models that can update articles faithfully requires new capabilities for neural generation models.
arXiv Detail & Related papers (2021-12-16T05:21:24Z)
Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses [150.64470864162556]
This work systematically categorizes and discusses a wide range of dataset vulnerabilities and exploits. In addition to describing various poisoning and backdoor threat models and the relationships among them, we develop their unified taxonomy.
arXiv Detail & Related papers (2020-12-18T22:38:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.