Related papers: VulZoo: A Comprehensive Vulnerability Intelligence Dataset

VulZoo: A Comprehensive Vulnerability Intelligence Dataset

URL: http://arxiv.org/abs/2406.16347v1
Date: Mon, 24 Jun 2024 06:39:07 GMT
Title: VulZoo: A Comprehensive Vulnerability Intelligence Dataset
Authors: Bonan Ruan, Jiahao Liu, Weibo Zhao, Zhenkai Liang,
Abstract summary: VulZoo is a comprehensive vulnerability intelligence dataset that covers 17 popular vulnerability information sources. We make VulZoo publicly available and maintain it with incremental updates to facilitate future research.
Score: 12.229092589037808
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Software vulnerabilities pose critical security and risk concerns for many software systems. Many techniques have been proposed to effectively assess and prioritize these vulnerabilities before they cause serious consequences. To evaluate their performance, these solutions often craft their own experimental datasets from limited information sources, such as MITRE CVE and NVD, lacking a global overview of broad vulnerability intelligence. The repetitive data preparation process further complicates the verification and comparison of new solutions. To resolve this issue, in this paper, we propose VulZoo, a comprehensive vulnerability intelligence dataset that covers 17 popular vulnerability information sources. We also construct connections among these sources, enabling more straightforward configuration and adaptation for different vulnerability assessment tasks (e.g., vulnerability type prediction). Additionally, VulZoo provides utility scripts for automatic data synchronization and cleaning, relationship mining, and statistics generation. We make VulZoo publicly available and maintain it with incremental updates to facilitate future research. We believe that VulZoo serves as a valuable input to vulnerability assessment and prioritization studies. The dataset with utility scripts is available at https://github.com/NUS-Curiosity/VulZoo.

Related papers

CyberGym: Evaluating AI Agents' Cybersecurity Capabilities with Real-World Vulnerabilities at Scale [46.76144797837242]
Large language model (LLM) agents are becoming increasingly skilled at handling cybersecurity tasks autonomously.<n>Existing benchmarks fall short, often failing to capture real-world scenarios or being limited in scope.<n>We introduce CyberGym, a large-scale and high-quality cybersecurity evaluation framework featuring 1,507 real-world vulnerabilities.
arXiv Detail & Related papers (2025-06-03T07:35:14Z)
Improving Data Curation of Software Vulnerability Patches through Uncertainty Quantification [6.916509590637601]
We propose an approach employing Uncertainty Quantification (UQ) to curate datasets of publicly-available software vulnerability patches. Model Ensemble and heteroscedastic models are the best choices for vulnerability patch datasets.
arXiv Detail & Related papers (2024-11-18T15:37:28Z)
RealVul: Can We Detect Vulnerabilities in Web Applications with LLM? [4.467475584754677]
We present RealVul, the first LLM-based framework designed for PHP vulnerability detection. We can isolate potential vulnerability triggers while streamlining the code and eliminating unnecessary semantic information. We also address the issue of insufficient PHP vulnerability samples by improving data synthesis methods.
arXiv Detail & Related papers (2024-10-10T03:16:34Z)
The Impact of SBOM Generators on Vulnerability Assessment in Python: A Comparison and a Novel Approach [56.4040698609393]
Software Bill of Materials (SBOM) has been promoted as a tool to increase transparency and verifiability in software composition. Current SBOM generation tools often suffer from inaccuracies in identifying components and dependencies. We propose PIP-sbom, a novel pip-inspired solution that addresses their shortcomings.
arXiv Detail & Related papers (2024-09-10T10:12:37Z)
KGV: Integrating Large Language Models with Knowledge Graphs for Cyber Threat Intelligence Credibility Assessment [38.312774244521]
We propose a knowledge graph-based verifier for Cyber Threat Intelligence (CTI) quality assessment framework. Our approach introduces Large Language Models (LLMs) to automatically extract OSCTI key claims to be verified. To fill the gap in the research field, we created and made public the first dataset for threat intelligence assessment from heterogeneous sources.
arXiv Detail & Related papers (2024-08-15T11:32:46Z)
ARVO: Atlas of Reproducible Vulnerabilities for Open Source Software [20.927909014593318]
We introduce ARVO: an Atlas of Reproducible Vulnerabilities in Open-source software. We reproduce more than 5,000 memory vulnerabilities across over 250 projects. Our dataset can be automatically updated as OSS-Fuzz finds new vulnerabilities.
arXiv Detail & Related papers (2024-08-04T22:13:14Z)
Robust Utility-Preserving Text Anonymization Based on Large Language Models [80.5266278002083]
Anonymizing text that contains sensitive information is crucial for a wide range of applications.<n>Existing techniques face the emerging challenges of the re-identification ability of large language models.<n>We propose a framework composed of three key components: a privacy evaluator, a utility evaluator, and an optimization component.
arXiv Detail & Related papers (2024-07-16T14:28:56Z)
"Glue pizza and eat rocks" -- Exploiting Vulnerabilities in Retrieval-Augmented Generative Models [74.05368440735468]
Retrieval-Augmented Generative (RAG) models enhance Large Language Models (LLMs) In this paper, we demonstrate a security threat where adversaries can exploit the openness of these knowledge bases.
arXiv Detail & Related papers (2024-06-26T05:36:23Z)
On Security Weaknesses and Vulnerabilities in Deep Learning Systems [32.14068820256729]
We specifically look into deep learning (DL) framework and perform the first systematic study of vulnerabilities in DL systems. We propose a two-stream data analysis framework to explore vulnerability patterns from various databases. We conducted a large-scale empirical study of 3,049 DL vulnerabilities to better understand the patterns of vulnerability and the challenges in fixing them.
arXiv Detail & Related papers (2024-06-12T23:04:13Z)
Vulnerability Detection with Code Language Models: How Far Are We? [40.455600722638906]
PrimeVul is a new dataset for training and evaluating code LMs for vulnerability detection. It incorporates a novel set of data labeling techniques that achieve comparable label accuracy to human-verified benchmarks. It also implements a rigorous data de-duplication and chronological data splitting strategy to mitigate data leakage issues.
arXiv Detail & Related papers (2024-03-27T14:34:29Z)
Profile of Vulnerability Remediations in Dependencies Using Graph Analysis [40.35284812745255]
This research introduces graph analysis methods and a modified Graph Attention Convolutional Neural Network (GAT) model. We analyze control flow graphs to profile breaking changes in applications occurring from dependency upgrades intended to remediate vulnerabilities. Results demonstrate the effectiveness of the enhanced GAT model in offering nuanced insights into the relational dynamics of code vulnerabilities.
arXiv Detail & Related papers (2024-03-08T02:01:47Z)
REEF: A Framework for Collecting Real-World Vulnerabilities and Fixes [40.401211102969356]
We propose an automated collecting framework REEF to collect REal-world vulnErabilities and Fixes from open-source repositories. We develop a multi-language crawler to collect vulnerabilities and their fixes, and design metrics to filter for high-quality vulnerability-fix pairs. Through extensive experiments, we demonstrate that our approach can collect high-quality vulnerability-fix pairs and generate strong explanations.
arXiv Detail & Related papers (2023-09-15T02:50:08Z)
DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection [29.52887618905746]
This dataset contains 18,945 vulnerable functions spanning 150 CWEs and 330,492 non-vulnerable functions extracted from 7,514 commits. Our results show that deep learning is still not ready for vulnerability detection, due to high false positive rate, low F1 score, and difficulty of detecting hard CWEs. We demonstrate that large language models (LLMs) are a promising research direction for ML-based vulnerability detection, outperforming Graph Neural Networks (GNNs) with code-structure features.
arXiv Detail & Related papers (2023-04-01T23:29:14Z)
CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models [58.27254444280376]
Large language models (LLMs) for automatic code generation have achieved breakthroughs in several programming tasks. Training data for these models is usually collected from the Internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities. This unsanitized training data can cause the language models to learn these vulnerabilities and propagate them during the code generation procedure.
arXiv Detail & Related papers (2023-02-08T11:54:07Z)
VELVET: a noVel Ensemble Learning approach to automatically locate VulnErable sTatements [62.93814803258067]
This paper presents VELVET, a novel ensemble learning approach to locate vulnerable statements in source code. Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph. VELVET achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively.
arXiv Detail & Related papers (2021-12-20T22:45:27Z)
Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses [150.64470864162556]
This work systematically categorizes and discusses a wide range of dataset vulnerabilities and exploits. In addition to describing various poisoning and backdoor threat models and the relationships among them, we develop their unified taxonomy.
arXiv Detail & Related papers (2020-12-18T22:38:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.