Related papers: From CVE Entries to Verifiable Exploits: An Automated Multi-Agent Framework for Reproducing CVEs

URL: http://arxiv.org/abs/2509.01835v1
Date: Mon, 01 Sep 2025 23:37:44 GMT
Title: From CVE Entries to Verifiable Exploits: An Automated Multi-Agent Framework for Reproducing CVEs
Authors: Saad Ullah, Praneeth Balasubramanian, Wenbo Guo, Amanda Burnett, Hammond Pearce, Christopher Kruegel, Giovanni Vigna, Gianluca Stringhini
Abstract summary: CVE-GENIE is an automated framework designed to reproduce real-world vulnerabilities. It reproduces 51% (428 of 841) CVEs published in 2024-2025, complete with their verifiable exploits, at an average cost of $2.77 per CVE. Our pipeline offers a robust method to generate reproducible CVE benchmarks, valuable for diverse applications.
Score: 23.210122086674048
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: High-quality datasets of real-world vulnerabilities and their corresponding verifiable exploits are crucial resources in software security research. Yet such resources remain scarce, as their creation demands intensive manual effort and deep security expertise. In this paper, we present CVE-GENIE, an automated, large language model (LLM)-based multi-agent framework designed to reproduce real-world vulnerabilities, provided in Common Vulnerabilities and Exposures (CVE) format, to enable creation of high-quality vulnerability datasets. Given a CVE entry as input, CVE-GENIE gathers the relevant resources of the CVE, automatically reconstructs the vulnerable environment, and (re)produces a verifiable exploit. Our systematic evaluation highlights the efficiency and robustness of CVE-GENIE's design and successfully reproduces approximately 51% (428 of 841) CVEs published in 2024-2025, complete with their verifiable exploits, at an average cost of $2.77 per CVE. Our pipeline offers a robust method to generate reproducible CVE benchmarks, valuable for diverse applications such as fuzzer evaluation, vulnerability patching, and assessing AI's security capabilities.

Related papers

CVE-Factory: Scaling Expert-Level Agentic Tasks for Code Security Vulnerability [50.57373283154859]
We present CVE-Factory, the first multiagent framework to achieve expert-level quality in automatically transforming vulnerability tasks. It is also evaluated on the latest realistic vulnerabilities and achieves a 66.2% verified success. We synthesize over 1,000 executable training environments, the first large-scale scaling of agentic tasks in code security.
arXiv Detail & Related papers (2026-02-03T02:27:16Z)
SWE-Universe: Scale Real-World Verifiable Environments to Millions [84.63665266236963]
SWE-Universe is a framework for automatically constructing real-world software engineering (SWE) verifiable environments from GitHub pull requests (PRs). We propose a building agent powered by an efficient custom-trained model to overcome the prevalent challenges of automatic building. We demonstrate the profound value of our environments through large-scale agentic mid-training and reinforcement learning.
arXiv Detail & Related papers (2026-02-02T17:20:30Z)
RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories [58.32028251925354]
Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, but their proficiency in producing secure code remains a critical, under-explored area. We introduce RealSec-bench, a new benchmark for secure code generation meticulously constructed from real-world, high-risk Java repositories.
arXiv Detail & Related papers (2026-01-30T08:29:01Z)
Automated Vulnerability Validation and Verification: A Large Language Model Approach [7.482522010482827]
This paper introduces an end-to-end multi-step pipeline leveraging generative AI, specifically large language models (LLMs). Our approach extracts information from CVE disclosures in the National Vulnerability Database. It augments it with external public knowledge (e.g., threat advisories, code snippets) using Retrieval-Augmented Generation (RAG). The pipeline iteratively refines generated artifacts, validates attack success with test cases, and supports complex multi-container setups.
arXiv Detail & Related papers (2025-09-28T19:16:12Z)
VulnRepairEval: An Exploit-Based Evaluation Framework for Assessing Large Language Model Vulnerability Repair Capabilities [41.85494398578654]
VulnRepairEval is an evaluation framework anchored in functional Proof-of-Concept exploits. Our framework delivers a comprehensive, containerized evaluation pipeline that enables reproducible differential assessment.
arXiv Detail & Related papers (2025-09-03T14:06:10Z)
A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code [48.10068691540979]
A.S.E (AI Code Generation Security Evaluation) is a benchmark for repository-level secure code generation. A.S.E constructs tasks from real-world repositories with documented CVEs, preserving full repository context. Its reproducible, containerized evaluation framework uses expert-defined rules to provide stable, auditable assessments of security, build quality, and generation stability.
arXiv Detail & Related papers (2025-08-25T15:11:11Z)
VLAI: A RoBERTa-Based Model for Automated Vulnerability Severity Classification [49.1574468325115]
Built on RoBERTa, VLAI is fine-tuned on over 600,000 real-world vulnerabilities. The model and dataset are open-source and integrated into the Vulnerability-Lookup service.
arXiv Detail & Related papers (2025-07-04T14:28:14Z)
Using LLMs for Security Advisory Investigations: How Far Are We? [2.916588882952662]
Large Language Models (LLMs) are increasingly used in software security, but their trustworthiness in generating accurate vulnerability advisories remains uncertain. This study investigates the ability of ChatGPT to (1) generate plausible security advisories from CVE-IDs, (2) differentiate real from fake CVE-IDs, and (3) extract CVE-IDs from advisory descriptions.
arXiv Detail & Related papers (2025-06-16T07:17:34Z)
PoCGen: Generating Proof-of-Concept Exploits for Vulnerabilities in Npm Packages [16.130469984234956]
PoCGen is a novel approach to autonomously generate and validate PoC exploits for vulnerabilities in npm packages. This is the first fully autonomous approach to use large language models (LLMs) in tandem with static and dynamic analysis techniques for PoC exploit generation.
arXiv Detail & Related papers (2025-06-05T12:37:33Z)
CyberGym: Evaluating AI Agents' Cybersecurity Capabilities with Real-World Vulnerabilities at Scale [46.76144797837242]
Large language model (LLM) agents are becoming increasingly skilled at handling cybersecurity tasks autonomously. Existing benchmarks fall short, often failing to capture real-world scenarios or being limited in scope. We introduce CyberGym, a large-scale and high-quality cybersecurity evaluation framework featuring 1,507 real-world vulnerabilities.
arXiv Detail & Related papers (2025-06-03T07:35:14Z)
SeCodePLT: A Unified Platform for Evaluating the Security of Code GenAI [58.29510889419971]
Existing benchmarks for evaluating the security risks and capabilities of code-generating large language models (LLMs) face several key limitations. We introduce a general and scalable benchmark construction framework that begins with manually validated, high-quality seed examples and expands them via targeted mutations. Applying this framework to Python, C/C++, and Java, we build SeCodePLT, a dataset of more than 5.9k samples spanning 44 CWE-based risk categories and three security capabilities.
arXiv Detail & Related papers (2024-10-14T21:17:22Z)
Trustworthiness in Retrieval-Augmented Generation Systems: A Survey [59.26328612791924]
Retrieval-Augmented Generation (RAG) has quickly grown into a pivotal paradigm in the development of Large Language Models (LLMs). We propose a unified framework that assesses the trustworthiness of RAG systems across six key dimensions: factuality, robustness, fairness, transparency, accountability, and privacy.
arXiv Detail & Related papers (2024-09-16T09:06:44Z)
ARVO: Atlas of Reproducible Vulnerabilities for Open Source Software [20.927909014593318]
We introduce ARVO: an Atlas of Reproducible Vulnerabilities in Open-source software. We reproduce more than 5,000 memory vulnerabilities across over 250 projects. Our dataset can be automatically updated as OSS-Fuzz finds new vulnerabilities.
arXiv Detail & Related papers (2024-08-04T22:13:14Z)
CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from Open-Source Software [0.0]
We implement a fully automated dataset collection tool and share an initial release of the resulting vulnerability dataset named CVEfixes. The dataset is enriched with meta-data such as programming language, and detailed code and security metrics at five levels of abstraction. CVEfixes supports various types of data-driven software security research, such as vulnerability prediction, vulnerability classification, vulnerability severity prediction, analysis of vulnerability-related code changes, and automated vulnerability repair.
arXiv Detail & Related papers (2021-07-19T11:34:09Z)
Autosploit: A Fully Automated Framework for Evaluating the Exploitability of Security Vulnerabilities [47.748732208602355]
Autosploit is an automated framework for evaluating the exploitability of vulnerabilities. It automatically tests the exploits on different configurations of the environment. It is able to identify the system properties that affect the ability to exploit a vulnerability in both noiseless and noisy environments.
arXiv Detail & Related papers (2020-06-30T18:49:18Z)

This list is automatically generated from the titles and abstracts of the papers on this site.