CVE-Factory: Scaling Expert-Level Agentic Tasks for Code Security Vulnerability
- URL: http://arxiv.org/abs/2602.03012v1
- Date: Tue, 03 Feb 2026 02:27:16 GMT
- Title: CVE-Factory: Scaling Expert-Level Agentic Tasks for Code Security Vulnerability
- Authors: Xianzhen Luo, Jingyuan Zhang, Shiqi Zhou, Rain Huang, Chuan Xiao, Qingfu Zhu, Zhiyuan Ma, Xing Yue, Yang Yue, Wencong Zeng, Wanxiang Che,
- Abstract summary: We present CVE-Factory, the first multi-agent framework to achieve expert-level quality in automatically transforming sparse CVE metadata into executable vulnerability tasks. It is also evaluated on the latest realistic vulnerabilities and achieves a 66.2% verified success rate. We synthesize over 1,000 executable training environments, the first large-scale scaling of agentic tasks in code security.
- Score: 50.57373283154859
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluating and improving the security capabilities of code agents requires high-quality, executable vulnerability tasks. However, existing works rely on costly, unscalable manual reproduction and suffer from outdated data distributions. To address these limitations, we present CVE-Factory, the first multi-agent framework to achieve expert-level quality in automatically transforming sparse CVE metadata into fully executable agentic tasks. Cross-validation against human expert reproductions shows that CVE-Factory achieves 95% solution correctness and 96% environment fidelity, confirming its expert-level quality. It is also evaluated on the latest realistic vulnerabilities and achieves a 66.2% verified success rate. This automation enables two downstream contributions. First, we construct LiveCVEBench, a continuously updated benchmark of 190 tasks spanning 14 languages and 153 repositories that captures emerging threats, including AI-tooling vulnerabilities. Second, we synthesize over 1,000 executable training environments, the first large-scale scaling of agentic tasks in code security. Fine-tuned Qwen3-32B improves from 5.3% to 35.8% on LiveCVEBench, surpassing Claude 4.5 Sonnet, with gains generalizing to Terminal Bench (12.5% to 31.3%). We open-source CVE-Factory, LiveCVEBench, Abacus-cve (fine-tuned model), the training dataset, and a leaderboard. All resources are available at https://github.com/livecvebench/CVE-Factory .
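The "sparse CVE metadata" such a pipeline starts from can be pictured with a minimal sketch. The NVD JSON API 2.0 endpoint and its `cveId` query parameter are real, but the record, helper function, and CVE identifier below are illustrative assumptions, not part of the paper's released code:

```python
# Hypothetical sketch of the sparse input a pipeline like CVE-Factory
# consumes before agents synthesize the executable task around it.
from urllib.parse import urlencode

NVD_API = "https://services.nvd.nist.gov/rest/json/cves/2.0"

def nvd_query_url(cve_id: str) -> str:
    """Build the NVD API 2.0 lookup URL for a single CVE entry."""
    return f"{NVD_API}?{urlencode({'cveId': cve_id})}"

# A typical sparse record: an ID, a one-line description, and a CWE tag.
# Everything that makes the task executable (repo snapshot, build steps,
# exploit, verifier) must still be synthesized downstream.
sparse_record = {
    "cve_id": "CVE-2024-0001",  # placeholder identifier
    "description": "Heap overflow in example parser",
    "cwe": "CWE-122",
    "metadata_url": nvd_query_url("CVE-2024-0001"),
}

print(sparse_record["metadata_url"])
```

The gap between this handful of fields and a fully verified environment is exactly what the framework's agents automate.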
Related papers
- From Leaderboard to Deployment: Code Quality Challenges in AV Perception Repositories [4.603321798937855]
This study systematically analyzed 178 unique models from the KITTI and NuScenes 3D Object Detection leaderboards. Our findings revealed that only 7.3% of the studied repositories meet basic production-readiness criteria. The adoption of Continuous Integration/Continuous Deployment pipelines was correlated with better code maintainability.
arXiv Detail & Related papers (2026-03-02T18:54:28Z)
- Outrunning LLM Cutoffs: A Live Kernel Crash Resolution Benchmark for All [57.23434868678603]
Live-kBench is an evaluation framework for self-evolving benchmarks that scrapes and evaluates agents on freshly discovered kernel bugs. kEnv is an agent-agnostic crash-resolution environment for kernel compilation, execution, and feedback. Using kEnv, we benchmark three state-of-the-art agents, showing that they resolve 74% of crashes on the first attempt.
arXiv Detail & Related papers (2026-02-02T19:06:15Z)
- SWE-Universe: Scale Real-World Verifiable Environments to Millions [84.63665266236963]
SWE-Universe is a framework for automatically constructing real-world software engineering (SWE) verifiable environments from GitHub pull requests (PRs). We propose a building agent powered by an efficient custom-trained model to overcome the prevalent challenges of automatic building. We demonstrate the profound value of our environments through large-scale agentic mid-training and reinforcement learning.
arXiv Detail & Related papers (2026-02-02T17:20:30Z)
- RepoGenesis: Benchmarking End-to-End Microservice Generation from Readme to Repository [52.98970048197381]
RepoGenesis is the first multilingual benchmark for repository-level end-to-end web microservice generation. It consists of 106 repositories (60 Python, 46 Java) across 18 domains and 11 frameworks, with 1,258 API endpoints and 2,335 verified test cases. Results reveal that despite high AC (up to 73.91%) and DSR (up to 100%), the best-performing system achieves only 23.67% Pass@1 on Python and 21.45% on Java.
arXiv Detail & Related papers (2026-01-20T13:19:20Z)
- From CVE Entries to Verifiable Exploits: An Automated Multi-Agent Framework for Reproducing CVEs [23.210122086674048]
CVE-GENIE is an automated framework designed to reproduce real-world vulnerabilities. It reproduces 51% (428 of 841) of CVEs published in 2024-2025, complete with their verifiable exploits, at an average cost of $2.77 per CVE. Our pipeline offers a robust method to generate reproducible CVE benchmarks, valuable for diverse applications.
arXiv Detail & Related papers (2025-09-01T23:37:44Z)
- Deployability-Centric Infrastructure-as-Code Generation: An LLM-based Iterative Framework [19.710477636179426]
Infrastructure-as-Code (IaC) generation holds significant promise for automating cloud infrastructure provisioning. Recent advances in Large Language Models (LLMs) present a promising opportunity to democratize IaC development. However, recent evaluations focus on syntactic correctness while ignoring deployability, the critical measure of IaC template utility. We address this gap through two contributions: (1) IaCGen, an LLM-based deployability-centric framework that uses an iterative feedback mechanism to generate IaC templates, and (2) DPIaC-Eval, a deployability-centric IaC template benchmark.
arXiv Detail & Related papers (2025-06-05T22:53:12Z)
- BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems [62.17474934536671]
We introduce the first framework to capture offensive and defensive cyber-capabilities in evolving real-world systems. To capture the vulnerability lifecycle, we define three task types: Detect (detecting a new vulnerability), Exploit (exploiting a specific vulnerability), and Patch (patching a specific vulnerability). We evaluate 8 agents: Claude Code, OpenAI Codex CLI with o3-high and o4-mini, and custom agents with o3-high, GPT-4.1, Gemini 2.5 Pro Preview, Claude 3.7 Sonnet Thinking, and DeepSeek-R1.
arXiv Detail & Related papers (2025-05-21T07:44:52Z)
- SeCodePLT: A Unified Platform for Evaluating the Security of Code GenAI [58.29510889419971]
Existing benchmarks for evaluating the security risks and capabilities of code-generating large language models (LLMs) face several key limitations. We introduce a general and scalable benchmark construction framework that begins with manually validated, high-quality seed examples and expands them via targeted mutations. Applying this framework to Python, C/C++, and Java, we build SeCodePLT, a dataset of more than 5.9k samples spanning 44 CWE-based risk categories and three security capabilities.
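The seed-and-mutate expansion strategy can be pictured with a minimal sketch. This is an illustrative assumption about one kind of targeted mutation (identifier renaming that preserves the vulnerability), not the SeCodePLT implementation:

```python
# Hypothetical sketch: expand one manually validated vulnerable seed
# into several variants via a simple targeted mutation, so a small seed
# set yields a larger labeled dataset.
import re

SEED = "strcpy(buf, user_input);  /* CWE-787: unchecked copy */"

def rename_identifier(code: str, old: str, new: str) -> str:
    """Rename a single identifier; the vulnerability itself is untouched."""
    return re.sub(rf"\b{re.escape(old)}\b", new, code)

variants = [
    rename_identifier(SEED, "buf", name)
    for name in ("dest", "out_buffer", "msg")
]
for v in variants:
    print(v)
```

Real frameworks apply richer mutations (control-flow and API-level edits), but the principle is the same: each variant inherits the seed's validated label.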
arXiv Detail & Related papers (2024-10-14T21:17:22Z)
- Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models [33.1538965735133]
Cybench is a framework for specifying cybersecurity tasks and evaluating agents on those tasks. We include 40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions. We construct a cybersecurity agent and evaluate 8 models: GPT-4o, OpenAI o1-preview, Claude 3 Opus, Claude 3.5 Sonnet, Mixtral 8x22b Instruct, Gemini 1.5 Pro, Llama 3 70B Chat, and Llama 3.1 405B Instruct.
arXiv Detail & Related papers (2024-08-15T17:23:10Z)
- Cybersecurity Defenses: Exploration of CVE Types through Attack Descriptions [1.0474508494260908]
VULDAT is a classification tool that uses the MPNet sentence transformer to identify system vulnerabilities from attack descriptions.
Our model was applied to 100 attack techniques from the ATT&CK repository and 685 issues from the CVE repository.
Our findings indicate that our model achieves the best performance, with an F1 score of 0.85, precision of 0.86, and recall of 0.83.
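As a quick sanity check on the reported numbers, F1 is the harmonic mean of precision and recall, so the two stated values should approximately reproduce the stated F1:

```python
# Consistency check of the reported VULDAT metrics.
def f1_score(precision: float, recall: float) -> float:
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1 = f1_score(0.86, 0.83)
print(f1)  # ~0.8447
```

The result (~0.845) is close to the reported 0.85; the small gap is expected, since the published precision and recall are themselves already rounded to two decimals.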
arXiv Detail & Related papers (2024-07-09T11:08:35Z)
- VGX: Large-Scale Sample Generation for Boosting Learning-Based Software Vulnerability Analyses [30.65722096096949]
This paper proposes VGX, a new technique aimed at large-scale generation of high-quality vulnerability datasets.
VGX materializes vulnerability-injection code editing in identified contexts using patterns of such edits.
For in-the-wild sample production, VGX generated 150,392 vulnerable samples, from which we randomly chose 10% to assess how much these samples help vulnerability detection, localization, and repair.
arXiv Detail & Related papers (2023-10-24T01:05:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.