CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale
- URL: http://arxiv.org/abs/2506.02548v2
- Date: Wed, 08 Oct 2025 06:32:58 GMT
- Title: CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale
- Authors: Zhun Wang, Tianneng Shi, Jingxuan He, Matthew Cai, Jialin Zhang, Dawn Song,
- Abstract summary: We introduce CyberGym, a large-scale benchmark featuring 1,507 real-world vulnerabilities across 188 software projects.<n>We show that CyberGym leads to the discovery of 35 zero-day vulnerabilities and 17 historically incomplete patches.<n>These results underscore that CyberGym is not only a robust benchmark for measuring AI's progress in cybersecurity but also a platform for creating direct, real-world security impact.
- Score: 45.97598662617568
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: AI agents have significant potential to reshape cybersecurity, making a thorough assessment of their capabilities critical. However, existing evaluations fall short, because they are based on small-scale benchmarks and only measure static outcomes, failing to capture the full, dynamic range of real-world security challenges. To address these limitations, we introduce CyberGym, a large-scale benchmark featuring 1,507 real-world vulnerabilities across 188 software projects. Adjustable to different vulnerability analysis settings, CyberGym primarily tasks agents with generating a proof-of-concept test that reproduces a vulnerability, given only its text description and the corresponding codebase. Our extensive evaluation highlights that CyberGym effectively differentiates agents' and models' cybersecurity capabilities. Even the top-performing combinations only achieve a ~20% success rate, demonstrating the overall difficulty of CyberGym. Beyond static benchmarking, we show that CyberGym leads to the discovery of 35 zero-day vulnerabilities and 17 historically incomplete patches. These results underscore that CyberGym is not only a robust benchmark for measuring AI's progress in cybersecurity but also a platform for creating direct, real-world security impact.
Related papers
- Cybersecurity AI Benchmark (CAIBench): A Meta-Benchmark for Evaluating Cybersecurity AI Agents [0.36134114973155557]
Existing benchmarks assess isolated skills rather than integrated performance.<n>We present the Cybersecurity AI Benchmark (CAIBench), a modular meta-benchmark framework.<n>We find that proper matches improve up to 2.6$times$ variance in Attack and Defense CTFs.
arXiv Detail & Related papers (2025-10-28T11:36:20Z) - PACEbench: A Framework for Evaluating Practical AI Cyber-Exploitation Capabilities [42.61805002268063]
We introduce PACEbench, a practical AI cyber-exploitation benchmark.<n>PACEbench comprises four scenarios spanning single, blended, chained, and defense vulnerability exploitations.<n>We propose PACEagent, a novel agent that emulates human penetration testers by supporting multi-phase reconnaissance, analysis, and exploitation.
arXiv Detail & Related papers (2025-10-13T17:50:25Z) - Indirect Prompt Injections: Are Firewalls All You Need, or Stronger Benchmarks? [58.48689960350828]
We show that a simple, modular and model-agnostic defense operating at the agent--tool interface achieves perfect security with high utility.<n>We employ a defense based on two firewalls: a Tool-Input Firewall (Minimizer) and a Tool-Output Firewall (Sanitizer)
arXiv Detail & Related papers (2025-10-06T18:09:02Z) - The Application of Transformer-Based Models for Predicting Consequences of Cyber Attacks [0.4604003661048266]
Threat Modeling can provide critical support to cybersecurity professionals, enabling them to take timely action and allocate resources that could be used elsewhere.<n>Recently, there has been a pressing need for automated methods to assess attack descriptions and forecast the future consequences of cyberattacks.<n>This study examines how Natural Language Processing (NLP) and deep learning can be applied to analyze the potential impact of cyberattacks.
arXiv Detail & Related papers (2025-08-18T15:46:36Z) - Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition [101.86739402748995]
We run the largest public red-teaming competition to date, targeting 22 frontier AI agents across 44 realistic deployment scenarios.<n>We build the Agent Red Teaming benchmark and evaluate it across 19 state-of-the-art models.<n>Our findings highlight critical and persistent vulnerabilities in today's AI agents.
arXiv Detail & Related papers (2025-07-28T05:13:04Z) - Running in CIRCLE? A Simple Benchmark for LLM Code Interpreter Security [0.0]
Large language models (LLMs) increasingly integrate native code interpreters, enabling real-time execution capabilities.<n>These integrations introduce potential system-level cybersecurity threats, fundamentally different from prompt-based vulnerabilities.<n>We propose CIRCLE (Code-Interpreter Resilience Check for LLM Exploits), a simple benchmark comprising 1,260 prompts targeting CPU, memory, and disk resource exhaustion.
arXiv Detail & Related papers (2025-07-25T16:06:16Z) - Bridging AI and Software Security: A Comparative Vulnerability Assessment of LLM Agent Deployment Paradigms [1.03121181235382]
Large Language Model (LLM) agents face security vulnerabilities spanning AI-specific and traditional software domains.<n>This study bridges this gap through comparative evaluation of Function Calling architecture and Model Context Protocol (MCP) deployment paradigms.<n>We tested 3,250 attack scenarios across seven language models, evaluating simple, composed, and chained attacks targeting both AI-specific threats and software vulnerabilities.
arXiv Detail & Related papers (2025-07-08T18:24:28Z) - OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety [58.201189860217724]
We introduce OpenAgentSafety, a comprehensive framework for evaluating agent behavior across eight critical risk categories.<n>Unlike prior work, our framework evaluates agents that interact with real tools, including web browsers, code execution environments, file systems, bash shells, and messaging platforms.<n>It combines rule-based analysis with LLM-as-judge assessments to detect both overt and subtle unsafe behaviors.
arXiv Detail & Related papers (2025-07-08T16:18:54Z) - SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks [11.97472024483841]
SEC-bench is the first fully automated benchmarking framework for evaluating large language model (LLM) agents.<n>Our framework automatically creates high-quality software vulnerability datasets with reproducible artifacts at a cost of only $0.87 per instance.<n>A comprehensive evaluation of state-of-the-art LLM code agents reveals significant performance gaps.
arXiv Detail & Related papers (2025-06-13T13:54:30Z) - AGENTFUZZER: Generic Black-Box Fuzzing for Indirect Prompt Injection against LLM Agents [54.29555239363013]
We propose a generic black-box fuzzing framework, AgentXploit, to automatically discover and exploit indirect prompt injection vulnerabilities.<n>We evaluate AgentXploit on two public benchmarks, AgentDojo and VWA-adv, where it achieves 71% and 70% success rates against agents based on o3-mini and GPT-4o.<n>We apply our attacks in real-world environments, successfully misleading agents to navigate to arbitrary URLs, including malicious sites.
arXiv Detail & Related papers (2025-05-09T07:40:17Z) - Llama-3.1-FoundationAI-SecurityLLM-Base-8B Technical Report [50.268821168513654]
We present Foundation-Sec-8B, a cybersecurity-focused large language model (LLMs) built on the Llama 3.1 architecture.<n>We evaluate it across both established and new cybersecurity benchmarks, showing that it matches Llama 3.1-70B and GPT-4o-mini in certain cybersecurity-specific tasks.<n>By releasing our model to the public, we aim to accelerate progress and adoption of AI-driven tools in both public and private cybersecurity contexts.
arXiv Detail & Related papers (2025-04-28T08:41:12Z) - Towards Trustworthy GUI Agents: A Survey [64.6445117343499]
This survey examines the trustworthiness of GUI agents in five critical dimensions.<n>We identify major challenges such as vulnerability to adversarial attacks, cascading failure modes in sequential decision-making.<n>As GUI agents become more widespread, establishing robust safety standards and responsible development practices is essential.
arXiv Detail & Related papers (2025-03-30T13:26:00Z) - Reasoning with LLMs for Zero-Shot Vulnerability Detection [0.9208007322096533]
We present textbfVulnSage, a comprehensive evaluation framework and a curated dataset from diverse, large-scale open-source system software projects.<n>The framework supports multi-granular analysis across function, file, and inter-function levels.<n>It employs four diverse zero-shot prompt strategies: Baseline, Chain-of-context, Think, and Think & verify.
arXiv Detail & Related papers (2025-03-22T23:59:17Z) - CVE-Bench: A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities [6.752938800468733]
Large language model (LLM) agents are increasingly capable of autonomously conducting cyberattacks.<n>Existing benchmarks fall short as they are limited to abstracted Capture the Flag competitions or lack comprehensive coverage.<n>We introduce CVE-Bench, a real-world cybersecurity benchmark based on critical-severity Common Vulnerabilities and Exposures.
arXiv Detail & Related papers (2025-03-21T17:32:32Z) - CyberLLMInstruct: A Pseudo-malicious Dataset Revealing Safety-performance Trade-offs in Cyber Security LLM Fine-tuning [2.549390156222399]
The integration of large language models into cyber security applications presents both opportunities and critical safety risks.<n>We introduce CyberLLMInstruct, a dataset of 54,928 pseudo-malicious instruction-response pairs spanning cyber security tasks.
arXiv Detail & Related papers (2025-03-12T12:29:27Z) - OCCULT: Evaluating Large Language Models for Offensive Cyber Operation Capabilities [0.0]
We demonstrate a new approach to assessing AI's progress towards enabling and scaling real-world offensive cyber operations.<n>We detail OCCULT, a lightweight operational evaluation framework that allows cyber security experts to contribute to rigorous and repeatable measurement.<n>We find that there has been significant recent advancement in the risks of AI being used to scale realistic cyber threats.
arXiv Detail & Related papers (2025-02-18T19:33:14Z) - Dissecting Adversarial Robustness of Multimodal LM Agents [70.2077308846307]
We manually create 200 targeted adversarial tasks and evaluation scripts in a realistic threat model on top of VisualWebArena.<n>We find that we can successfully break latest agents that use black-box frontier LMs, including those that perform reflection and tree search.<n>We also use ARE to rigorously evaluate how the robustness changes as new components are added.
arXiv Detail & Related papers (2024-06-18T17:32:48Z) - Demystifying RCE Vulnerabilities in LLM-Integrated Apps [20.01949990700702]
Frameworks like LangChain aid LLM-integrated app development, offering code execution utility/APIs for custom actions.<n>These capabilities theoretically introduce Remote Code Execution (RCE) vulnerabilities, enabling remote code execution through prompt injections.<n>No prior research systematically investigates these frameworks' RCE vulnerabilities or their impact on applications and exploitation consequences.
arXiv Detail & Related papers (2023-09-06T11:39:37Z) - SecureFalcon: Are We There Yet in Automated Software Vulnerability Detection with LLMs? [3.566250952750758]
We introduce SecureFalcon, an innovative model architecture with only 121 million parameters derived from the Falcon-40B model.<n>SecureFalcon achieves 94% accuracy in binary classification and up to 92% in multiclassification, with instant CPU inference times.
arXiv Detail & Related papers (2023-07-13T08:34:09Z) - CodeLMSec Benchmark: Systematically Evaluating and Finding Security
Vulnerabilities in Black-Box Code Language Models [58.27254444280376]
Large language models (LLMs) for automatic code generation have achieved breakthroughs in several programming tasks.
Training data for these models is usually collected from the Internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities.
This unsanitized training data can cause the language models to learn these vulnerabilities and propagate them during the code generation procedure.
arXiv Detail & Related papers (2023-02-08T11:54:07Z) - Proceedings of the Artificial Intelligence for Cyber Security (AICS)
Workshop at AAAI 2022 [55.573187938617636]
The workshop will focus on the application of AI to problems in cyber security.
Cyber systems generate large volumes of data, utilizing this effectively is beyond human capabilities.
arXiv Detail & Related papers (2022-02-28T18:27:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.