Related papers: BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems

BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems

URL: http://arxiv.org/abs/2505.15216v2
Date: Thu, 10 Jul 2025 02:10:30 GMT
Title: BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems
Authors: Andy K. Zhang, Joey Ji, Celeste Menders, Riya Dulepet, Thomas Qin, Ron Y. Wang, Junrong Wu, Kyleen Liao, Jiliang Li, Jinghan Hu, Sara Hong, Nardos Demilew, Shivatmica Murgai, Jason Tran, Nishka Kacheria, Ethan Ho, Denis Liu, Lauren McLane, Olivia Bruvik, Dai-Rong Han, Seungwoo Kim, Akhil Vyas, Cuiyuanxiu Chen, Ryan Li, Weiran Xu, Jonathan Z. Ye, Prerit Choudhary, Siddharth M. Bhatia, Vikram Sivashankar, Yuxuan Bao, Dawn Song, Dan Boneh, Daniel E. Ho, Percy Liang,
Abstract summary: We introduce the first framework to capture offensive and defensive cyber-capabilities in evolving real-world systems.<n>To capture the vulnerability lifecycle, we define three task types: Detect (detecting a new vulnerability), Exploit (exploiting a specific vulnerability), and Patch (patching a specific vulnerability)<n>We evaluate 8 agents: Claude Code, OpenAI Codex CLI with o3-high and o4-mini, and custom agents with o3-high, GPT-4.1, Gemini 2.5 Pro Preview, Claude 3.7 Sonnet Thinking, and DeepSeek-R1.
Score: 62.17474934536671
License: http://creativecommons.org/licenses/by/4.0/
Abstract: AI agents have the potential to significantly alter the cybersecurity landscape. Here, we introduce the first framework to capture offensive and defensive cyber-capabilities in evolving real-world systems. Instantiating this framework with BountyBench, we set up 25 systems with complex, real-world codebases. To capture the vulnerability lifecycle, we define three task types: Detect (detecting a new vulnerability), Exploit (exploiting a specific vulnerability), and Patch (patching a specific vulnerability). For Detect, we construct a new success indicator, which is general across vulnerability types and provides localized evaluation. We manually set up the environment for each system, including installing packages, setting up server(s), and hydrating database(s). We add 40 bug bounties, which are vulnerabilities with monetary awards of \$10-\$30,485, covering 9 of the OWASP Top 10 Risks. To modulate task difficulty, we devise a new strategy based on information to guide detection, interpolating from identifying a zero day to exploiting a specific vulnerability. We evaluate 8 agents: Claude Code, OpenAI Codex CLI with o3-high and o4-mini, and custom agents with o3-high, GPT-4.1, Gemini 2.5 Pro Preview, Claude 3.7 Sonnet Thinking, and DeepSeek-R1. Given up to three attempts, the top-performing agents are OpenAI Codex CLI: o3-high (12.5% on Detect, mapping to \$3,720; 90% on Patch, mapping to \$14,152), Custom Agent with Claude 3.7 Sonnet Thinking (67.5% on Exploit), and OpenAI Codex CLI: o4-mini (90% on Patch, mapping to \$14,422). OpenAI Codex CLI: o3-high, OpenAI Codex CLI: o4-mini, and Claude Code are more capable at defense, achieving higher Patch scores of 90%, 90%, and 87.5%, compared to Exploit scores of 47.5%, 32.5%, and 57.5% respectively; while the custom agents are relatively balanced between offense and defense, achieving Exploit scores of 37.5-67.5% and Patch scores of 35-60%.

Related papers

Measuring Harmfulness of Computer-Using Agents [7.662513862243521]
Computer-using agents (CUAs) autonomously control computers to perform multi-step actions.<n> CUAHarm consists of 104 expert-written realistic misuse risks.<n>We evaluate open-source and proprietary LMs, such as Claude Sonnet, GPT-4o, Gemini Pro 1.5, Llama-3.3-70B, and Mistral Large 2.
arXiv Detail & Related papers (2025-07-31T07:02:19Z)
CyberGym: Evaluating AI Agents' Cybersecurity Capabilities with Real-World Vulnerabilities at Scale [46.76144797837242]
Large language model (LLM) agents are becoming increasingly skilled at handling cybersecurity tasks autonomously.<n>Existing benchmarks fall short, often failing to capture real-world scenarios or being limited in scope.<n>We introduce CyberGym, a large-scale and high-quality cybersecurity evaluation framework featuring 1,507 real-world vulnerabilities.
arXiv Detail & Related papers (2025-06-03T07:35:14Z)
Stealthy Patch-Wise Backdoor Attack in 3D Point Cloud via Curvature Awareness [52.07366900097567]
Backdoor attacks pose a severe threat to deep neural networks (DNNs)<n>Existing 3D point cloud backdoor attacks rely on sample-wise global modifications, which suffer from low imperceptibility.<n>We propose the Stealthy Patch-Wise Backdoor Attack (SPBA), the first patch-wise backdoor attack framework for 3D point clouds.
arXiv Detail & Related papers (2025-03-12T12:30:59Z)
Fine-Grained 1-Day Vulnerability Detection in Binaries via Patch Code Localization [12.73365645156957]
1-day vulnerabilities in binaries have become a major threat to software security.<n>patch presence test is one of the effective ways to detect the vulnerability.<n>We propose a novel approach named PLocator, which leverages stable values from both the patch code and its context.
arXiv Detail & Related papers (2025-01-29T04:35:37Z)
There are More Fish in the Sea: Automated Vulnerability Repair via Binary Templates [4.907610470063863]
We propose a template-based automated vulnerability repair approach for Java binaries.<n>Experiments on the Vul4J dataset demonstrate that TemVUR successfully repairs 11 vulnerabilities.<n>To assess the generalizability of TemVUR, we curate the ManyVuls4J dataset.
arXiv Detail & Related papers (2024-11-27T06:59:45Z)
RedCode: Risky Code Execution and Generation Benchmark for Code Agents [50.81206098588923]
RedCode is a benchmark for risky code execution and generation. RedCode-Exec provides challenging prompts that could lead to risky code execution. RedCode-Gen provides 160 prompts with function signatures and docstrings as input to assess whether code agents will follow instructions.
arXiv Detail & Related papers (2024-11-12T13:30:06Z)
Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models [33.1538965735133]
Cybench is a framework for specifying cybersecurity tasks and evaluating agents on those tasks.<n>We include 40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions.<n>We construct a cybersecurity agent and evaluate 8 models: GPT-4o, OpenAI o1-preview, Claude 3 Opus, Claude 3.5 Sonnet, Mixtral 8x22b Instruct, Gemini 1.5 Pro, Llama 3 70B Chat, and Llama 3.1 405B Instruct.
arXiv Detail & Related papers (2024-08-15T17:23:10Z)
On the Resilience of LLM-Based Multi-Agent Collaboration with Faulty Agents [58.79302663733703]
Large language model-based multi-agent systems have shown great abilities across various tasks due to the collaboration of expert agents.<n>However, the impact of clumsy or even malicious agents, on the overall performance of the system remains underexplored.<n>This paper investigates what is the resilience of various system structures under faulty agents.
arXiv Detail & Related papers (2024-08-02T03:25:20Z)
PatchFinder: A Two-Phase Approach to Security Patch Tracing for Disclosed Vulnerabilities in Open-Source Software [15.867607171943698]
We propose a two-phase framework with end-to-end correlation learning for better-tracing security patches. PatchFinder achieves a Recall@10 of 80.63% and a Mean Reciprocal Rank (MRR) of 0.7951. When applying PatchFinder in practice, we initially identified 533 patch commits and submitted them to the official, 482 of which have been confirmed by CVE Numbering Authorities.
arXiv Detail & Related papers (2024-07-24T07:46:24Z)
Dissecting Adversarial Robustness of Multimodal LM Agents [70.2077308846307]
We manually create 200 targeted adversarial tasks and evaluation scripts in a realistic threat model on top of VisualWebArena.<n>We find that we can successfully break latest agents that use black-box frontier LMs, including those that perform reflection and tree search.<n>We also use ARE to rigorously evaluate how the robustness changes as new components are added.
arXiv Detail & Related papers (2024-06-18T17:32:48Z)
(De)Randomized Smoothing for Certifiable Defense against Patch Attacks [136.79415677706612]
We introduce a certifiable defense against patch attacks that guarantees for a given image and patch attack size. Our method is related to the broad class of randomized smoothing robustness schemes. Our results effectively establish a new state-of-the-art of certifiable defense against patch attacks on CIFAR-10 and ImageNet.
arXiv Detail & Related papers (2020-02-25T08:39:46Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.