Related papers: Comparing AI Agents to Cybersecurity Professionals in Real-World Penetration Testing

Comparing AI Agents to Cybersecurity Professionals in Real-World Penetration Testing

URL: http://arxiv.org/abs/2512.09882v1
Date: Wed, 10 Dec 2025 18:12:29 GMT
Title: Comparing AI Agents to Cybersecurity Professionals in Real-World Penetration Testing
Authors: Justin W. Lin, Eliot Krzysztof Jones, Donovan Julian Jasper, Ethan Jun-shen Ho, Anna Wu, Arnold Tianyi Yang, Neil Perry, Andy Zou, Matt Fredrikson, J. Zico Kolter, Percy Liang, Dan Boneh, Daniel E. Ho,
Abstract summary: We present the first comprehensive evaluation of AI agents against human cybersecurity professionals.<n>We evaluate ten cybersecurity professionals alongside six existing AI agents and ARTEMIS, our new agent scaffold.<n>ARTEMIS placed second overall, discovering 9 valid vulnerabilities with an 82% valid submission rate.
Score: 83.48116811975787
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present the first comprehensive evaluation of AI agents against human cybersecurity professionals in a live enterprise environment. We evaluate ten cybersecurity professionals alongside six existing AI agents and ARTEMIS, our new agent scaffold, on a large university network consisting of ~8,000 hosts across 12 subnets. ARTEMIS is a multi-agent framework featuring dynamic prompt generation, arbitrary sub-agents, and automatic vulnerability triaging. In our comparative study, ARTEMIS placed second overall, discovering 9 valid vulnerabilities with an 82% valid submission rate and outperforming 9 of 10 human participants. While existing scaffolds such as Codex and CyAgent underperformed relative to most human participants, ARTEMIS demonstrated technical sophistication and submission quality comparable to the strongest participants. We observe that AI agents offer advantages in systematic enumeration, parallel exploitation, and cost -- certain ARTEMIS variants cost $18/hour versus $60/hour for professional penetration testers. We also identify key capability gaps: AI agents exhibit higher false-positive rates and struggle with GUI-based tasks.

Related papers

"Are You Sure?": An Empirical Study of Human Perception Vulnerability in LLM-Driven Agentic Systems [21.769264539684333]
We present the first large-scale empirical study with 303 participants to measure human susceptibility to AMD.<n>Our 10 key findings reveal significant vulnerabilities and provide future defense perspectives.<n>With experiential learning based on HAT-Lab, over 90% of users who perceive risks report increased caution against AMD.
arXiv Detail & Related papers (2026-02-24T17:23:11Z)
LIMI: Less is More for Agency [49.63355240818081]
LIMI (Less Is More for Intelligent Agency) demonstrates that agency follows radically different development principles.<n>We show that sophisticated agentic intelligence can emerge from minimal but strategically curated demonstrations of autonomous behavior.<n>Our findings establish the Agency Efficiency Principle: machine autonomy emerges not from data abundance but from strategic curation of high-quality agentic demonstrations.
arXiv Detail & Related papers (2025-09-22T10:59:32Z)
Security Challenges in AI Agent Deployment: Insights from a Large Scale Public Competition [101.86739402748995]
We run the largest public red-teaming competition to date, targeting 22 frontier AI agents across 44 realistic deployment scenarios.<n>We build the Agent Red Teaming benchmark and evaluate it across 19 state-of-the-art models.<n>Our findings highlight critical and persistent vulnerabilities in today's AI agents.
arXiv Detail & Related papers (2025-07-28T05:13:04Z)
OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety [58.201189860217724]
We introduce OpenAgentSafety, a comprehensive framework for evaluating agent behavior across eight critical risk categories.<n>Unlike prior work, our framework evaluates agents that interact with real tools, including web browsers, code execution environments, file systems, bash shells, and messaging platforms.<n>It combines rule-based analysis with LLM-as-judge assessments to detect both overt and subtle unsafe behaviors.
arXiv Detail & Related papers (2025-07-08T16:18:54Z)
Evaluating AI cyber capabilities with crowdsourced elicitation [0.0]
We propose elicitation bounties as a practical mechanism for maintaining timely, cost-effective situational awareness of emerging AI capabilities.<n>Applying METR's methodology, we found that AI agents can reliably solve cyber challenges requiring one hour or less of effort from a median human CTF participant.
arXiv Detail & Related papers (2025-05-26T12:40:32Z)
CAI: An Open, Bug Bounty-Ready Cybersecurity AI [0.3889280708089931]
Cybersecurity AI (CAI) is an open-source framework that democratizes advanced security testing through specialized AI agents.<n>We demonstrate that CAI consistently outperforms state-of-the-art results in CTF benchmarks.<n>CAI reached top-30 in Spain and top-500 worldwide on Hack The Box within a week.
arXiv Detail & Related papers (2025-04-08T13:22:09Z)
Fully Autonomous AI Agents Should Not be Developed [50.61667544399082]
This paper argues that fully autonomous AI agents should not be developed.<n>In support of this position, we build from prior scientific literature and current product marketing to delineate different AI agent levels.<n>Our analysis reveals that risks to people increase with the autonomy of a system.
arXiv Detail & Related papers (2025-02-04T19:00:06Z)
RE-Bench: Evaluating frontier AI R&D capabilities of language model agents against human experts [4.06186944042499]
We introduce RE-Bench, which consists of 7 challenging, open-ended ML research engineering environments and data from 71 8-hour attempts by 61 human experts.<n>We find that the best AI agents achieve a score 4x higher than human experts when both are given a total time budget of 2 hours per environment.<n>Humans currently display better returns to increasing time budgets, narrowly exceeding the top AI agent scores given an 8-hour budget, and achieving 2x the score of the top AI agent when both are given 32 total hours (across different attempts).
arXiv Detail & Related papers (2024-11-22T18:30:46Z)
HAICOSYSTEM: An Ecosystem for Sandboxing Safety Risks in Human-AI Interactions [95.49509269498367]
We present HAICOSYSTEM, a framework examining AI agent safety within diverse and complex social interactions.<n>We run 1840 simulations based on 92 scenarios across seven domains (e.g., healthcare, finance, education)<n>Our experiments show that state-of-the-art LLMs, both proprietary and open-sourced, exhibit safety risks in over 50% cases.
arXiv Detail & Related papers (2024-09-24T19:47:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.