LLM Agents for Automated Web Vulnerability Reproduction: Are We There Yet?
- URL: http://arxiv.org/abs/2510.14700v1
- Date: Thu, 16 Oct 2025 14:04:46 GMT
- Title: LLM Agents for Automated Web Vulnerability Reproduction: Are We There Yet?
- Authors: Bin Liu, Yanjie Zhao, Guoai Xu, Haoyu Wang,
- Abstract summary: Large language model (LLM) agents have demonstrated remarkable capabilities in software engineering and cybersecurity tasks.<n>Recent advances suggest promising potential, but challenges remain in applying LLM agents to real-world web vulnerability reproduction scenarios.<n>This paper presents the first comprehensive evaluation of state-of-the-art LLM agents for automated web vulnerability reproduction.
- Score: 9.817896112083647
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language model (LLM) agents have demonstrated remarkable capabilities in software engineering and cybersecurity tasks, including code generation, vulnerability discovery, and automated testing. One critical but underexplored application is automated web vulnerability reproduction, which transforms vulnerability reports into working exploits. Although recent advances suggest promising potential, challenges remain in applying LLM agents to real-world web vulnerability reproduction scenarios. In this paper, we present the first comprehensive evaluation of state-of-the-art LLM agents for automated web vulnerability reproduction. We systematically assess 20 agents from software engineering, cybersecurity, and general domains across 16 dimensions, including technical capabilities, environment adaptability, and user experience factors, on 3 representative web vulnerabilities. Based on the results, we select three top-performing agents (OpenHands, SWE-agent, and CAI) for in-depth evaluation on our benchmark dataset of 80 real-world CVEs spanning 7 vulnerability types and 6 web technologies. Our results reveal that while LLM agents achieve reasonable success on simple library-based vulnerabilities, they consistently fail on complex service-based vulnerabilities requiring multi-component environments. Complex environment configurations and authentication barriers create a gap where agents can execute exploit code but fail to trigger actual vulnerabilities. We observe high sensitivity to input guidance, with performance degrading by over 33% under incomplete authentication information. Our findings highlight the significant gap between current LLM agent capabilities and the demands of reliable automated vulnerability reproduction, emphasizing the need for advances in environmental adaptation and autonomous problem-solving capabilities.
Related papers
- VulnResolver: A Hybrid Agent Framework for LLM-Based Automated Vulnerability Issue Resolution [27.16762667503862]
VulnResolver is the first hybrid agent framework for automated vulnerability issue resolution.<n>It unites the adaptability of autonomous agents with the stability of workflow-guided repair through two specialized agents.<n>VulnResolver resolves 75% of issues on SEC-bench Lite, achieving the best resolution performance.
arXiv Detail & Related papers (2026-01-20T13:09:16Z) - Automated Vulnerability Validation and Verification: A Large Language Model Approach [7.482522010482827]
This paper introduces an end-to-end multi-step pipeline leveraging generative AI, specifically large language models (LLMs)<n>Our approach extracts information from CVE disclosures in the National Vulnerability Database.<n>It augments it with external public knowledge (e.g., threat advisories, code snippets) using Retrieval-Augmented Generation (RAG)<n>The pipeline iteratively refines generated artifacts, validates attack success with test cases, and supports complex multi-container setups.
arXiv Detail & Related papers (2025-09-28T19:16:12Z) - SecureAgentBench: Benchmarking Secure Code Generation under Realistic Vulnerability Scenarios [17.276786247873613]
SecureAgentBench is a benchmark of 105 coding tasks designed to rigorously evaluate code agents' capabilities in secure code generation.<n>Results show that (i) current agents struggle to produce secure code, as even the best-performing one, SWE-agent supported by DeepSeek-V3.1, achieves merely 15.2% correct-and-secure solutions.
arXiv Detail & Related papers (2025-09-26T09:18:57Z) - OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety [58.201189860217724]
We introduce OpenAgentSafety, a comprehensive framework for evaluating agent behavior across eight critical risk categories.<n>Unlike prior work, our framework evaluates agents that interact with real tools, including web browsers, code execution environments, file systems, bash shells, and messaging platforms.<n>It combines rule-based analysis with LLM-as-judge assessments to detect both overt and subtle unsafe behaviors.
arXiv Detail & Related papers (2025-07-08T16:18:54Z) - SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks [11.97472024483841]
SEC-bench is the first fully automated benchmarking framework for evaluating large language model (LLM) agents.<n>Our framework automatically creates high-quality software vulnerability datasets with reproducible artifacts at a cost of only $0.87 per instance.<n>A comprehensive evaluation of state-of-the-art LLM code agents reveals significant performance gaps.
arXiv Detail & Related papers (2025-06-13T13:54:30Z) - CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale [45.97598662617568]
We introduce CyberGym, a large-scale benchmark featuring 1,507 real-world vulnerabilities across 188 software projects.<n>We show that CyberGym leads to the discovery of 35 zero-day vulnerabilities and 17 historically incomplete patches.<n>These results underscore that CyberGym is not only a robust benchmark for measuring AI's progress in cybersecurity but also a platform for creating direct, real-world security impact.
arXiv Detail & Related papers (2025-06-03T07:35:14Z) - SafeAgent: Safeguarding LLM Agents via an Automated Risk Simulator [77.86600052899156]
Large Language Model (LLM)-based agents are increasingly deployed in real-world applications.<n>We propose AutoSafe, the first framework that systematically enhances agent safety through fully automated synthetic data generation.<n>We show that AutoSafe boosts safety scores by 45% on average and achieves a 28.91% improvement on real-world tasks.
arXiv Detail & Related papers (2025-05-23T10:56:06Z) - AgentVigil: Generic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents [54.29555239363013]
We propose a generic black-box fuzzing framework, AgentVigil, to automatically discover and exploit indirect prompt injection vulnerabilities.<n>We evaluate AgentVigil on two public benchmarks, AgentDojo and VWA-adv, where it achieves 71% and 70% success rates against agents based on o3-mini and GPT-4o.<n>We apply our attacks in real-world environments, successfully misleading agents to navigate to arbitrary URLs, including malicious sites.
arXiv Detail & Related papers (2025-05-09T07:40:17Z) - Towards Trustworthy GUI Agents: A Survey [64.6445117343499]
This survey examines the trustworthiness of GUI agents in five critical dimensions.<n>We identify major challenges such as vulnerability to adversarial attacks, cascading failure modes in sequential decision-making.<n>As GUI agents become more widespread, establishing robust safety standards and responsible development practices is essential.
arXiv Detail & Related papers (2025-03-30T13:26:00Z) - CVE-Bench: A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities [6.752938800468733]
Large language model (LLM) agents are increasingly capable of autonomously conducting cyberattacks.<n>Existing benchmarks fall short as they are limited to abstracted Capture the Flag competitions or lack comprehensive coverage.<n>We introduce CVE-Bench, a real-world cybersecurity benchmark based on critical-severity Common Vulnerabilities and Exposures.
arXiv Detail & Related papers (2025-03-21T17:32:32Z) - Why Are Web AI Agents More Vulnerable Than Standalone LLMs? A Security Analysis [47.34614558636679]
This study investigates the underlying factors that contribute to the increased vulnerability of Web AI agents.<n>We identify three critical factors that amplify the vulnerability of Web AI agents; (1) embedding user goals into the system prompt, (2) multi-step action generation, and (3) observational capabilities.
arXiv Detail & Related papers (2025-02-27T18:56:26Z) - AutoPT: How Far Are We from the End2End Automated Web Penetration Testing? [54.65079443902714]
We introduce AutoPT, an automated penetration testing agent based on the principle of PSM driven by LLMs.
Our results show that AutoPT outperforms the baseline framework ReAct on the GPT-4o mini model.
arXiv Detail & Related papers (2024-11-02T13:24:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.