Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks
- URL: http://arxiv.org/abs/2512.03262v1
- Date: Tue, 02 Dec 2025 22:11:56 GMT
- Title: Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks
- Authors: Songwen Zhao, Danqing Wang, Kexun Zhang, Jiaxuan Luo, Zhuo Li, Lei Li,
- Abstract summary: Vibe coding is a new programming paradigm in which human engineers instruct large language model (LLM) agents to complete complex coding tasks with little supervision.<n>We present a benchmark consisting of 200 feature-request software engineering tasks from real-world open-source projects, which, when given to human programmers, led to vulnerable implementations.<n>Our findings raise serious concerns about the widespread adoption of vibe-coding, particularly in security-sensitive applications.
- Score: 22.499464760561434
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vibe coding is a new programming paradigm in which human engineers instruct large language model (LLM) agents to complete complex coding tasks with little supervision. Although it is increasingly adopted, are vibe coding outputs really safe to deploy in production? To answer this question, we propose SU S VI B E S, a benchmark consisting of 200 feature-request software engineering tasks from real-world open-source projects, which, when given to human programmers, led to vulnerable implementations. We evaluate multiple widely used coding agents with frontier models on this benchmark. Disturbingly, all agents perform poorly in terms of software security. Although 61% of the solutions from SWE-Agent with Claude 4 Sonnet are functionally correct, only 10.5% are secure. Further experiments demonstrate that preliminary security strategies, such as augmenting the feature request with vulnerability hints, cannot mitigate these security issues. Our findings raise serious concerns about the widespread adoption of vibe-coding, particularly in security-sensitive applications.
Related papers
- RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories [58.32028251925354]
Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, but their proficiency in producing secure code remains a critical, under-explored area.<n>We introduce RealSec-bench, a new benchmark for secure code generation meticulously constructed from real-world, high-risk Java repositories.
arXiv Detail & Related papers (2026-01-30T08:29:01Z) - RedCodeAgent: Automatic Red-teaming Agent against Diverse Code Agents [70.24175620901538]
Code agents have gained widespread adoption due to their strong code generation capabilities and integration with code interpreters.<n>Current static safety benchmarks and red-teaming tools are inadequate for identifying emerging real-world risky scenarios.<n>We propose RedCodeAgent, the first automated red-teaming agent designed to systematically uncover vulnerabilities in diverse code agents.
arXiv Detail & Related papers (2025-10-02T22:59:06Z) - Takedown: How It's Done in Modern Coding Agent Exploits [11.214918024551638]
We present a comprehensive security analysis of eight real-world coding agents.<n>We identify 15 security issues, including previously overlooked or missed issues, that can be abused to compromise the confidentiality and integrity of user systems.
arXiv Detail & Related papers (2025-09-29T03:27:18Z) - A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code [49.009041488527544]
A.S.E is a repository-level evaluation benchmark for assessing the security of AI-generated code.<n>Current large language models (LLMs) still struggle with secure coding.<n>A larger reasoning budget does not necessarily lead to better code generation.
arXiv Detail & Related papers (2025-08-25T15:11:11Z) - OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety [58.201189860217724]
We introduce OpenAgentSafety, a comprehensive framework for evaluating agent behavior across eight critical risk categories.<n>Unlike prior work, our framework evaluates agents that interact with real tools, including web browsers, code execution environments, file systems, bash shells, and messaging platforms.<n>It combines rule-based analysis with LLM-as-judge assessments to detect both overt and subtle unsafe behaviors.
arXiv Detail & Related papers (2025-07-08T16:18:54Z) - LLMs in Coding and their Impact on the Commercial Software Engineering Landscape [0.0]
Large-language-model coding tools are now mainstream in software engineering.<n>But as these same tools move human effort up the development stack, they present fresh dangers.<n>We argue that firms must tag and review every AI-generated line of code.
arXiv Detail & Related papers (2025-06-19T23:43:54Z) - CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale [45.97598662617568]
We introduce CyberGym, a large-scale benchmark featuring 1,507 real-world vulnerabilities across 188 software projects.<n>We show that CyberGym leads to the discovery of 35 zero-day vulnerabilities and 17 historically incomplete patches.<n>These results underscore that CyberGym is not only a robust benchmark for measuring AI's progress in cybersecurity but also a platform for creating direct, real-world security impact.
arXiv Detail & Related papers (2025-06-03T07:35:14Z) - SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents [58.65256663334316]
We present SafeAgentBench -- the first benchmark for safety-aware task planning of embodied LLM agents in interactive simulation environments.<n>SafeAgentBench includes: (1) an executable, diverse, and high-quality dataset of 750 tasks, rigorously curated to cover 10 potential hazards and 3 task types; (2) SafeAgentEnv, a universal embodied environment with a low-level controller, supporting multi-agent execution with 17 high-level actions for 9 state-of-the-art baselines; and (3) reliable evaluation methods from both execution and semantic perspectives.
arXiv Detail & Related papers (2024-12-17T18:55:58Z) - RedCode: Risky Code Execution and Generation Benchmark for Code Agents [50.81206098588923]
RedCode is a benchmark for risky code execution and generation.
RedCode-Exec provides challenging prompts that could lead to risky code execution.
RedCode-Gen provides 160 prompts with function signatures and docstrings as input to assess whether code agents will follow instructions.
arXiv Detail & Related papers (2024-11-12T13:30:06Z) - Evaluating Software Development Agents: Patch Patterns, Code Quality, and Issue Complexity in Real-World GitHub Scenarios [13.949319911378826]
This study evaluated 4,892 patches from 10 top-ranked agents on 500 real-world GitHub issues.<n>No single agent dominated, with 170 issues unresolved, indicating room for improvement.<n>Most agents maintained code reliability and security, avoiding new bugs or vulnerabilities.<n>Some agents increased code complexity, many reduced code duplication and minimized code smells.
arXiv Detail & Related papers (2024-10-16T11:33:57Z) - Is Your AI-Generated Code Really Safe? Evaluating Large Language Models on Secure Code Generation with CodeSecEval [20.959848710829878]
Large language models (LLMs) have brought significant advancements to code generation and code repair.
However, their training using unsanitized data from open-source repositories, like GitHub, raises the risk of inadvertently propagating security vulnerabilities.
We aim to present a comprehensive study aimed at precisely evaluating and enhancing the security aspects of code LLMs.
arXiv Detail & Related papers (2024-07-02T16:13:21Z) - Codexity: Secure AI-assisted Code Generation [11.114499124198268]
We present Codexity, a security-focused code generation framework integrated with five Large Language Models.
Our evaluation in a real-world benchmark with 751 automatically generated vulnerable subjects demonstrates Codexity can prevent 60% of the vulnerabilities being exposed to the software developer.
arXiv Detail & Related papers (2024-05-07T01:11:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.