A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code
- URL: http://arxiv.org/abs/2508.18106v3
- Date: Thu, 18 Sep 2025 15:18:10 GMT
- Title: A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code
- Authors: Keke Lian, Bin Wang, Lei Zhang, Libo Chen, Junjie Wang, Ziming Zhao, Yujiu Yang, Miaoqian Lin, Haotong Duan, Haoran Zhao, Shuang Liao, Mingda Guo, Jiazheng Quan, Yilu Zhong, Chenhao He, Zichuan Chen, Jie Wu, Haoling Li, Zhaoxuan Li, Jiongchi Yu, Hui Li, Dong Zhang,
- Abstract summary: A.S.E is a repository-level evaluation benchmark for assessing the security of AI-generated code. Current large language models (LLMs) still struggle with secure coding. A larger reasoning budget does not necessarily lead to better code generation.
- Score: 49.009041488527544
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The increasing adoption of large language models (LLMs) in software engineering necessitates rigorous security evaluation of their generated code. However, existing benchmarks often lack relevance to real-world AI-assisted programming scenarios, making them inadequate for assessing the practical security risks associated with AI-generated code in production environments. To address this gap, we introduce A.S.E (AI Code Generation Security Evaluation), a repository-level evaluation benchmark designed to closely mirror real-world AI programming tasks, offering a comprehensive and reliable framework for assessing the security of AI-generated code. Our evaluation of leading LLMs on A.S.E reveals several key findings. In particular, current LLMs still struggle with secure coding. The complexity in repository-level scenarios presents challenges for LLMs that typically perform well on snippet-level tasks. Moreover, a larger reasoning budget does not necessarily lead to better code generation. These observations offer valuable insights into the current state of AI code generation and help developers identify the most suitable models for practical tasks. They also lay the groundwork for refining LLMs to generate secure and efficient code in real-world applications.
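The abstract does not describe A.S.E's scoring pipeline in detail, but repository-level security benchmarks of this kind are typically driven by static analysis over the generated code. The sketch below is a minimal illustration of that idea, not the A.S.E implementation: it assumes the Bandit CLI is installed (`pip install bandit`), and the target path and severity-based summary are hypothetical.

```python
# Illustrative sketch (not the A.S.E pipeline): flag security findings in
# LLM-generated Python code with the Bandit static analyzer.
import json
import subprocess
import sys

def security_findings(path: str) -> list[dict]:
    """Run Bandit recursively over a directory and return its JSON findings."""
    proc = subprocess.run(
        ["bandit", "-r", path, "-f", "json", "-q"],
        capture_output=True, text=True,
    )
    report = json.loads(proc.stdout or "{}")
    return report.get("results", [])

if __name__ == "__main__":
    # Hypothetical location of code produced by the model under evaluation.
    target = sys.argv[1] if len(sys.argv) > 1 else "generated_repo/"
    findings = security_findings(target)
    high = [f for f in findings if f.get("issue_severity") == "HIGH"]
    print(f"{len(findings)} findings total, {len(high)} high severity")
```

Such a check only covers what the analyzer's rules can detect; repository-level benchmarks like A.S.E additionally have to account for cross-file context, which is what snippet-level benchmarks miss.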
Related papers
- RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories [58.32028251925354]
Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, but their proficiency in producing secure code remains a critical, under-explored area. We introduce RealSec-bench, a new benchmark for secure code generation meticulously constructed from real-world, high-risk Java repositories.
arXiv Detail & Related papers (2026-01-30T08:29:01Z) - MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks [56.34018316319873]
We propose MERA Code, a benchmark for evaluating code produced by the latest code generation LLMs in Russian. This benchmark includes 11 evaluation tasks that span 8 programming languages. We evaluate open LLMs and frontier API models, analyzing their limitations in terms of practical coding tasks in non-English languages.
arXiv Detail & Related papers (2025-07-16T14:31:33Z) - Are AI-Generated Fixes Secure? Analyzing LLM and Agent Patches on SWE-bench [9.229310642804036]
We present the first large-scale security analysis of LLM-generated patches using 20,000+ issues from the SWE-bench dataset. We evaluate patches produced by a standalone LLM (Llama 3.3) and compare them to developer-written patches. We also assess the security of patches generated by three top-performing agentic frameworks (OpenHands, AutoCodeRover, HoneyComb) on a subset of our data.
arXiv Detail & Related papers (2025-06-30T21:10:19Z) - SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks [11.97472024483841]
SEC-bench is the first fully automated benchmarking framework for evaluating large language model (LLM) agents. Our framework automatically creates high-quality software vulnerability datasets with reproducible artifacts at a cost of only $0.87 per instance. A comprehensive evaluation of state-of-the-art LLM code agents reveals significant performance gaps.
arXiv Detail & Related papers (2025-06-13T13:54:30Z) - SafeGenBench: A Benchmark Framework for Security Vulnerability Detection in LLM-Generated Code [7.209766132478914]
We introduce SafeGenBench, a benchmark specifically designed to assess the security of LLM-generated code. The dataset encompasses a wide range of common software development scenarios and vulnerability types. Through the empirical evaluation of state-of-the-art LLMs on SafeGenBench, we reveal notable deficiencies in their ability to produce vulnerability-free code.
arXiv Detail & Related papers (2025-06-06T02:48:02Z) - Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality. We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
arXiv Detail & Related papers (2025-05-28T17:57:47Z) - Can You Really Trust Code Copilots? Evaluating Large Language Models from a Code Security Perspective [19.345433857645016]
CoV-Eval is a multi-task benchmark covering various tasks such as code completion, vulnerability repair, vulnerability detection and classification. VC-Judge is an improved judgment model that aligns closely with human experts and can review LLM-generated programs for vulnerabilities.
arXiv Detail & Related papers (2025-05-15T16:53:41Z) - The Hidden Risks of LLM-Generated Web Application Code: A Security-Centric Evaluation of Code Generation Capabilities in Large Language Models [0.769672852567215]
This paper uses predefined security parameters to evaluate the security compliance of LLM-generated code across multiple models. The analysis reveals critical vulnerabilities in authentication mechanisms, session management, input validation and HTTP security headers. Our findings underscore that human expertise is crucial to ensure secure software deployment or review of LLM-generated code.
arXiv Detail & Related papers (2025-04-29T10:23:11Z) - CWEval: Outcome-driven Evaluation on Functionality and Security of LLM Code Generation [20.72188827088484]
Large Language Models (LLMs) have significantly aided developers by generating or assisting in code writing. However, detecting vulnerabilities in functionally correct code is more challenging, especially for developers with limited security knowledge. We introduce CWEval, a novel outcome-driven evaluation framework designed to enhance the evaluation of secure code generation by LLMs.
arXiv Detail & Related papers (2025-01-14T15:27:01Z) - SeCodePLT: A Unified Platform for Evaluating the Security of Code GenAI [58.29510889419971]
Existing benchmarks for evaluating the security risks and capabilities of code-generating large language models (LLMs) face several key limitations. We introduce a general and scalable benchmark construction framework that begins with manually validated, high-quality seed examples and expands them via targeted mutations. Applying this framework to Python, C/C++, and Java, we build SeCodePLT, a dataset of more than 5.9k samples spanning 44 CWE-based risk categories and three security capabilities.
arXiv Detail & Related papers (2024-10-14T21:17:22Z) - Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? [60.84912551069379]
We present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework.
Codev-Agent is an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage.
arXiv Detail & Related papers (2024-10-02T09:11:10Z) - CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion [117.178835165855]
This paper introduces CodeAttack, a framework that transforms natural language inputs into code inputs.
Our studies reveal a new and universal safety vulnerability of these models against code input.
We find that a larger distribution gap between CodeAttack and natural language leads to weaker safety generalization.
arXiv Detail & Related papers (2024-03-12T17:55:38Z) - SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models [107.82336341926134]
SALAD-Bench is a safety benchmark specifically designed for evaluating Large Language Models (LLMs)
It transcends conventional benchmarks through its large scale, rich diversity, intricate taxonomy spanning three levels, and versatile functionalities.
arXiv Detail & Related papers (2024-02-07T17:33:54Z) - SALLM: Security Assessment of Generated Code [0.5137309756089941]
This paper describes SALLM, a framework to benchmark Large Language Models' abilities to generate secure code systematically.
The framework has three major components: a novel dataset of security-centric Python prompts, assessment techniques to evaluate the generated code, and novel metrics to evaluate the models' performance from the perspective of secure code generation.
arXiv Detail & Related papers (2023-11-01T22:46:31Z) - Evaluating Model-free Reinforcement Learning toward Safety-critical Tasks [70.76757529955577]
This paper revisits prior work in this scope from the perspective of state-wise safe RL.
We propose Unrolling Safety Layer (USL), a joint method that combines safety optimization and safety projection.
To facilitate further research in this area, we reproduce related algorithms in a unified pipeline and incorporate them into SafeRL-Kit.
arXiv Detail & Related papers (2022-12-12T06:30:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.