Can You Really Trust Code Copilots? Evaluating Large Language Models from a Code Security Perspective
- URL: http://arxiv.org/abs/2505.10494v1
- Date: Thu, 15 May 2025 16:53:41 GMT
- Title: Can You Really Trust Code Copilots? Evaluating Large Language Models from a Code Security Perspective
- Authors: Yutao Mou, Xiao Deng, Yuxiao Luo, Shikun Zhang, Wei Ye
- Abstract summary: CoV-Eval is a multi-task benchmark covering various tasks such as code completion, vulnerability repair, vulnerability detection and classification. VC-Judge is an improved judgment model that aligns closely with human experts and can review LLM-generated programs for vulnerabilities.
- Score: 19.345433857645016
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Code security and usability are both essential for coding assistant applications driven by large language models (LLMs). Current code security benchmarks focus solely on a single evaluation task and paradigm, such as code completion and generation, and lack comprehensive assessment across dimensions like secure code generation, vulnerability repair, and vulnerability discrimination. In this paper, we first propose CoV-Eval, a multi-task benchmark covering tasks such as code completion, vulnerability repair, and vulnerability detection and classification, for a comprehensive evaluation of LLM code security. In addition, we develop VC-Judge, an improved judgment model that aligns closely with human experts and can review LLM-generated programs for vulnerabilities more efficiently and reliably. We conduct a comprehensive evaluation of 20 proprietary and open-source LLMs. Overall, while most LLMs identify vulnerable code well, they still tend to generate insecure code and struggle with recognizing specific vulnerability types and performing repairs. Extensive experiments and qualitative analyses reveal key challenges and optimization directions, offering insights for future research in LLM code security.
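As a rough illustration of what a multi-task evaluation in the spirit of CoV-Eval involves, the sketch below loops over hypothetical task records (completion, detection, classification, repair) and routes each to a task-specific prompt and scorer. The record fields, prompt templates, and `judge_secure` function are illustrative assumptions, not the authors' released harness.

```python
# Hypothetical sketch of a multi-task code-security evaluation loop.
# Task names follow the abstract; field names, prompts, and the judge
# are illustrative assumptions rather than the paper's actual benchmark.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Sample:
    task: str          # "completion" | "detection" | "classification" | "repair"
    prompt: str        # task input shown to the model
    reference: str     # expected label or reference answer, where applicable

PROMPTS = {
    "completion": "Complete the following code securely:\n{prompt}",
    "detection": "Is the following code vulnerable? Answer yes or no.\n{prompt}",
    "classification": "Name the CWE category of the vulnerability in:\n{prompt}",
    "repair": "Rewrite the following code to remove the vulnerability:\n{prompt}",
}

def evaluate(samples: list[Sample],
             generate: Callable[[str], str],
             judge_secure: Callable[[str], bool]) -> dict[str, float]:
    """Return a per-task accuracy / secure-rate for a model's `generate` fn."""
    totals: dict[str, list[int]] = {}
    for s in samples:
        output = generate(PROMPTS[s.task].format(prompt=s.prompt))
        if s.task in ("completion", "repair"):
            ok = judge_secure(output)            # e.g. a judge model or analyzer
        else:
            ok = s.reference.lower() in output.lower()
        totals.setdefault(s.task, []).append(int(ok))
    return {task: sum(v) / len(v) for task, v in totals.items()}
```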
Related papers
- Guiding AI to Fix Its Own Flaws: An Empirical Study on LLM-Driven Secure Code Generation [16.29310628754089]
Large Language Models (LLMs) have become powerful tools for automated code generation. However, LLMs often overlook critical security practices, which can result in the generation of insecure code. This paper examines their inherent tendency to produce insecure code, their capability to generate secure code when guided by self-generated vulnerability hints, and their effectiveness in repairing vulnerabilities when provided with different levels of feedback.
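A hedged sketch of the kind of feedback loop this study examines: generate code, ask the same model for a vulnerability hint about its own output, then regenerate with that hint. The prompt wording and the `generate` interface are placeholders, not the paper's exact protocol.

```python
# Illustrative self-hint repair loop; prompts and the `generate` callable
# are placeholders, not the study's exact protocol.
from typing import Callable

def generate_with_self_hints(task: str,
                             generate: Callable[[str], str],
                             rounds: int = 2) -> str:
    code = generate(f"Write code for the following task:\n{task}")
    for _ in range(rounds):
        hint = generate(
            "List any security weaknesses (e.g. CWE categories) in this code, "
            "or say 'none':\n" + code
        )
        if "none" in hint.lower():
            break
        code = generate(
            f"Task:\n{task}\n\nPrevious code:\n{code}\n\n"
            f"Security feedback:\n{hint}\n\nRewrite the code to fix these issues."
        )
    return code
```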
arXiv Detail & Related papers (2025-06-28T23:24:33Z)
- SafeGenBench: A Benchmark Framework for Security Vulnerability Detection in LLM-Generated Code [7.209766132478914]
We introduce SafeGenBench, a benchmark specifically designed to assess the security of LLM-generated code. The dataset encompasses a wide range of common software development scenarios and vulnerability types. Through the empirical evaluation of state-of-the-art LLMs on SafeGenBench, we reveal notable deficiencies in their ability to produce vulnerability-free code.
arXiv Detail & Related papers (2025-06-06T02:48:02Z)
- Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality. We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
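The reward signal such a framework needs can be pictured as a combination of functional checks and program-analysis findings. The sketch below is a hypothetical reward function (the analyzer interface, weights, and penalty are assumptions), not the REAL implementation.

```python
# Hypothetical reward combining unit-test results with static-analysis
# findings; the analyzer interface and penalty weight are assumptions.
from typing import Callable

def quality_reward(code: str,
                   run_tests: Callable[[str], float],   # fraction of tests passed
                   analyze: Callable[[str], int],       # count of analyzer findings
                   penalty_per_finding: float = 0.2) -> float:
    functional = run_tests(code)                  # 0.0 .. 1.0
    penalty = penalty_per_finding * analyze(code)
    return max(0.0, functional - penalty)         # scalar reward for policy updates
```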
arXiv Detail & Related papers (2025-05-28T17:57:47Z)
- The Hidden Risks of LLM-Generated Web Application Code: A Security-Centric Evaluation of Code Generation Capabilities in Large Language Models [0.769672852567215]
This paper uses predefined security parameters to evaluate the security compliance of LLM-generated code across multiple models. The analysis reveals critical vulnerabilities in authentication mechanisms, session management, input validation, and HTTP security headers. Our findings underscore that human expertise remains crucial for reviewing LLM-generated code and ensuring secure software deployment.
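To make the listed vulnerability classes concrete, here is a small, hedged Python/Flask example contrasting an injectable query with a parameterized one and adding common HTTP security headers; it illustrates the categories named above rather than the paper's actual test cases.

```python
# Illustrative only: contrasts an injectable SQL query with a parameterized
# one and sets common HTTP security headers; not taken from the paper.
import sqlite3
from flask import Flask

app = Flask(__name__)

def find_user_insecure(db: sqlite3.Connection, name: str):
    # Vulnerable: attacker-controlled `name` is concatenated into the SQL string.
    return db.execute(f"SELECT * FROM users WHERE name = '{name}'").fetchall()

def find_user_secure(db: sqlite3.Connection, name: str):
    # Safer: input reaches the query only as a bound parameter.
    return db.execute("SELECT * FROM users WHERE name = ?", (name,)).fetchall()

@app.after_request
def set_security_headers(response):
    # Headers frequently missing from generated web code.
    response.headers["X-Content-Type-Options"] = "nosniff"
    response.headers["X-Frame-Options"] = "DENY"
    response.headers["Strict-Transport-Security"] = "max-age=31536000"
    return response
```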
arXiv Detail & Related papers (2025-04-29T10:23:11Z)
- CWEval: Outcome-driven Evaluation on Functionality and Security of LLM Code Generation [20.72188827088484]
Large Language Models (LLMs) have significantly aided developers by generating or assisting in code writing. However, detecting vulnerabilities in functionally correct code is more challenging, especially for developers with limited security knowledge. We introduce CWEval, a novel outcome-driven evaluation framework designed to enhance the evaluation of secure code generation by LLMs.
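The "outcome-driven" idea can be read as requiring a generated program to pass both functional tests and a security check before it counts as correct. The sketch below shows that dual check in a hypothetical form; the check functions are assumptions, not CWEval's released oracles.

```python
# Hypothetical outcome-driven check: a sample counts as correct only if it
# is both functionally correct and secure. The check functions are assumptions.
from typing import Callable

def outcome_score(code: str,
                  functional_tests: Callable[[str], bool],
                  security_tests: Callable[[str], bool]) -> dict[str, bool]:
    func_ok = functional_tests(code)   # e.g. ordinary unit tests
    sec_ok = security_tests(code)      # e.g. exploit-style inputs must fail safely
    return {
        "functional": func_ok,
        "secure": sec_ok,
        "func_and_secure": func_ok and sec_ok,   # the outcome-driven criterion
    }
```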
arXiv Detail & Related papers (2025-01-14T15:27:01Z)
- SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models [75.67623347512368]
We propose SafeBench, a comprehensive framework designed for conducting safety evaluations of MLLMs.
Our framework consists of a comprehensive harmful query dataset and an automated evaluation protocol.
Based on our framework, we conducted large-scale experiments on 15 widely-used open-source MLLMs and 6 commercial MLLMs.
arXiv Detail & Related papers (2024-10-24T17:14:40Z)
- HexaCoder: Secure Code Generation via Oracle-Guided Synthetic Training Data [60.75578581719921]
Large language models (LLMs) have shown great potential for automatic code generation.
Recent studies highlight that much LLM-generated code contains serious security vulnerabilities.
We introduce HexaCoder, a novel approach to enhance the ability of LLMs to generate secure code.
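One way to picture "oracle-guided synthetic training data" is a loop that generates candidate programs, keeps only those a security oracle accepts (or repairs and re-checks the rest), and uses the survivors as fine-tuning pairs. The sketch below captures that general idea under assumed interfaces; it is not HexaCoder's pipeline.

```python
# Hedged sketch of oracle-guided data synthesis: keep (or repair) candidates
# until a security oracle accepts them. All interfaces are assumptions.
from typing import Callable

def build_secure_dataset(prompts: list[str],
                         generate: Callable[[str], str],
                         oracle_is_secure: Callable[[str], bool],
                         repair: Callable[[str], str],
                         max_repairs: int = 2) -> list[tuple[str, str]]:
    dataset = []
    for prompt in prompts:
        code = generate(prompt)
        attempts = 0
        while not oracle_is_secure(code) and attempts < max_repairs:
            code = repair(code)               # e.g. ask a model to fix oracle findings
            attempts += 1
        if oracle_is_secure(code):
            dataset.append((prompt, code))    # training pair for fine-tuning
    return dataset
```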
arXiv Detail & Related papers (2024-09-10T12:01:43Z)
- Is Your AI-Generated Code Really Safe? Evaluating Large Language Models on Secure Code Generation with CodeSecEval [20.959848710829878]
Large language models (LLMs) have brought significant advancements to code generation and code repair.
However, their training using unsanitized data from open-source repositories, like GitHub, raises the risk of inadvertently propagating security vulnerabilities.
We present a comprehensive study aimed at precisely evaluating and enhancing the security aspects of code LLMs.
arXiv Detail & Related papers (2024-07-02T16:13:21Z)
- INDICT: Code Generation with Internal Dialogues of Critiques for Both Security and Helpfulness [110.6921470281479]
We introduce INDICT: a new framework that empowers large language models with Internal Dialogues of Critiques for both safety and helpfulness guidance.
The internal dialogue is a dual cooperative system between a safety-driven critic and a helpfulness-driven critic.
We observe that our approach provides an advanced level of critique of both safety and helpfulness, significantly improving the quality of the output code.
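A hedged sketch of a dual-critic loop in the spirit of the internal dialogue described above: a safety critic and a helpfulness critic each comment on a draft, and the generator revises with both critiques. The prompts and stopping rule are illustrative assumptions, not INDICT's actual procedure.

```python
# Illustrative dual-critic revision loop; prompts and the fixed number of
# rounds are assumptions, not INDICT's actual procedure.
from typing import Callable

def dual_critic_generate(task: str,
                         generate: Callable[[str], str],
                         rounds: int = 2) -> str:
    code = generate(f"Write code for:\n{task}")
    for _ in range(rounds):
        safety = generate("Critique this code for security/safety issues:\n" + code)
        helpful = generate(
            f"Critique whether this code fully solves the task.\nTask:\n{task}\nCode:\n{code}"
        )
        code = generate(
            f"Task:\n{task}\nCurrent code:\n{code}\n"
            f"Safety critique:\n{safety}\nHelpfulness critique:\n{helpful}\n"
            "Revise the code to address both critiques."
        )
    return code
```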
arXiv Detail & Related papers (2024-06-23T15:55:07Z)
- CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion [117.178835165855]
This paper introduces CodeAttack, a framework that transforms natural language inputs into code inputs.
Our studies reveal a new and universal safety vulnerability of these models against code input.
We find that a larger distribution gap between CodeAttack and natural language leads to weaker safety generalization.
arXiv Detail & Related papers (2024-03-12T17:55:38Z)
- InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models [56.723509505549536]
To our knowledge, InfiBench is the first large-scale free-form question-answering (QA) benchmark for code.
It comprises 234 carefully selected high-quality Stack Overflow questions that span 15 programming languages.
We conduct a systematic evaluation of more than 100 recent code LLMs on InfiBench, leading to a series of novel and insightful findings.
arXiv Detail & Related papers (2024-03-11T02:06:30Z)
- SALLM: Security Assessment of Generated Code [0.5137309756089941]
This paper describes SALLM, a framework for systematically benchmarking Large Language Models' ability to generate secure code.
The framework has three major components: a novel dataset of security-centric Python prompts, assessment techniques to evaluate the generated code, and novel metrics to evaluate the models' performance from the perspective of secure code generation.
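The abstract mentions novel metrics for secure code generation; a natural form is a secure@k-style estimator analogous to pass@k, shown below as a hedged sketch (SALLM's exact metric definitions may differ).

```python
# secure@k-style estimator in the spirit of pass@k: given n generations per
# prompt, of which c were judged secure, estimate the probability that at
# least one of k sampled generations is secure. A hedged sketch only;
# SALLM's exact metric definitions may differ.
from math import comb

def secure_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per prompt, 4 judged secure, k = 1
print(round(secure_at_k(10, 4, 1), 2))  # 0.4
```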
arXiv Detail & Related papers (2023-11-01T22:46:31Z)
- CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models [58.27254444280376]
Large language models (LLMs) for automatic code generation have achieved breakthroughs in several programming tasks.
Training data for these models is usually collected from the Internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities.
This unsanitized training data can cause the language models to learn these vulnerabilities and propagate them during the code generation procedure.
arXiv Detail & Related papers (2023-02-08T11:54:07Z)