CodeLMSec Benchmark: Systematically Evaluating and Finding Security
Vulnerabilities in Black-Box Code Language Models
- URL: http://arxiv.org/abs/2302.04012v2
- Date: Mon, 23 Oct 2023 14:08:09 GMT
- Title: CodeLMSec Benchmark: Systematically Evaluating and Finding Security
Vulnerabilities in Black-Box Code Language Models
- Authors: Hossein Hajipour, Keno Hassler, Thorsten Holz, Lea Schönherr, Mario Fritz
- Abstract summary: Large language models (LLMs) for automatic code generation have achieved breakthroughs in several programming tasks.
Training data for these models is usually collected from the Internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities.
This unsanitized training data can cause the language models to learn these vulnerabilities and propagate them during the code generation procedure.
- Score: 58.27254444280376
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models (LLMs) for automatic code generation have achieved
breakthroughs in several programming tasks. Their advances in competition-level
programming problems have made them an essential pillar of AI-assisted pair
programming, and tools such as GitHub Copilot have emerged as part of the daily
programming workflow used by millions of developers. The training data for
these models is usually collected from the Internet (e.g., from open-source
repositories) and is likely to contain faults and security vulnerabilities.
This unsanitized training data can cause the language models to learn these
vulnerabilities and propagate them during the code generation procedure. While
these models have been extensively assessed for their ability to produce
functionally correct programs, there remains a lack of comprehensive
investigations and benchmarks addressing the security aspects of these models.
In this work, we propose a method to systematically study the security issues
of code language models to assess their susceptibility to generating vulnerable
code. To this end, we introduce the first approach to automatically find
generated code that contains vulnerabilities in black-box code generation
models. To achieve this, we present an approach to approximate inversion of the
black-box code generation models based on few-shot prompting. We evaluate the
effectiveness of our approach by examining code language models in generating
high-risk security weaknesses. Furthermore, we establish a collection of
diverse non-secure prompts for various vulnerability scenarios using our
method. This dataset forms a benchmark for evaluating and comparing the
security weaknesses in code language models.
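To make the described method more concrete, below is a minimal sketch of the idea of approximately inverting a black-box code model via few-shot prompting: known vulnerable snippets are used as few-shot examples so the model proposes candidate prompts, which are kept only if the model's own completions of them are flagged as vulnerable. This is an illustrative sketch, not the authors' implementation; CodeModel, build_inversion_prompt, find_non_secure_prompts, and the is_vulnerable callback (standing in for a CodeQL-style static check) are hypothetical placeholders.

from typing import Callable, List

# A black-box code model: takes a prompt string and returns a completion string.
CodeModel = Callable[[str], str]

def build_inversion_prompt(vulnerable_examples: List[str]) -> str:
    """Few-shot prompt: show known vulnerable snippets so that the model's
    continuation proposes the beginning of another program with a similar weakness."""
    header = "# Examples of code containing a security weakness:\n"
    body = "\n\n".join(vulnerable_examples)
    ask = "\n\n# The beginning of another program with a similar weakness:\n"
    return header + body + ask

def find_non_secure_prompts(model: CodeModel,
                            vulnerable_examples: List[str],
                            is_vulnerable: Callable[[str], bool],
                            n_samples: int = 20) -> List[str]:
    """Sample candidate prompts via few-shot prompting and keep only those whose
    completions are flagged by the security analyzer."""
    confirmed: List[str] = []
    for _ in range(n_samples):
        candidate_prompt = model(build_inversion_prompt(vulnerable_examples))
        completion = model(candidate_prompt)            # second black-box query
        if is_vulnerable(candidate_prompt + completion):
            confirmed.append(candidate_prompt)
    return confirmed

Prompts confirmed in this way would then form a benchmark set for comparing how often different code language models complete them into vulnerable code.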
Related papers
- Security of Language Models for Code: A Systematic Literature Review [22.046891149121812]
Language models for code (CodeLMs) have emerged as powerful tools for code-related tasks.
CodeLMs are susceptible to security vulnerabilities, drawing increasing research attention from domains such as software engineering, artificial intelligence, and cybersecurity.
arXiv Detail & Related papers (2024-10-21T04:27:41Z)
- HexaCoder: Secure Code Generation via Oracle-Guided Synthetic Training Data [60.75578581719921]
Large language models (LLMs) have shown great potential for automatic code generation.
Recent studies highlight that LLM-generated code often contains serious security vulnerabilities.
We introduce HexaCoder, a novel approach to enhance the ability of LLMs to generate secure code.
arXiv Detail & Related papers (2024-09-10T12:01:43Z)
- SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation.
Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z)
- An Exploratory Study on Fine-Tuning Large Language Models for Secure Code Generation [17.69409515806874]
We present an exploratory study on whether fine-tuning pre-trained LLMs on datasets of vulnerability-fixing commits can promote secure code generation.
We crawled a fine-tuning dataset for secure code generation by collecting code fixes of confirmed vulnerabilities from open-source repositories.
Our exploration reveals that fine-tuning LLMs can improve secure code generation by 6.4% for C and by 5.4% for C++.
arXiv Detail & Related papers (2024-08-17T02:51:27Z)
- Is Your AI-Generated Code Really Safe? Evaluating Large Language Models on Secure Code Generation with CodeSecEval [20.959848710829878]
Large language models (LLMs) have brought significant advancements to code generation and code repair.
However, their training using unsanitized data from open-source repositories, like GitHub, raises the risk of inadvertently propagating security vulnerabilities.
We present a comprehensive study aimed at precisely evaluating and enhancing the security aspects of code LLMs.
arXiv Detail & Related papers (2024-07-02T16:13:21Z)
- M2CVD: Enhancing Vulnerability Semantic through Multi-Model Collaboration for Code Vulnerability Detection [52.4455893010468]
Large Language Models (LLMs) have strong capabilities in code comprehension, but fine-tuning costs and semantic alignment issues limit their project-specific optimization.
Code models such as CodeBERT are easy to fine-tune, but they often struggle to learn vulnerability semantics from complex code.
This paper introduces the Multi-Model Collaborative Vulnerability Detection approach (M2CVD) to improve the detection accuracy of code models.
arXiv Detail & Related papers (2024-06-10T00:05:49Z)
- CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion [117.178835165855]
This paper introduces CodeAttack, a framework that transforms natural language inputs into code inputs.
Our studies reveal a new and universal safety vulnerability of these models against code input.
We find that a larger distribution gap between CodeAttack and natural language leads to weaker safety generalization.
arXiv Detail & Related papers (2024-03-12T17:55:38Z)
- SALLM: Security Assessment of Generated Code [0.5137309756089941]
This paper describes SALLM, a framework to benchmark Large Language Models' abilities to generate secure code systematically.
The framework has three major components: a novel dataset of security-centric Python prompts, assessment techniques to evaluate the generated code, and novel metrics to evaluate the models' performance from the perspective of secure code generation.
arXiv Detail & Related papers (2023-11-01T22:46:31Z)
- Enhancing Large Language Models for Secure Code Generation: A Dataset-driven Study on Vulnerability Mitigation [24.668682498171776]
Large language models (LLMs) have brought significant advancements to code generation, benefiting both novice and experienced developers.
However, their training using unsanitized data from open-source repositories, like GitHub, introduces the risk of inadvertently propagating security vulnerabilities.
This paper presents a comprehensive study focused on evaluating and enhancing code LLMs from a software security perspective.
arXiv Detail & Related papers (2023-10-25T00:32:56Z)
- L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs).
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z)