Related papers: LLMSecEval: A Dataset of Natural Language Prompts for Security Evaluations

LLMSecEval: A Dataset of Natural Language Prompts for Security Evaluations

URL: http://arxiv.org/abs/2303.09384v1
Date: Thu, 16 Mar 2023 15:13:58 GMT
Title: LLMSecEval: A Dataset of Natural Language Prompts for Security Evaluations
Authors: Catherine Tony, Markus Mutas, Nicol\'as E. D\'iaz Ferreyra and Riccardo Scandariato
Abstract summary: Large Language Models (LLMs) like Codex are powerful tools for performing code completion and code generation tasks. These models are capable of generating code snippets from Natural Language (NL) descriptions by learning languages and programming practices from public GitHub repositories. Although LLMs promise an effortless NL-driven deployment of software applications, the security of the code they generate has not been extensively investigated nor documented.
Score: 4.276841620787673
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) like Codex are powerful tools for performing code completion and code generation tasks as they are trained on billions of lines of code from publicly available sources. Moreover, these models are capable of generating code snippets from Natural Language (NL) descriptions by learning languages and programming practices from public GitHub repositories. Although LLMs promise an effortless NL-driven deployment of software applications, the security of the code they generate has not been extensively investigated nor documented. In this work, we present LLMSecEval, a dataset containing 150 NL prompts that can be leveraged for assessing the security performance of such models. Such prompts are NL descriptions of code snippets prone to various security vulnerabilities listed in MITRE's Top 25 Common Weakness Enumeration (CWE) ranking. Each prompt in our dataset comes with a secure implementation example to facilitate comparative evaluations against code produced by LLMs. As a practical application, we show how LLMSecEval can be used for evaluating the security of snippets automatically generated from NL descriptions.

Related papers

TypePilot: Leveraging the Scala Type System for Secure LLM-generated Code [46.747768845221735]
Large language Models (LLMs) have shown remarkable proficiency in code generation tasks across various programming languages.<n>Their outputs often contain subtle but critical vulnerabilities, posing significant risks when deployed in security-sensitive or mission-critical systems.<n>This paper introduces TypePilot, an agentic AI framework designed to enhance the security and robustness of LLM-generated code.
arXiv Detail & Related papers (2025-10-13T08:44:01Z)
The Fools are Certain; the Wise are Doubtful: Exploring LLM Confidence in Code Completion [4.215010577170175]
We evaluate the confidence of Large Language Models (LLMs) when generating code by measuring code perplexity.<n>We find that strongly-typed languages exhibit lower perplexity than dynamically typed languages.<n> Perl appears universally high in perplexity, whereas Java appears low.
arXiv Detail & Related papers (2025-08-22T06:51:13Z)
IFEvalCode: Controlled Code Generation [69.28317223249358]
The paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs.<n>The authors present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages.
arXiv Detail & Related papers (2025-07-30T08:08:48Z)
NoCode-bench: A Benchmark for Evaluating Natural Language-Driven Feature Addition [16.134058143793304]
This work introduces NoCode-bench, a benchmark designed to evaluate large language models (LLMs) on real-world NL-driven feature addition tasks.<n>A subset of 114 high-quality, human-verified instances, NoCode-bench Verified, ensures reliable evaluation.<n>Our experiments reveal that, despite high token usage, the best LLMs achieve a task success rate of only 15.79%, highlighting challenges in cross-file editing, understanding, and tool calling.
arXiv Detail & Related papers (2025-07-24T06:38:19Z)
Security and Quality in LLM-Generated Code: A Multi-Language, Multi-Model Analysis [10.268191178804168]
This paper analyzes the security of code generated by Large Language Models (LLMs) across different programming languages. Our research shows that while LLMs can automate code creation, their security effectiveness varies by language.
arXiv Detail & Related papers (2025-02-03T22:03:13Z)
Leveraging LLMs for Legacy Code Modernization: Challenges and Opportunities for LLM-Generated Documentation [2.249533649156367]
Legacy software systems, written in outdated languages like MUMPS and mainframe assembly, pose challenges in efficiency, maintenance, staffing, and security. This paper investigates the utilization of LLMs to generate documentation for legacy code using two datasets. We propose a prompting strategy for generating line-wise code comments and a rubric to evaluate their completeness, readability, usefulness, and hallucination.
arXiv Detail & Related papers (2024-11-22T14:27:27Z)
VulnLLMEval: A Framework for Evaluating Large Language Models in Software Vulnerability Detection and Patching [0.9208007322096533]
Large Language Models (LLMs) have shown promise in tasks like code translation. This paper introduces VulnLLMEval, a framework designed to assess the performance of LLMs in identifying and patching vulnerabilities in C code. Our study includes 307 real-world vulnerabilities extracted from the Linux kernel.
arXiv Detail & Related papers (2024-09-16T22:00:20Z)
HexaCoder: Secure Code Generation via Oracle-Guided Synthetic Training Data [60.75578581719921]
Large language models (LLMs) have shown great potential for automatic code generation. Recent studies highlight that many LLM-generated code contains serious security vulnerabilities. We introduce HexaCoder, a novel approach to enhance the ability of LLMs to generate secure codes.
arXiv Detail & Related papers (2024-09-10T12:01:43Z)
An Exploratory Study on Fine-Tuning Large Language Models for Secure Code Generation [17.69409515806874]
We present an exploratory study on whether fine-tuning pre-trained LLMs on datasets of vulnerability-fixing commits can promote secure code generation. We crawled a fine-tuning dataset for secure code generation by collecting code fixes of confirmed vulnerabilities from open-source repositories. Our exploration reveals that fine-tuning LLMs can improve secure code generation by 6.4% in C language and 5.4% in C++ language.
arXiv Detail & Related papers (2024-08-17T02:51:27Z)
Can We Trust Large Language Models Generated Code? A Framework for In-Context Learning, Security Patterns, and Code Evaluations Across Diverse LLMs [2.7138982369416866]
Large Language Models (LLMs) have revolutionized automated code generation in software engineering. However, concerns have arisen regarding the security and quality of the generated code. Our research aims to tackle these issues by introducing a framework for secure behavioral learning of LLMs.
arXiv Detail & Related papers (2024-06-18T11:29:34Z)
CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion [117.178835165855]
This paper introduces CodeAttack, a framework that transforms natural language inputs into code inputs. Our studies reveal a new and universal safety vulnerability of these models against code input. We find that a larger distribution gap between CodeAttack and natural language leads to weaker safety generalization.
arXiv Detail & Related papers (2024-03-12T17:55:38Z)
If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code) Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z)
SALLM: Security Assessment of Generated Code [0.5137309756089941]
This paper describes SALLM, a framework to benchmark Large Language Models' abilities to generate secure code systematically. The framework has three major components: a novel dataset of security-centric Python prompts, assessment techniques to evaluate the generated code, and novel metrics to evaluate the models' performance from the perspective of secure code generation.
arXiv Detail & Related papers (2023-11-01T22:46:31Z)
CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model [58.127534002232096]
This paper introduces CodeFuse-13B, an open-sourced pre-trained code LLM. It is specifically designed for code-related tasks with both English and Chinese prompts. CodeFuse achieves its effectiveness by utilizing a high quality pre-training dataset.
arXiv Detail & Related papers (2023-10-10T02:38:44Z)
Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs [59.596335292426105]
This paper collects the first open-source dataset to evaluate safeguards in large language models. We train several BERT-like classifiers to achieve results comparable with GPT-4 on automatic safety evaluation.
arXiv Detail & Related papers (2023-08-25T14:02:12Z)
CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models [58.27254444280376]
Large language models (LLMs) for automatic code generation have achieved breakthroughs in several programming tasks. Training data for these models is usually collected from the Internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities. This unsanitized training data can cause the language models to learn these vulnerabilities and propagate them during the code generation procedure.
arXiv Detail & Related papers (2023-02-08T11:54:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.