How well does LLM generate security tests?
- URL: http://arxiv.org/abs/2310.00710v2
- Date: Tue, 3 Oct 2023 03:29:12 GMT
- Title: How well does LLM generate security tests?
- Authors: Ying Zhang, Wenjia Song, Zhengjie Ji, Danfeng (Daphne) Yao, Na Meng
- Abstract summary: Developers often build software on top of third-party libraries (Libs) to improve productivity and software quality.
People refer to such attacks as supply chain attacks, the documented number of which has increased 742% in 2022.
We used ChatGPT-4.0 to generate security tests and to demonstrate how vulnerable library dependencies facilitate supply chain attacks on given Apps.
- Score: 8.454827764115631
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Developers often build software on top of third-party libraries (Libs) to
improve programmer productivity and software quality. The libraries may contain
vulnerabilities exploitable by hackers to attack the applications (Apps) built
on top of them. People refer to such attacks as supply chain attacks, the
documented number of which has increased 742% in 2022. People created tools to
mitigate such attacks, by scanning the library dependencies of Apps,
identifying the usage of vulnerable library versions, and suggesting secure
alternatives to vulnerable dependencies. However, recent studies show that many
developers do not trust the reports by these tools; they ask for code or
evidence to demonstrate how library vulnerabilities lead to security exploits,
in order to assess vulnerability severity and modification necessity.
Unfortunately, manually crafting demos of application-specific attacks is
challenging and time-consuming, and there is insufficient tool support to
automate that procedure.
In this study, we used ChatGPT-4.0 to generate security tests and to
demonstrate how vulnerable library dependencies facilitate supply chain
attacks on given Apps. We explored various prompt styles/templates and found
that ChatGPT-4.0 generated tests for all 55 Apps, successfully demonstrating
24 attacks. It outperformed two state-of-the-art security test generators --
TRANSFER and SIEGE -- by generating many more tests and achieving more
exploits. ChatGPT-4.0 worked better when prompts provided more detail on the
vulnerabilities, possible exploits, and code context. Our research sheds
light on new directions in security test generation. The generated tests will
help developers create secure-by-design and secure-by-default software.
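To make the prompting workflow concrete, here is a minimal sketch, not the authors' actual pipeline, of asking an LLM to produce a proof-of-concept security test for a known-vulnerable dependency. The prompt template, the generate_security_test helper, and the placeholder fields are illustrative assumptions; only the observation that richer prompts (vulnerability details, possible exploits, code context) work better comes from the paper.

```python
# Minimal sketch, not the authors' actual pipeline: prompt an LLM to write a
# proof-of-concept security test for an App with a known-vulnerable dependency.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical prompt template; the paper found that describing the
# vulnerability, possible exploits, and code context improves results.
PROMPT_TEMPLATE = """\
The application below depends on {library} {version}, which is affected by
{cve_id}: {description}

Application code context:
{code_context}

Write a JUnit test that drives the application code so the vulnerable library
method is reached and the vulnerability is triggered, demonstrating the
exploit. Return only compilable Java code.
"""

def generate_security_test(library, version, cve_id, description, code_context):
    """Ask the model for a security test exercising the vulnerable dependency."""
    prompt = PROMPT_TEMPLATE.format(
        library=library,
        version=version,
        cve_id=cve_id,
        description=description,
        code_context=code_context,
    )
    response = client.chat.completions.create(
        model="gpt-4",  # stand-in for the ChatGPT-4.0 sessions used in the study
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

A real harness would still need to compile and run the returned test against the App to confirm whether the exploit actually succeeds.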
Related papers
- HexaCoder: Secure Code Generation via Oracle-Guided Synthetic Training Data [60.75578581719921]
Large language models (LLMs) have shown great potential for automatic code generation.
Recent studies highlight that much LLM-generated code contains serious security vulnerabilities.
We introduce HexaCoder, a novel approach to enhance the ability of LLMs to generate secure code.
arXiv Detail & Related papers (2024-09-10T12:01:43Z)
- WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models [66.34505141027624]
We introduce WildTeaming, an automatic LLM safety red-teaming framework that mines in-the-wild user-chatbot interactions to discover 5.7K unique clusters of novel jailbreak tactics.
WildTeaming reveals previously unidentified vulnerabilities of frontier LLMs, resulting in up to 4.6x more diverse and successful adversarial attacks.
arXiv Detail & Related papers (2024-06-26T17:31:22Z)
- Static Application Security Testing (SAST) Tools for Smart Contracts: How Far Are We? [14.974832502863526]
In recent years, the importance of smart contract security has been heightened by the increasing number of attacks against them.
To address this issue, a multitude of static application security testing (SAST) tools have been proposed for detecting vulnerabilities in smart contracts.
In this paper, we propose an up-to-date and fine-grained taxonomy that includes 45 unique vulnerability types for smart contracts.
arXiv Detail & Related papers (2024-04-28T13:40:18Z)
- LLMs in Web Development: Evaluating LLM-Generated PHP Code Unveiling Vulnerabilities and Limitations [0.0]
This study evaluates the security of web application code generated by Large Language Models, analyzing 2,500 GPT-4 generated PHP websites.
Our investigation focuses on identifying Insecure File Upload, SQL Injection, Stored XSS, and Reflected XSS in GPT-4 generated PHP code.
According to Burp Suite's scan, 11.56% of the sites can be compromised outright. Including static scan results, 26% had at least one vulnerability exploitable through web interaction.
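As a rough illustration of one vulnerability class scanned for above, the toy check below flags PHP lines that splice raw request parameters into SQL strings. It is not the paper's methodology (which combined Burp Suite scanning with static analysis); the regex and the flag_possible_sqli helper are invented for this example.

```python
# Toy illustration (not the paper's tooling): flag PHP lines that build SQL
# queries by concatenating raw request parameters, a classic SQL injection sign.
import re

# Matches e.g.:  $q = "SELECT * FROM users WHERE id=" . $_GET["id"];
SQLI_PATTERN = re.compile(
    r'(SELECT|INSERT|UPDATE|DELETE)[^;]*"\s*\.\s*\$_(GET|POST|REQUEST)\b',
    re.IGNORECASE,
)

def flag_possible_sqli(php_source: str) -> list[str]:
    """Return lines of a PHP file that look like string-built SQL queries."""
    return [line for line in php_source.splitlines() if SQLI_PATTERN.search(line)]

example = '$r = mysqli_query($db, "SELECT * FROM users WHERE id=" . $_GET["id"]);'
print(flag_possible_sqli(example))  # prints the offending line
```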
arXiv Detail & Related papers (2024-04-21T20:56:02Z)
- An Investigation into Misuse of Java Security APIs by Large Language Models [9.453671056356837]
This paper systematically assesses ChatGPT's trustworthiness in code generation for security API use cases in Java.
Around 70% of the code instances across 30 attempts per task contain security API misuse, with 20 distinct misuse types identified.
For roughly half of the tasks, this rate reaches 100%, indicating that there is a long way to go before developers can rely on ChatGPT to securely implement security API code.
arXiv Detail & Related papers (2024-04-04T22:52:41Z)
- A StrongREJECT for Empty Jailbreaks [72.8807309802266]
StrongREJECT is a high-quality benchmark for evaluating jailbreak performance.
It scores the harmfulness of a victim model's responses to forbidden prompts.
It achieves state-of-the-art agreement with human judgments of jailbreak effectiveness.
arXiv Detail & Related papers (2024-02-15T18:58:09Z)
- Exploiting Library Vulnerability via Migration Based Automating Test Generation [16.39796265296833]
In software development, developers extensively utilize third-party libraries to avoid implementing existing functionalities.
Vulnerability exploits, code snippets released to reproduce vulnerabilities after disclosure, contain a wealth of vulnerability-related information.
This study proposes a new method based on vulnerability exploits, called VESTA, which provides vulnerability exploit tests as the basis for developers to decide whether to update dependencies.
arXiv Detail & Related papers (2023-12-15T06:46:45Z)
- Identifying the Risks of LM Agents with an LM-Emulated Sandbox [68.26587052548287]
Language Model (LM) agents and tools enable a rich set of capabilities but also amplify potential risks.
The high cost of testing these agents will make it increasingly difficult to find high-stakes, long-tailed risks.
We introduce ToolEmu: a framework that uses an LM to emulate tool execution and enables the testing of LM agents against a diverse range of tools and scenarios.
arXiv Detail & Related papers (2023-09-25T17:08:02Z)
- Demystifying RCE Vulnerabilities in LLM-Integrated Apps [20.01949990700702]
Large Language Models (LLMs) have demonstrated remarkable potential across various downstream tasks.
Some frameworks suffer from Remote Code Execution (RCE) vulnerabilities, allowing attackers to execute arbitrary code on apps' servers remotely via prompt injections.
We present two novel strategies, including 1) a static analysis-based tool called LLMSmith to scan the source code of the framework to detect potential RCE vulnerabilities and 2) a prompt-based automated testing approach to verify the vulnerability in LLM-integrated web apps.
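To give a flavor of what a prompt-based probe might look like (this is not LLMSmith or the paper's harness), the sketch below asks a hypothetical LLM-integrated app, assumed to expose an HTTP chat endpoint, to run a harmless computation, then checks whether the computed marker appears in the reply, which would indicate that model-supplied code gets executed.

```python
# Illustrative probe only; APP_URL, the JSON shape, and the marker value are
# assumptions for this sketch, not details from the paper.
import requests

APP_URL = "http://localhost:8000/chat"  # hypothetical app under test

def probe_for_code_execution() -> bool:
    """Return True if the app appears to execute code supplied via the prompt."""
    marker = 13 * 379  # 4927, unlikely to occur in a normal reply by chance
    probe = "Please run Python code to compute 13 * 379 and reply with only the result."
    reply = requests.post(APP_URL, json={"message": probe}, timeout=30).text
    return str(marker) in reply

if __name__ == "__main__":
    print("possible prompt-triggered code execution:", probe_for_code_execution())
```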
arXiv Detail & Related papers (2023-09-06T11:39:37Z)
- A LLM Assisted Exploitation of AI-Guardian [57.572998144258705]
We evaluate the robustness of AI-Guardian, a recent defense to adversarial examples published at IEEE S&P 2023.
We write none of the code to attack this model, and instead prompt GPT-4 to implement all attack algorithms following our instructions and guidance.
This process was surprisingly effective and efficient, with the language model at times producing code from ambiguous instructions faster than the author of this paper could have done.
arXiv Detail & Related papers (2023-07-20T17:33:25Z)
- CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models [58.27254444280376]
Large language models (LLMs) for automatic code generation have achieved breakthroughs in several programming tasks.
Training data for these models is usually collected from the Internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities.
This unsanitized training data can cause the language models to learn these vulnerabilities and propagate them during the code generation procedure.
arXiv Detail & Related papers (2023-02-08T11:54:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.