LLM4Vuln: A Unified Evaluation Framework for Decoupling and Enhancing
LLMs' Vulnerability Reasoning
- URL: http://arxiv.org/abs/2401.16185v1
- Date: Mon, 29 Jan 2024 14:32:27 GMT
- Title: LLM4Vuln: A Unified Evaluation Framework for Decoupling and Enhancing
LLMs' Vulnerability Reasoning
- Authors: Yuqiang Sun and Daoyuan Wu and Yue Xue and Han Liu and Wei Ma and
Lyuye Zhang and Miaolei Shi and Yang Liu
- Abstract summary: Large language models (LLMs) have demonstrated significant potential for many downstream tasks, including vulnerability detection.
Recent attempts to use LLMs for vulnerability detection are preliminary, as they lack an in-depth understanding of a subject LLM's vulnerability reasoning capability.
We propose a unified evaluation framework named LLM4Vuln, which separates LLMs' vulnerability reasoning from their other capabilities.
- Score: 18.025174693883788
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have demonstrated significant potential for
many downstream tasks, including those requiring human-level intelligence, such as
vulnerability detection. However, recent attempts to use LLMs for vulnerability
detection are still preliminary, as they lack an in-depth understanding of a subject
LLM's vulnerability reasoning capability - whether it originates from the model itself
or from external assistance, such as invoking tool support and retrieving vulnerability
knowledge. In this paper, we aim to decouple LLMs' vulnerability reasoning capability
from their other capabilities, including the ability to actively seek additional
information (e.g., via function calling in SOTA models), adopt relevant vulnerability
knowledge (e.g., via vector-based matching and retrieval), and follow instructions to
output structured results. To this end, we propose a unified evaluation framework named
LLM4Vuln, which separates LLMs' vulnerability reasoning from their other capabilities
and evaluates how LLMs' vulnerability reasoning could be enhanced when combined with
the enhancement of other capabilities. To demonstrate the effectiveness of LLM4Vuln, we
have designed controlled experiments using 75 ground-truth smart contract
vulnerabilities, which were extensively audited as high-risk on Code4rena from August
to November 2023, and tested them in 4,950 different scenarios across three
representative LLMs (GPT-4, Mixtral, and Code Llama). Our results not only reveal ten
findings regarding the varying effects of knowledge enhancement, context
supplementation, prompt schemes, and models, but also enable us to identify 9 zero-day
vulnerabilities in two pilot bug bounty programs with over 1,000 USD being awarded.
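As a concrete illustration of the decoupling described above, the minimal sketch below separates two of the auxiliary capabilities from the subject LLM's raw vulnerability reasoning: vector-based matching of vulnerability knowledge and instruction-following for structured output. This is not the authors' released implementation; the names (KNOWLEDGE_BASE, embed_text, call_subject_llm) are hypothetical placeholders, and the hashing-based embedding only stands in for a real sentence-embedding model.

```python
# Minimal sketch (assumption, not the LLM4Vuln code) of decoupling knowledge
# retrieval and structured output from the subject LLM's reasoning.
import json
import math
from hashlib import sha256

# Hypothetical knowledge base of audited vulnerability summaries.
KNOWLEDGE_BASE = [
    "Reentrancy: external call before state update lets callers re-enter.",
    "Integer overflow in reward accounting inflates user balances.",
    "Missing access control on privileged setter functions.",
]

def embed_text(text: str, dim: int = 64) -> list[float]:
    """Placeholder embedding: a real setup would use a sentence-embedding
    model; this hashing trick only keeps the example self-contained."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[int(sha256(token.encode()).hexdigest(), 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def retrieve_knowledge(code: str, top_k: int = 1) -> list[str]:
    """Vector-based matching: return the knowledge entries most similar
    to the target code snippet."""
    q = embed_text(code)
    scored = [
        (sum(a * b for a, b in zip(q, embed_text(k))), k)
        for k in KNOWLEDGE_BASE
    ]
    return [k for _, k in sorted(scored, reverse=True)[:top_k]]

def build_prompt(code: str, knowledge: list[str]) -> str:
    """Instruction-following part: request a structured verdict so reasoning
    quality can be scored independently of output formatting."""
    return (
        "You are auditing a smart contract function.\n"
        "Relevant knowledge:\n- " + "\n- ".join(knowledge) + "\n\n"
        "Code:\n" + code + "\n\n"
        'Answer in JSON: {"vulnerable": true/false, "type": "...", "reason": "..."}'
    )

def evaluate(code: str, call_subject_llm) -> dict:
    """call_subject_llm is whatever LLM backend is under evaluation."""
    prompt = build_prompt(code, retrieve_knowledge(code))
    raw = call_subject_llm(prompt)
    try:
        return json.loads(raw)  # structured result expected
    except json.JSONDecodeError:
        return {"vulnerable": None, "reason": "unparseable output"}

# Usage with a stubbed model, just to show the control flow:
if __name__ == "__main__":
    stub = lambda _p: ('{"vulnerable": true, "type": "reentrancy", '
                       '"reason": "external call precedes state update"}')
    print(evaluate('function withdraw() { msg.sender.call{value: bal}(""); bal = 0; }', stub))
```

Keeping retrieval and output formatting outside the model is what allows each capability to be varied independently, which is the kind of controlled comparison the 4,950 scenarios above require.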
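The abstract also names tool invocation (function calling) as a capability to decouple. The hedged sketch below shows one way a subject LLM could actively request additional context, here a callee's source, before committing to a verdict; the tool name get_callee_source and the stubbed model replies are assumptions for illustration, not the paper's protocol.

```python
# Hedged sketch of the "actively seek additional information" capability:
# the subject LLM may ask for a callee's body before answering. In a real
# setup the request would go through the provider's function-calling API.

CONTRACT_SOURCES = {  # hypothetical lookup table for callee bodies
    "_transfer": "function _transfer(address to, uint256 amt) internal { ... }",
}

def get_callee_source(name: str) -> str:
    """Tool the model can invoke to pull in code it has not yet seen."""
    return CONTRACT_SOURCES.get(name, "// source not found")

def run_with_tools(call_subject_llm, prompt: str, max_rounds: int = 3) -> dict:
    """Loop until the model stops requesting tools and returns a verdict."""
    context = prompt
    for _ in range(max_rounds):
        reply = call_subject_llm(context)  # dict: tool request or final verdict
        if reply.get("tool") == "get_callee_source":
            context += "\n" + get_callee_source(reply["argument"])
            continue
        return reply  # final structured verdict
    return {"vulnerable": None, "reason": "tool budget exhausted"}

# Stubbed model: first asks for _transfer, then answers.
replies = iter([
    {"tool": "get_callee_source", "argument": "_transfer"},
    {"vulnerable": False, "reason": "state updated before external call"},
])
print(run_with_tools(lambda _ctx: next(replies), "audit withdraw()"))
```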
Related papers
- Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses.
Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives.
The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z) - AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models [95.09157454599605]
Large Language Models (LLMs) are becoming increasingly powerful, but they still exhibit significant but subtle weaknesses.
Traditional benchmarking approaches cannot thoroughly pinpoint specific model deficiencies.
We introduce a unified framework, AutoDetect, to automatically expose weaknesses in LLMs across various tasks.
arXiv Detail & Related papers (2024-06-24T15:16:45Z) - SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors [64.9938658716425]
Existing evaluations of large language models' (LLMs) ability to recognize and reject unsafe user requests face three limitations.
First, existing methods often use a coarse-grained taxonomy of unsafe topics and over-represent some fine-grained topics.
Second, the linguistic characteristics and formatting of prompts, such as different languages and dialects, are often overlooked and only implicitly considered in many evaluations.
Third, existing evaluations rely on large LLMs as evaluators, which can be expensive.
arXiv Detail & Related papers (2024-06-20T17:56:07Z) - Towards Effectively Detecting and Explaining Vulnerabilities Using Large Language Models [17.96542494363619]
Large language models (LLMs) have shown a remarkable capability in the comprehension of complicated context and content generation.
We propose LLMVulExp, a framework that utilizes LLMs for vulnerability detection and explanation.
We find that LLMVulExp can effectively enable the LLMs to perform vulnerability detection (e.g., over 90% F1 score on SeVC dataset) and explanation.
arXiv Detail & Related papers (2024-06-14T04:01:25Z) - Investigating the prompt leakage effect and black-box defenses for multi-turn LLM interactions [125.21418304558948]
Prompt leakage in large language models (LLMs) poses a significant security and privacy threat.
Prompt leakage in multi-turn LLM interactions, along with mitigation strategies, has not been studied in a standardized manner.
This paper investigates LLM vulnerabilities against prompt leakage across 4 diverse domains and 10 closed- and open-source LLMs.
arXiv Detail & Related papers (2024-04-24T23:39:58Z) - Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress.
Our investigation exposes a critical oversight in the belief that base LLMs, lacking instruction tuning, cannot be directed to carry out misuse.
By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z) - Multitask-based Evaluation of Open-Source LLM on Software Vulnerability [2.7692028382314815]
This paper proposes a pipeline for quantitatively evaluating interactive Large Language Models (LLMs) using publicly available datasets.
We carry out an extensive technical evaluation of LLMs using Big-Vul covering four different common software vulnerability tasks.
We find that the existing state-of-the-art approaches and pre-trained Language Models (LMs) are generally superior to LLMs in software vulnerability detection.
arXiv Detail & Related papers (2024-04-02T15:52:05Z) - An Empirical Study of Automated Vulnerability Localization with Large Language Models [21.84971967029474]
Large Language Models (LLMs) have shown potential in various domains, yet their effectiveness in vulnerability localization remains underexplored.
Our investigation encompasses 10+ leading LLMs suitable for code analysis, including ChatGPT and various open-source models.
We explore the efficacy of these LLMs using 4 distinct paradigms: zero-shot learning, one-shot learning, discriminative fine-tuning, and generative fine-tuning.
arXiv Detail & Related papers (2024-03-30T08:42:10Z) - How Far Have We Gone in Vulnerability Detection Using Large Language Models [15.09461331135668]
We introduce a comprehensive vulnerability benchmark, VulBench.
This benchmark aggregates high-quality data from a wide range of CTF challenges and real-world applications.
We find that several LLMs outperform traditional deep learning approaches in vulnerability detection.
arXiv Detail & Related papers (2023-11-21T08:20:39Z) - Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities [12.82645410161464]
Large Language Models (LLMs) have demonstrated remarkable performance on code-related tasks.
We evaluate whether pre-trained LLMs can detect security vulnerabilities and address the limitations of existing tools.
arXiv Detail & Related papers (2023-11-16T13:17:20Z) - Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs [59.596335292426105]
This paper collects the first open-source dataset to evaluate safeguards in large language models.
We train several BERT-like classifiers to achieve results comparable with GPT-4 on automatic safety evaluation.
arXiv Detail & Related papers (2023-08-25T14:02:12Z)
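For the Do-Not-Answer entry above, a "BERT-like classifier" for automatic safety evaluation typically means a sequence-classification fine-tune. The sketch below, using the Hugging Face transformers API, is a generic illustration under that assumption; the tiny in-line dataset and binary label scheme are placeholders, not the paper's actual data or labels.

```python
# Generic sketch (assumption, not the Do-Not-Answer release): fine-tune a
# BERT-like encoder to label model responses as safe (0) or harmful (1).
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder training pairs; the real dataset pairs risky prompts with
# model responses and human safety labels.
texts = [
    "I can't help with that request.",
    "Sure, here is how to pick a lock on someone else's door...",
]
labels = torch.tensor([0, 1])

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

enc = tok(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few passes over the toy batch
    out = model(**enc, labels=labels)  # cross-entropy loss computed internally
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Inference: score a new response under this toy label scheme.
model.eval()
with torch.no_grad():
    probe = tok(["Here is a response to classify ..."], return_tensors="pt")
    pred = model(**probe).logits.softmax(-1)
print(pred)  # [p(safe), p(harmful)]
```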