Related papers: LLM-Assisted Static Analysis for Detecting Security Vulnerabilities

LLM-Assisted Static Analysis for Detecting Security Vulnerabilities

URL: http://arxiv.org/abs/2405.17238v1
Date: Mon, 27 May 2024 14:53:35 GMT
Title: LLM-Assisted Static Analysis for Detecting Security Vulnerabilities
Authors: Ziyang Li, Saikat Dutta, Mayur Naik,
Abstract summary: We propose IRIS, which combines large language models with static analysis to perform whole-repository reasoning to detect vulnerabilities. IRIS detects 69 out of 120 vulnerabilities in CWE-Bench-Java using GPT-4, while the state-of-the-art static analysis tool only detects 27. IRIS also significantly reduces the number of false alarms (by more than 80% in the best case)
Score: 14.188864624736938
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Software is prone to security vulnerabilities. Program analysis tools to detect them have limited effectiveness in practice. While large language models (or LLMs) have shown impressive code generation capabilities, they cannot do complex reasoning over code to detect such vulnerabilities, especially because this task requires whole-repository analysis. In this work, we propose IRIS, the first approach that systematically combines LLMs with static analysis to perform whole-repository reasoning to detect security vulnerabilities. We curate a new dataset, CWE-Bench-Java, comprising 120 manually validated security vulnerabilities in real-world Java projects. These projects are complex, with an average of 300,000 lines of code and a maximum of up to 7 million. Out of 120 vulnerabilities in CWE-Bench-Java, IRIS detects 69 using GPT-4, while the state-of-the-art static analysis tool only detects 27. Further, IRIS also significantly reduces the number of false alarms (by more than 80% in the best case).

Related papers

CASTLE: Benchmarking Dataset for Static Code Analyzers and LLMs towards CWE Detection [2.5228276786940182]
This paper introduces CASTLE, a benchmarking framework for evaluating the vulnerability detection capabilities of different methods. We assess 13 static analysis tools, 10 LLMs, and 2 formal verification tools using a hand-crafted dataset of 250 micro-benchmark programs covering 25 common CWEs.
arXiv Detail & Related papers (2025-03-12T14:30:05Z)
Making Them a Malicious Database: Exploiting Query Code to Jailbreak Aligned Large Language Models [44.27350994698781]
We propose a novel framework to examine the generalizability of safety alignment. By treating LLMs as knowledge databases, we translate malicious queries in natural language into structured non-natural query language. We conduct extensive experiments on mainstream LLMs, and the results show that QueryAttack can achieve high attack success rates.
arXiv Detail & Related papers (2025-02-13T19:13:03Z)
Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities [63.603861880022954]
We introduce ADV-LLM, an iterative self-tuning process that crafts adversarial LLMs with enhanced jailbreak ability. Our framework significantly reduces the computational cost of generating adversarial suffixes while achieving nearly 100% ASR on various open-source LLMs. It exhibits strong attack transferability to closed-source models, achieving 99% ASR on GPT-3.5 and 49% ASR on GPT-4, despite being optimized solely on Llama3.
arXiv Detail & Related papers (2024-10-24T06:36:12Z)
ProveRAG: Provenance-Driven Vulnerability Analysis with Automated Retrieval-Augmented LLMs [1.7191671053507043]
Security analysts face the challenge of mitigating newly discovered vulnerabilities in real-time. Over 300,000 Common Vulnerabilities and Exposures have been identified since 1999. Over 25,000 vulnerabilities have been identified so far in 2024.
arXiv Detail & Related papers (2024-10-22T20:28:57Z)
Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses. Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives. The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z)
SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors [64.9938658716425]
Existing evaluations of large language models' (LLMs) ability to recognize and reject unsafe user requests face three limitations. First, existing methods often use coarse-grained of unsafe topics, and are over-representing some fine-grained topics. Second, linguistic characteristics and formatting of prompts are often overlooked, like different languages, dialects, and more -- which are only implicitly considered in many evaluations. Third, existing evaluations rely on large LLMs for evaluation, which can be expensive.
arXiv Detail & Related papers (2024-06-20T17:56:07Z)
VulDetectBench: Evaluating the Deep Capability of Vulnerability Detection with Large Language Models [12.465060623389151]
This study introduces a new benchmark, VulDetectBench, to assess the vulnerability detection capabilities of Large Language Models (LLMs) The benchmark comprehensively evaluates LLM's ability to identify, classify, and locate vulnerabilities through five tasks of increasing difficulty. Our benchmark effectively evaluates the capabilities of various LLMs at different levels in the specific task of vulnerability detection, providing a foundation for future research and improvements in this critical area of code security.
arXiv Detail & Related papers (2024-06-11T13:42:57Z)
Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress. Our investigation exposes a critical oversight in this belief. By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z)
An Insight into Security Code Review with LLMs: Capabilities, Obstacles and Influential Factors [9.309745288471374]
Security code review is a time-consuming and labor-intensive process. Existing security analysis tools struggle with poor generalization, high false positive rates, and coarse detection granularity. Large Language Models (LLMs) have been considered promising candidates for addressing those challenges.
arXiv Detail & Related papers (2024-01-29T17:13:44Z)
Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities [12.82645410161464]
We evaluate the effectiveness of 16 pre-trained Large Language Models on 5,000 code samples from five diverse security datasets. Overall, LLMs show modest effectiveness in detecting vulnerabilities, obtaining an average accuracy of 62.8% and F1 score of 0.71 across datasets. We find that advanced prompting strategies that involve step-by-step analysis significantly improve performance of LLMs on real-world datasets in terms of F1 score (by upto 0.18 on average)
arXiv Detail & Related papers (2023-11-16T13:17:20Z)
Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs [59.596335292426105]
This paper collects the first open-source dataset to evaluate safeguards in large language models. We train several BERT-like classifiers to achieve results comparable with GPT-4 on automatic safety evaluation.
arXiv Detail & Related papers (2023-08-25T14:02:12Z)
Can Large Language Models Find And Fix Vulnerable Software? [0.0]
GPT-4 identified approximately four times the vulnerabilities than its counterparts. It provided viable fixes for each vulnerability, demonstrating a low rate of false positives. GPT-4's code corrections led to a 90% reduction in vulnerabilities, requiring only an 11% increase in code lines.
arXiv Detail & Related papers (2023-08-20T19:33:12Z)
VELVET: a noVel Ensemble Learning approach to automatically locate VulnErable sTatements [62.93814803258067]
This paper presents VELVET, a novel ensemble learning approach to locate vulnerable statements in source code. Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph. VELVET achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively.
arXiv Detail & Related papers (2021-12-20T22:45:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.