LLM vs. SAST: A Technical Analysis on Detecting Coding Bugs of GPT4-Advanced Data Analysis
- URL: http://arxiv.org/abs/2506.15212v1
- Date: Wed, 18 Jun 2025 07:47:12 GMT
- Title: LLM vs. SAST: A Technical Analysis on Detecting Coding Bugs of GPT4-Advanced Data Analysis
- Authors: Madjid G. Tehrani, Eldar Sultanow, William J. Buchanan, Mahkame Houmani, Christel H. Djaha Fodja,
- Abstract summary: GPT-4 (Advanced Data Analysis) outperforms SAST, achieving 94% accuracy in detecting 32 types of exploitable vulnerabilities. This study also addresses the potential security concerns surrounding LLMs.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rapid advancements in Natural Language Processing (NLP), large language models (LLMs) like GPT-4 have gained significant traction in diverse applications, including security vulnerability scanning. This paper investigates the efficacy of GPT-4 in identifying software vulnerabilities compared to traditional Static Application Security Testing (SAST) tools. Drawing from an array of security mistakes, our analysis underscores the potent capabilities of GPT-4 in LLM-enhanced vulnerability scanning. We found that GPT-4 (Advanced Data Analysis) outperforms SAST, achieving 94% accuracy in detecting 32 types of exploitable vulnerabilities. This study also addresses the potential security concerns surrounding LLMs, emphasising the imperative of security by design/default and other security best practices for AI.
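As a rough illustration of the LLM side of such a scan, the sketch below submits a known-vulnerable snippet to a chat-completion endpoint and asks for findings. The model name, prompt wording, and use of the `openai` Python client are illustrative assumptions, not the authors' actual setup (the paper evaluated GPT-4's Advanced Data Analysis mode).

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SNIPPET = '''
import sqlite3

def get_user(db, username):
    cur = db.cursor()
    # String interpolation into SQL -- classic injection risk (CWE-89)
    cur.execute(f"SELECT * FROM users WHERE name = '{username}'")
    return cur.fetchone()
'''

PROMPT = (
    "Act as a security reviewer. List every exploitable vulnerability "
    "in the following Python code. For each finding, give the CWE ID, "
    "the affected line, and a one-sentence fix.\n\n" + SNIPPET
)

response = client.chat.completions.create(
    model="gpt-4",  # stand-in identifier, not the paper's exact configuration
    messages=[{"role": "user", "content": PROMPT}],
    temperature=0,  # keep scans as repeatable as the API allows
)
print(response.choices[0].message.content)
```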
Related papers
- The Hidden Structure -- Improving Legal Document Understanding Through Explicit Text Formatting [44.99833362998488]
Legal contracts possess an inherent, semantically vital structure (e.g., sections, clauses) that is crucial for human comprehension. This paper investigates the effects of explicit input text structure and prompt engineering on the performance of GPT-4o and GPT-4.1 on a legal question-answering task.
arXiv Detail & Related papers (2025-05-19T08:25:21Z)
- Benchmarking Prompt Engineering Techniques for Secure Code Generation with GPT Models [1.0874597293913013]
We implement a benchmark to assess the impact of various prompt engineering strategies on code security. We tested multiple prompt engineering techniques on GPT-3.5-turbo, GPT-4o, and GPT-4o-mini. All tested models demonstrated the ability to detect and repair between 41.9% and 68.7% of vulnerabilities in previously generated code.
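As a hedged sketch of what such a benchmark loop might look like, the snippet below generates code under two prompt variants; the strategy wordings, model choice, and scoring comment are illustrative assumptions rather than the paper's exact setup.

```python
from openai import OpenAI

client = OpenAI()

TASK = "Write a Python function that stores a user's password in SQLite."

# Two of the many strategies such benchmarks compare (wording is illustrative):
STRATEGIES = {
    "baseline": TASK,
    "security_persona": (
        "You are a security-conscious developer. Follow OWASP guidance, "
        "hash passwords with a salted KDF, and avoid SQL injection.\n" + TASK
    ),
}

for name, prompt in STRATEGIES.items():
    out = client.chat.completions.create(
        model="gpt-4o-mini",  # one of the models the paper tests
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    code = out.choices[0].message.content
    # In a full benchmark, each completion would be fed to a scanner
    # (e.g., CodeQL or Bandit) and scored for vulnerability density.
    print(f"--- {name} ---\n{code[:300]}\n")
```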
arXiv Detail & Related papers (2025-02-09T21:23:07Z)
- How Well Do Large Language Models Serve as End-to-End Secure Code Agents for Python? [42.119319820752324]
We studied GPT-3.5 and GPT-4's capability to identify and repair vulnerabilities in the code generated by four popular LLMs. By manually or automatically reviewing 4,900 pieces of code, our study reveals that large language models lack awareness of scenario-relevant security risks. To address the limitation of a single round of repair, we developed a lightweight tool that prompts LLMs to construct safer source code.
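A minimal sketch of a multi-round review-and-repair loop of this kind is shown below, assuming an `openai`-style chat API; the prompts, the NONE sentinel, and the fixed round count are illustrative choices, not the tool's actual design.

```python
from openai import OpenAI

client = OpenAI()

def review_and_repair(code: str, rounds: int = 3, model: str = "gpt-4") -> str:
    """Iteratively ask the model to find flaws, then patch them.

    A toy version of a multi-round repair loop; the real tool's prompts,
    stopping rule, and verification step may differ.
    """
    for _ in range(rounds):
        # Detection step: ask for findings, or an explicit all-clear.
        findings = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": "List security vulnerabilities in this code, "
                           "or reply NONE:\n" + code,
            }],
            temperature=0,
        ).choices[0].message.content
        if findings.strip() == "NONE":
            break  # nothing left to repair
        # Repair step: feed the findings back and request revised code.
        code = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": f"Rewrite the code fixing these issues:\n"
                           f"{findings}\n\nCode:\n{code}\n"
                           "Return only the revised code.",
            }],
            temperature=0,
        ).choices[0].message.content
    return code
```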
arXiv Detail & Related papers (2024-08-20T02:42:29Z)
- Can Large Language Models Automatically Jailbreak GPT-4V? [64.04997365446468]
We introduce AutoJailbreak, an innovative automatic jailbreak technique inspired by prompt optimization.
Our experiments demonstrate that AutoJailbreak significantly surpasses conventional methods, achieving an Attack Success Rate (ASR) exceeding 95.3%.
This research sheds light on strengthening GPT-4V security, underscoring the potential for LLMs to be exploited in compromising GPT-4V integrity.
arXiv Detail & Related papers (2024-07-23T17:50:45Z)
- Comparison of Static Application Security Testing Tools and Large Language Models for Repo-level Vulnerability Detection [11.13802281700894]
Static Application Security Testing (SAST) is commonly used to scan source code for security vulnerabilities.
Deep learning (DL)-based methods have demonstrated their potential in software vulnerability detection.
This paper compares 15 diverse SAST tools with 12 popular or state-of-the-art open-source LLMs in detecting software vulnerabilities.
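For the SAST side of such a comparison, a scanner can be driven programmatically. The sketch below wraps Bandit, a common open-source SAST tool for Python (used here only as an example; the paper does not necessarily include it among its 15 tools), and normalises its JSON report.

```python
import json
import subprocess

def run_bandit(repo_path: str) -> list[dict]:
    """Run Bandit over a repository and return its findings as dicts."""
    proc = subprocess.run(
        ["bandit", "-r", repo_path, "-f", "json", "-q"],
        capture_output=True, text=True,
    )
    # Bandit exits nonzero when it finds issues, so don't check returncode.
    report = json.loads(proc.stdout)
    return [
        {
            "file": r["filename"],
            "line": r["line_number"],
            "test": r["test_id"],        # e.g. B608: hardcoded SQL expressions
            "severity": r["issue_severity"],
        }
        for r in report.get("results", [])
    ]

if __name__ == "__main__":
    for finding in run_bandit("./my_repo"):
        print(finding)
```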
arXiv Detail & Related papers (2024-07-23T07:21:14Z)
- Unveiling the Safety of GPT-4o: An Empirical Study using Jailbreak Attacks [65.84623493488633]
This paper conducts a rigorous evaluation of GPT-4o against jailbreak attacks.
The newly introduced audio modality opens up new attack vectors for jailbreak attacks on GPT-4o.
Existing black-box multimodal jailbreak attack methods are largely ineffective against GPT-4o and GPT-4V.
arXiv Detail & Related papers (2024-06-10T14:18:56Z)
- IRIS: LLM-Assisted Static Analysis for Detecting Security Vulnerabilities [14.188864624736938]
Large language models (LLMs) have shown impressive code generation capabilities, but they cannot do complex reasoning over code to detect such vulnerabilities. We propose IRIS, a neuro-symbolic approach that systematically combines LLMs with static analysis to perform whole-repository reasoning for security vulnerability detection.
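As a toy stand-in for this kind of neuro-symbolic split, the sketch below pairs a crude AST scan for dangerous sinks (the symbolic pass) with an LLM triage prompt (the neural pass). The sink list, prompt, and client usage are assumptions for illustration only; IRIS itself is built differently.

```python
import ast
from openai import OpenAI

client = OpenAI()

# Symbolic pass: flag calls that are common injection/execution sinks.
SINKS = {"eval", "exec", "system", "popen"}

def find_sink_calls(source: str) -> list[tuple[int, str]]:
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            # Handles both bare names (eval) and attributes (os.system).
            name = getattr(node.func, "id", getattr(node.func, "attr", ""))
            if name in SINKS:
                hits.append((node.lineno, name))
    return hits

def triage_with_llm(source: str, hits: list[tuple[int, str]]) -> str:
    """Neural pass: ask the model whether each flagged sink is reachable
    with attacker-controlled input (the hard part for pure static tools)."""
    listing = "\n".join(f"line {ln}: call to {fn}" for ln, fn in hits)
    prompt = (
        "A static pass flagged these potential sinks:\n" + listing +
        "\n\nGiven the code below, say for each whether untrusted input "
        "can reach it, and why.\n\n" + source
    )
    out = client.chat.completions.create(
        model="gpt-4", temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return out.choices[0].message.content
```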
arXiv Detail & Related papers (2024-05-27T14:53:35Z)
- Benchmarking GPT-4 on Algorithmic Problems: A Systematic Evaluation of Prompting Strategies [47.129504708849446]
Large Language Models (LLMs) have revolutionized the field of Natural Language Processing.
LLMs lack systematic generalization, the ability to extrapolate learned statistical regularities outside the training distribution.
In this work, we offer a systematic benchmarking of GPT-4, one of the most advanced LLMs available.
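A benchmark of prompting strategies on algorithmic problems reduces to a scoring harness like the hedged sketch below; the binary-addition task, strategy wordings, and exact-match check are illustrative stand-ins for the paper's actual tasks and metrics.

```python
from openai import OpenAI

client = OpenAI()

# Toy algorithmic task family: binary addition (answers are exactly checkable).
CASES = [("1011", "110"), ("111", "1")]

STRATEGIES = {
    "zero_shot": "Add the binary numbers {a} and {b}. "
                 "Answer with the binary sum only.",
    "chain_of_thought": (
        "Add the binary numbers {a} and {b}. Work bit by bit, carrying as "
        "needed, then give the final binary sum on the last line."
    ),
}

def score(strategy: str) -> float:
    correct = 0
    for a, b in CASES:
        reply = client.chat.completions.create(
            model="gpt-4", temperature=0,
            messages=[{"role": "user",
                       "content": STRATEGIES[strategy].format(a=a, b=b)}],
        ).choices[0].message.content
        expected = bin(int(a, 2) + int(b, 2))[2:]  # ground truth in binary
        correct += expected in (reply.splitlines() or [""])[-1]
    return correct / len(CASES)

for name in STRATEGIES:
    print(name, score(name))
```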
arXiv Detail & Related papers (2024-02-27T10:44:52Z)
- An Insight into Security Code Review with LLMs: Capabilities, Obstacles, and Influential Factors [9.309745288471374]
Security code review is a time-consuming and labor-intensive process. Existing security analysis tools struggle with poor generalization, high false positive rates, and coarse detection granularity. Large Language Models (LLMs) have been considered promising candidates for addressing those challenges.
arXiv Detail & Related papers (2024-01-29T17:13:44Z)
- The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness [56.174255970895466]
Large Language Models (LLMs) play an increasingly pivotal role in natural language processing applications.
This paper presents the Safety and Over-Defensiveness Evaluation (SODE) benchmark.
arXiv Detail & Related papers (2023-12-30T17:37:06Z)
- Can Large Language Models Find And Fix Vulnerable Software? [0.0]
GPT-4 identified approximately four times as many vulnerabilities as its counterparts.
It provided viable fixes for each vulnerability, demonstrating a low rate of false positives.
GPT-4's code corrections led to a 90% reduction in vulnerabilities, requiring only an 11% increase in code lines.
arXiv Detail & Related papers (2023-08-20T19:33:12Z)
- A LLM Assisted Exploitation of AI-Guardian [57.572998144258705]
We evaluate the robustness of AI-Guardian, a recent defense to adversarial examples published at IEEE S&P 2023.
We write none of the code to attack this model, and instead prompt GPT-4 to implement all attack algorithms following our instructions and guidance.
This process was surprisingly effective and efficient, with the language model at times producing code from ambiguous instructions faster than the author of this paper could have done.
arXiv Detail & Related papers (2023-07-20T17:33:25Z)