Evaluating Large Language Models in Vulnerability Detection Under Variable Context Windows
- URL: http://arxiv.org/abs/2502.00064v1
- Date: Thu, 30 Jan 2025 20:44:46 GMT
- Title: Evaluating Large Language Models in Vulnerability Detection Under Variable Context Windows
- Authors: Jie Lin, David Mohaisen,
- Abstract summary: This study examines the impact of tokenized Java code length on the accuracy and explicitness of ten major LLMs in vulnerability detection.<n>We found inconsistencies across models: some, like GPT-4, Mistral, and Mixtral, showed robustness, while others exhibited a significant link between tokenized length and performance.
- Score: 17.088307683654577
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This study examines the impact of tokenized Java code length on the accuracy and explicitness of ten major LLMs in vulnerability detection. Using chi-square tests and known ground truth, we found inconsistencies across models: some, like GPT-4, Mistral, and Mixtral, showed robustness, while others exhibited a significant link between tokenized length and performance. We recommend future LLM development focus on minimizing the influence of input length for better vulnerability detection. Additionally, preprocessing techniques that reduce token count while preserving code structure could enhance LLM accuracy and explicitness in these tasks.
Related papers
- Everything You Wanted to Know About LLM-based Vulnerability Detection But Were Afraid to Ask [30.819697001992154]
Large Language Models are a promising tool for automated vulnerability detection.
Despite widespread adoption, a critical question remains: Are LLMs truly effective at detecting real-world vulnerabilities?
This paper challenges three widely held community beliefs: that LLMs are (i) unreliable, (ii) insensitive to code patches, and (iii) performance-plateaued across model scales.
arXiv Detail & Related papers (2025-04-18T05:32:47Z) - ObscuraCoder: Powering Efficient Code LM Pre-Training Via Obfuscation Grounding [60.37988508851391]
Language models (LMs) have become a staple of the code-writing toolbox.
Research exploring modifications to Code-LMs' pre-training objectives, geared towards improving data efficiency and better disentangling between syntax and semantics, has been noticeably sparse.
In this work, we examine grounding on obfuscated code as a means of helping Code-LMs look beyond the surface-form syntax and enhance their pre-training sample efficiency.
arXiv Detail & Related papers (2025-03-27T23:08:53Z) - CASTLE: Benchmarking Dataset for Static Code Analyzers and LLMs towards CWE Detection [2.5228276786940182]
This paper introduces CASTLE, a benchmarking framework for evaluating the vulnerability detection capabilities of different methods.
We assess 13 static analysis tools, 10 LLMs, and 2 formal verification tools using a hand-crafted dataset of 250 micro-benchmark programs covering 25 common CWEs.
arXiv Detail & Related papers (2025-03-12T14:30:05Z) - What You See Is Not Always What You Get: An Empirical Study of Code Comprehension by Large Language Models [0.5735035463793009]
We study the specific LLM vulnerability to imperceptible character attacks, a type of prompt-injection attack that uses special characters to befuddle an LLM.<n>We devise four categories of attacks and investigate their effects on the performance outcomes of tasks relating to code analysis and code comprehension.
arXiv Detail & Related papers (2024-12-11T04:52:41Z) - If LLMs Would Just Look: Simple Line-by-line Checking Improves Vulnerability Localization [33.42741297088634]
Traditional methods for vulnerability localization, such as manual code audits or rule-based tools, are often time-consuming and limited in scope.
This paper introduces LOVA, a novel framework leveraging the self-attention mechanisms inherent in large language models.
We demonstrate that LOVA significantly outperforms existing LLM-based approaches, achieving up to a 5.3x improvement in F1-scores.
arXiv Detail & Related papers (2024-10-20T05:02:18Z) - Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses.
Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives.
The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z) - M2CVD: Enhancing Vulnerability Semantic through Multi-Model Collaboration for Code Vulnerability Detection [52.4455893010468]
Large Language Models (LLMs) have strong capabilities in code comprehension, but fine-tuning costs and semantic alignment issues limit their project-specific optimization.
Code models such CodeBERT are easy to fine-tune, but it is often difficult to learn vulnerability semantics from complex code languages.
This paper introduces the Multi-Model Collaborative Vulnerability Detection approach (M2CVD) to improve the detection accuracy of code models.
arXiv Detail & Related papers (2024-06-10T00:05:49Z) - Security Vulnerability Detection with Multitask Self-Instructed Fine-Tuning of Large Language Models [8.167614500821223]
We introduce MSIVD, multitask self-instructed fine-tuning for vulnerability detection, inspired by chain-of-thought prompting and LLM self-instruction.
Our experiments demonstrate that MSIVD achieves superior performance, outperforming the highest LLM-based vulnerability detector baseline (LineVul) with a F1 score of 0.92 on the BigVul dataset, and 0.48 on the PreciseBugs dataset.
arXiv Detail & Related papers (2024-06-09T19:18:05Z) - Characterizing Truthfulness in Large Language Model Generations with
Local Intrinsic Dimension [63.330262740414646]
We study how to characterize and predict the truthfulness of texts generated from large language models (LLMs)
We suggest investigating internal activations and quantifying LLM's truthfulness using the local intrinsic dimension (LID) of model activations.
arXiv Detail & Related papers (2024-02-28T04:56:21Z) - Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models [48.35385912526338]
This paper explores the impact of extending input lengths on the capabilities of Large Language Models (LLMs)
We isolate the effect of input length using multiple versions of the same sample, each being extended with padding of different lengths, types and locations.
We show that the degradation trend appears in every version of our dataset, although at different intensities.
arXiv Detail & Related papers (2024-02-19T16:04:53Z) - Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities [12.82645410161464]
We evaluate the effectiveness of 16 pre-trained Large Language Models on 5,000 code samples from five diverse security datasets.
Overall, LLMs show modest effectiveness in detecting vulnerabilities, obtaining an average accuracy of 62.8% and F1 score of 0.71 across datasets.
We find that advanced prompting strategies that involve step-by-step analysis significantly improve performance of LLMs on real-world datasets in terms of F1 score (by upto 0.18 on average)
arXiv Detail & Related papers (2023-11-16T13:17:20Z) - Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z) - To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
Second, we examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives.
arXiv Detail & Related papers (2023-05-22T17:02:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.