Vulnerability Detection in Popular Programming Languages with Language Models
- URL: http://arxiv.org/abs/2412.15905v2
- Date: Mon, 23 Dec 2024 06:12:25 GMT
- Title: Vulnerability Detection in Popular Programming Languages with Language Models
- Authors: Syafiq Al Atiiq, Christian Gehrmann, Kevin Dahlén,
- Abstract summary: This paper investigates the effectiveness of Language Models (LMs) for vulnerability detection in JavaScript, Java, Python, PHP, and Go.
We find that JavaScript exhibits the best performance, with considerably better and more practical detection capabilities compared to C/C++.
- Score: 2.048226951354646
- License:
- Abstract: Vulnerability detection is crucial for maintaining software security, and recent research has explored the use of Language Models (LMs) for this task. While LMs have shown promising results, their performance has been inconsistent across datasets, particularly when generalizing to unseen code. Moreover, most studies have focused on the C/C++ programming language, with limited attention given to other popular languages. This paper addresses this gap by investigating the effectiveness of LMs for vulnerability detection in JavaScript, Java, Python, PHP, and Go, in addition to C/C++ for comparison. We utilize the CVEFixes dataset to create a diverse collection of language-specific vulnerabilities and preprocess the data to ensure quality and integrity. We fine-tune and evaluate state-of-the-art LMs across the selected languages and find that the performance of vulnerability detection varies significantly. JavaScript exhibits the best performance, with considerably better and more practical detection capabilities compared to C/C++. We also examine the relationship between code complexity and detection performance across the six languages and find only a weak correlation between code complexity metrics and the models' F1 scores.
Related papers
- Enhancing Code Generation for Low-Resource Languages: No Silver Bullet [55.39571645315926]
Large Language Models (LLMs) rely on large and diverse datasets to learn syntax, semantics, and usage patterns of programming languages.
For low-resource languages, the limited availability of such data hampers the models' ability to generalize effectively.
We present an empirical study investigating the effectiveness of several approaches for boosting LLMs' performance on low-resource languages.
arXiv Detail & Related papers (2025-01-31T12:23:28Z) - Bridging the Language Gaps in Large Language Models with Inference-Time Cross-Lingual Intervention [71.12193680015622]
Large Language Models (LLMs) have shown remarkable capabilities in natural language processing.
LLMs exhibit significant performance gaps among different languages.
We propose Inference-Time Cross-Lingual Intervention (INCLINE) to overcome these limitations without incurring significant costs.
arXiv Detail & Related papers (2024-10-16T11:23:03Z) - CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution [50.7413285637879]
The CRUXEVAL-X code reasoning benchmark contains 19 programming languages.
It comprises at least 600 subjects for each language, along with 19K content-consistent tests in total.
Even a model trained solely on Python can achieve at most 34.4% Pass@1 in other languages.
arXiv Detail & Related papers (2024-08-23T11:43:00Z) - Large Language Models for Secure Code Assessment: A Multi-Language Empirical Study [1.9116784879310031]
We show that GPT-4o achieves the highest vulnerability detection and CWE classification scores using a few-shot setting.
We develop a library called CODEGUARDIAN integrated with VSCode which enables developers to perform LLM-assisted real-time vulnerability analysis.
arXiv Detail & Related papers (2024-08-12T18:10:11Z) - Assessing Code Generation with Intermediate Languages [6.999311675957218]
This study explores the utilization of intermediate languages, including various programming languages, natural language solutions, and pseudo-code.
Our findings reveal that intermediate languages generally exhibit greater efficacy in larger models that have not yet achieved state-of-the-art performance.
arXiv Detail & Related papers (2024-07-07T15:35:41Z) - Quantifying Contamination in Evaluating Code Generation Capabilities of
Language Models [27.24738197172374]
Large language models have achieved remarkable performance on various code generation benchmarks.
There have been growing concerns regarding potential contamination of these benchmarks as they may be leaked into pretraining and finetuning data.
We show that there are substantial overlap between popular code generation benchmarks and open training corpus, and models perform significantly better on the subset of the benchmarks where similar solutions are seen during training.
arXiv Detail & Related papers (2024-03-06T21:45:35Z) - Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities [12.82645410161464]
We evaluate the effectiveness of 16 pre-trained Large Language Models on 5,000 code samples from five diverse security datasets.
Overall, LLMs show modest effectiveness in detecting vulnerabilities, obtaining an average accuracy of 62.8% and F1 score of 0.71 across datasets.
We find that advanced prompting strategies that involve step-by-step analysis significantly improve performance of LLMs on real-world datasets in terms of F1 score (by upto 0.18 on average)
arXiv Detail & Related papers (2023-11-16T13:17:20Z) - AdaCCD: Adaptive Semantic Contrasts Discovery Based Cross Lingual
Adaptation for Code Clone Detection [69.79627042058048]
AdaCCD is a novel cross-lingual adaptation method that can detect cloned codes in a new language without annotations in that language.
We evaluate the cross-lingual adaptation results of AdaCCD by constructing a multilingual code clone detection benchmark consisting of 5 programming languages.
arXiv Detail & Related papers (2023-11-13T12:20:48Z) - Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of
Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z) - X-FACTR: Multilingual Factual Knowledge Retrieval from Pretrained
Language Models [103.75890012041366]
Language models (LMs) have proven surprisingly successful at capturing factual knowledge.
However, studies on LMs' factual representation ability have almost invariably been performed on English.
We create a benchmark of cloze-style probes for 23 typologically diverse languages.
arXiv Detail & Related papers (2020-10-13T05:29:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.