Out of Distribution, Out of Luck: How Well Can LLMs Trained on Vulnerability Datasets Detect Top 25 CWE Weaknesses?
- URL: http://arxiv.org/abs/2507.21817v1
- Date: Tue, 29 Jul 2025 13:51:46 GMT
- Title: Out of Distribution, Out of Luck: How Well Can LLMs Trained on Vulnerability Datasets Detect Top 25 CWE Weaknesses?
- Authors: Yikun Li, Ngoc Tan Bui, Ting Zhang, Martin Weyssow, Chengran Yang, Xin Zhou, Jinfeng Jiang, Junkai Chen, Huihui Huang, Huu Hung Nguyen, Chiok Yew Ho, Jie Tan, Ruiyin Li, Yide Yin, Han Wei Ang, Frank Liauw, Eng Lieh Ouh, Lwin Khin Shar, David Lo
- Abstract summary: First, we introduce a manually curated test dataset, BenchVul, covering the MITRE Top 25 Most Dangerous CWEs. Second, we construct a high-quality training dataset, TitanVul, comprising 35,045 functions by aggregating seven public sources. Third, we propose a Realistic Vulnerability Generation (RVG) framework, which synthesizes context-aware vulnerability examples through simulated development workflows.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated vulnerability detection research has made substantial progress, yet its real-world impact remains limited. Current vulnerability datasets suffer from issues including label inaccuracy rates of 20-71%, extensive duplication, and poor coverage of critical CWE types. These issues create a significant "generalization gap" where models achieve misleading self-testing performance (measured on held-out data from the same dataset used for training) by exploiting spurious correlations rather than learning true vulnerability patterns. Our analysis reveals that many models experience substantial performance drops of up to 40.6% when evaluated on independent data, sometimes underperforming random guessing. To address these limitations, we present a three-part solution. First, we introduce a manually curated test dataset, BenchVul, covering the MITRE Top 25 Most Dangerous CWEs. Second, we construct a high-quality training dataset, TitanVul, comprising 35,045 functions by aggregating seven public sources and applying deduplication and validation using a novel multi-agent LLM framework. Third, we propose a Realistic Vulnerability Generation (RVG) framework, which synthesizes context-aware vulnerability examples for underrepresented but critical CWE types through simulated development workflows. Our evaluation shows the strengths of each component in closing the generalization gap. First, BenchVul shows the limitations of self-testing: models trained on existing datasets, such as BigVul and PrimeVul, experience performance drops on BenchVul (from 0.776 to 0.519 and from 0.567 to 0.337). Second, training models on TitanVul demonstrates improved generalization, with model performance increasing from 0.584 when evaluated on the same dataset to 0.767 when tested on BenchVul. Third, supplementing TitanVul with RVG-generated data yields further gains, increasing model performance by 14.0% to 0.874.
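To make the "generalization gap" concrete, below is a minimal sketch (not the paper's pipeline) of how one might measure it: train a simple function-level classifier on one dataset, then compare F1 on a held-out split of that same dataset (self-testing) against F1 on an independent benchmark such as BenchVul. The file names, JSONL layout, and the TF-IDF + logistic regression baseline are illustrative assumptions only.

```python
# Sketch: self-test F1 vs. independent-benchmark F1 for a simple baseline.
import json

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def load_functions(path):
    """Assumed layout: one JSON object per line with 'func' (source code) and 'label' (0/1)."""
    with open(path) as fh:
        rows = [json.loads(line) for line in fh]
    return [r["func"] for r in rows], [r["label"] for r in rows]


# Hypothetical file names for a training corpus and an independent benchmark.
train_code, train_labels = load_functions("titanvul_train.jsonl")
bench_code, bench_labels = load_functions("benchvul_test.jsonl")

# Self-test split: held-out data drawn from the same dataset used for training.
fit_code, self_code, fit_labels, self_labels = train_test_split(
    train_code, train_labels, test_size=0.2, random_state=0, stratify=train_labels
)

model = make_pipeline(
    TfidfVectorizer(max_features=50_000, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(fit_code, fit_labels)

self_f1 = f1_score(self_labels, model.predict(self_code))
bench_f1 = f1_score(bench_labels, model.predict(bench_code))
print(f"self-test F1: {self_f1:.3f}, benchmark F1: {bench_f1:.3f}, gap: {self_f1 - bench_f1:.3f}")
```

A large positive gap reproduces, in miniature, the drop the abstract reports when moving from self-testing to BenchVul.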
Related papers
- VLAI: A RoBERTa-Based Model for Automated Vulnerability Severity Classification [49.1574468325115]
Built on RoBERTa, VLAI is fine-tuned on over 600,000 real-world vulnerabilities. The model and dataset are open-source and integrated into the Vulnerability-Lookup service.
arXiv Detail & Related papers (2025-07-04T14:28:14Z) - SecVulEval: Benchmarking LLMs for Real-World C/C++ Vulnerability Detection [8.440793630384546]
Large Language Models (LLMs) have shown promise in software engineering tasks, but evaluating their effectiveness in vulnerability detection is challenging due to the lack of high-quality datasets. This benchmark includes 25,440 function samples covering 5,867 unique CVEs in C/C++ projects from 1999 to 2024.
arXiv Detail & Related papers (2025-05-26T11:06:03Z) - Benchmarking Reasoning Robustness in Large Language Models [76.79744000300363]
This paper introduces a novel benchmark, termed Math-RoB, that exploits hallucinations triggered by missing information to expose reasoning gaps. We find significant performance degradation on novel or incomplete data; these findings highlight a reliance on recall over rigorous logical inference.
arXiv Detail & Related papers (2025-03-06T15:36:06Z) - Towards Robust Universal Information Extraction: Benchmark, Evaluation, and Solution [66.11004226578771]
Existing robust benchmark datasets have two key limitations: they generate only a limited range of perturbations for a single Information Extraction (IE) task. Considering the powerful generation capabilities of Large Language Models (LLMs), we introduce a new benchmark dataset for Robust UIE, called RUIE-Bench. We show that training with only 15% of the data leads to an average 7.5% relative performance improvement across three IE tasks.
arXiv Detail & Related papers (2025-03-05T05:39:29Z) - CleanVul: Automatic Function-Level Vulnerability Detection in Code Commits Using LLM Heuristics [12.053158610054911]
This paper introduces the first methodology that uses a Large Language Model (LLM) with heuristic enhancement to automatically identify vulnerability-fixing changes within vulnerability-fixing commits (VFCs). The resulting tool, VulSifter, was applied in a large-scale study in which we crawled 127,063 repositories on GitHub. We then developed CleanVul, a high-quality dataset comprising 8,203 functions.
arXiv Detail & Related papers (2024-11-26T09:51:55Z) - VulScribeR: Exploring RAG-based Vulnerability Augmentation with LLMs [19.45598962972431]
VulScribeR is a novel solution that leverages carefully curated prompt templates to augment vulnerable datasets. Our approach beats two SOTA methods, VulGen and VGX, as well as Random Oversampling (ROS), by 27.48%, 27.93%, and 15.41% in F1-score with 5K generated vulnerable samples. It demonstrates feasibility for large-scale data augmentation by generating 1K samples for as little as US$1.88.
arXiv Detail & Related papers (2024-08-07T23:22:58Z) - Revisiting the Performance of Deep Learning-Based Vulnerability Detection on Realistic Datasets [4.385369356819613]
This paper introduces Real-Vul, a dataset representing real-world scenarios for evaluating vulnerability detection models.
Evaluating DeepWukong, LineVul, ReVeal, and IVDetect on Real-Vul shows a significant drop in performance, with precision decreasing by up to 95 percentage points and F1 scores by up to 91 points.
Overfitting is identified as a key issue, and an augmentation technique is proposed, potentially improving performance by up to 30%.
arXiv Detail & Related papers (2024-07-03T13:34:30Z) - No "Zero-Shot" Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance [68.18779562801762]
Multimodal models require exponentially more data to achieve linear improvements in downstream "zero-shot" performance. Our study reveals an exponential need for training data, which implies that the key to "zero-shot" generalization capabilities under large-scale training paradigms remains to be found.
arXiv Detail & Related papers (2024-04-04T17:58:02Z) - Vulnerability Detection with Code Language Models: How Far Are We? [40.455600722638906]
PrimeVul is a new dataset for training and evaluating code LMs for vulnerability detection.
It incorporates a novel set of data labeling techniques that achieve comparable label accuracy to human-verified benchmarks.
It also implements a rigorous data de-duplication and chronological data splitting strategy to mitigate data leakage issues.
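The chronological splitting that PrimeVul uses against data leakage can be illustrated with a short sketch: order samples by commit date and cut so that test data strictly postdates training data. This is the general idea only, not PrimeVul's exact procedure, and the record layout is an assumption for illustration.

```python
# Sketch of a chronological train/valid/test split keyed on commit dates.
from datetime import datetime


def chronological_split(samples, train_frac=0.8, valid_frac=0.1):
    """Split dicts carrying an ISO 'commit_date' so later commits never leak into training."""
    ordered = sorted(samples, key=lambda s: datetime.fromisoformat(s["commit_date"]))
    n = len(ordered)
    train_end = int(n * train_frac)
    valid_end = int(n * (train_frac + valid_frac))
    return ordered[:train_end], ordered[train_end:valid_end], ordered[valid_end:]


# Illustrative, already de-duplicated function-level records.
samples = [
    {"func": "int parse(char *buf) { /* ... */ }", "label": 1, "commit_date": "2019-03-02"},
    {"func": "void render(void) { /* ... */ }", "label": 0, "commit_date": "2023-11-18"},
]
train, valid, test = chronological_split(samples)
```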
arXiv Detail & Related papers (2024-03-27T14:34:29Z) - Benchmarking Zero-Shot Robustness of Multimodal Foundation Models: A Pilot Study [61.65123150513683]
Multimodal foundation models, such as CLIP, produce state-of-the-art zero-shot results.
It is reported that these models close the robustness gap by matching the performance of supervised models trained on ImageNet.
We show that CLIP leads to a significant robustness drop compared to supervised ImageNet models on our benchmark.
arXiv Detail & Related papers (2024-03-15T17:33:49Z) - Conservative Prediction via Data-Driven Confidence Minimization [70.93946578046003]
In safety-critical applications of machine learning, it is often desirable for a model to be conservative.
We propose the Data-Driven Confidence Minimization framework, which minimizes confidence on an uncertainty dataset.
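As a rough illustration of the confidence-minimization idea (a simplified reading, not the paper's exact objective), the loss below combines standard cross-entropy on labelled data with a penalty on the model's maximum softmax probability over an auxiliary "uncertainty" batch; `model`, the batches, and the weight `lam` are assumptions for the sketch.

```python
# Sketch: cross-entropy on labelled data plus a confidence penalty on an uncertainty set.
import torch.nn.functional as F


def confidence_minimization_loss(model, labelled_x, labelled_y, uncertainty_x, lam=0.5):
    # Usual supervised term on labelled, in-distribution examples.
    ce = F.cross_entropy(model(labelled_x), labelled_y)

    # Mean maximum softmax probability on the uncertainty set; minimizing it
    # makes the model less confident (more conservative) on that data.
    probs = F.softmax(model(uncertainty_x), dim=-1)
    confidence = probs.max(dim=-1).values.mean()

    return ce + lam * confidence
```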
arXiv Detail & Related papers (2023-06-08T07:05:36Z) - TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [80.38130122127882]
TACRED is one of the largest, most widely used crowdsourced datasets in Relation Extraction (RE).
In this paper, we investigate the question: have we reached a performance ceiling, or is there still room for improvement?
We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled.
arXiv Detail & Related papers (2020-04-30T15:07:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.