Related papers: It Only Gets Worse: Revisiting DL-Based Vulnerability Detectors from a Practical Perspective

It Only Gets Worse: Revisiting DL-Based Vulnerability Detectors from a Practical Perspective

URL: http://arxiv.org/abs/2507.09529v1
Date: Sun, 13 Jul 2025 08:02:56 GMT
Title: It Only Gets Worse: Revisiting DL-Based Vulnerability Detectors from a Practical Perspective
Authors: Yunqian Wang, Xiaohong Li, Yao Zhang, Yuekang Li, Zhiping Zhou, Ruitao Feng,
Abstract summary: VulTegra compares scratch-trained and pre-trained DL models for vulnerability detection.<n>State-of-the-art (SOTA) detectors still suffer from low consistency, limited real-world capabilities, and scalability challenges.
Score: 14.271145160443462
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: With the growing threat of software vulnerabilities, deep learning (DL)-based detectors have gained popularity for vulnerability detection. However, doubts remain regarding their consistency within declared CWE ranges, real-world effectiveness, and applicability across scenarios. These issues may lead to unreliable detection, high false positives/negatives, and poor adaptability to emerging vulnerabilities. A comprehensive analysis is needed to uncover critical factors affecting detection and guide improvements in model design and deployment. In this paper, we present VulTegra, a novel evaluation framework that conducts a multidimensional comparison of scratch-trained and pre-trained-based DL models for vulnerability detection. VulTegra reveals that state-of-the-art (SOTA) detectors still suffer from low consistency, limited real-world capabilities, and scalability challenges. Contrary to common belief, pre-trained models are not consistently better than scratch-trained models but exhibit distinct strengths in specific contexts.Importantly, our study exposes the limitations of relying solely on CWE-based classification and identifies key factors that significantly affect model performance. Experimental results show that adjusting just one such factor consistently improves recall across all seven evaluated detectors, with six also achieving better F1 scores. Our findings provide deeper insights into model behavior and emphasize the need to consider both vulnerability types and inherent code features for effective detection.

Related papers

Lie Detector: Unified Backdoor Detection via Cross-Examination Framework [68.45399098884364]
We propose a unified backdoor detection framework in the semi-honest setting.<n>Our method achieves superior detection performance, improving accuracy by 5.4%, 1.6%, and 11.9% over SoTA baselines.<n> Notably, it is the first to effectively detect backdoors in multimodal large language models.
arXiv Detail & Related papers (2025-03-21T06:12:06Z)
One-for-All Does Not Work! Enhancing Vulnerability Detection by Mixture-of-Experts (MoE) [11.69736955814315]
MoEVD decomposes vulnerability detection into two tasks, CWE type classification and CWE-specific vulnerability detection.<n>By splitting the task, in vulnerability detection, MoEVD allows specific experts to handle distinct types of vulnerabilities instead of handling all vulnerabilities within one model.<n>MoEVD excels across almost all CWE types, improving recall over the best SOTA baseline by 9% to 77.8%.
arXiv Detail & Related papers (2025-01-27T19:25:34Z)
Unsupervised Model Diagnosis [49.36194740479798]
This paper proposes Unsupervised Model Diagnosis (UMO) to produce semantic counterfactual explanations without any user guidance. Our approach identifies and visualizes changes in semantics, and then matches these changes to attributes from wide-ranging text sources.
arXiv Detail & Related papers (2024-10-08T17:59:03Z)
Revisiting the Performance of Deep Learning-Based Vulnerability Detection on Realistic Datasets [4.385369356819613]
This paper introduces Real-Vul, a dataset representing real-world scenarios for evaluating vulnerability detection models. evaluating DeepWukong, LineVul, ReVeal, and IVDetect shows a significant drop in performance, with precision decreasing by up to 95 percentage points and F1 scores by up to 91 points. Overfitting is identified as a key issue, and an augmentation technique is proposed, potentially improving performance by up to 30%.
arXiv Detail & Related papers (2024-07-03T13:34:30Z)
Vulnerability Detection with Code Language Models: How Far Are We? [40.455600722638906]
PrimeVul is a new dataset for training and evaluating code LMs for vulnerability detection. It incorporates a novel set of data labeling techniques that achieve comparable label accuracy to human-verified benchmarks. It also implements a rigorous data de-duplication and chronological data splitting strategy to mitigate data leakage issues.
arXiv Detail & Related papers (2024-03-27T14:34:29Z)
Benchmarking Zero-Shot Robustness of Multimodal Foundation Models: A Pilot Study [61.65123150513683]
multimodal foundation models, such as CLIP, produce state-of-the-art zero-shot results. It is reported that these models close the robustness gap by matching the performance of supervised models trained on ImageNet. We show that CLIP leads to a significant robustness drop compared to supervised ImageNet models on our benchmark.
arXiv Detail & Related papers (2024-03-15T17:33:49Z)
Can An Old Fashioned Feature Extraction and A Light-weight Model Improve Vulnerability Type Identification Performance? [6.423483122892239]
We investigate the problem of vulnerability type identification (VTI) We evaluate the performance of the well-known and advanced pre-trained models for VTI on a large set of vulnerabilities. We introduce a lightweight independent component to refine the predictions of the baseline approach.
arXiv Detail & Related papers (2023-06-26T14:28:51Z)
Learning to Quantize Vulnerability Patterns and Match to Locate Statement-Level Vulnerabilities [19.6975205650411]
A vulnerability codebook is learned, which consists of quantized vectors representing various vulnerability patterns. During inference, the codebook is iterated to match all learned patterns and predict the presence of potential vulnerabilities. Our approach was extensively evaluated on a real-world dataset comprising more than 188,000 C/C++ functions.
arXiv Detail & Related papers (2023-05-26T04:13:31Z)
Improving robustness of jet tagging algorithms with adversarial training [56.79800815519762]
We investigate the vulnerability of flavor tagging algorithms via application of adversarial attacks. We present an adversarial training strategy that mitigates the impact of such simulated attacks.
arXiv Detail & Related papers (2022-03-25T19:57:19Z)
Clustering Effect of (Linearized) Adversarial Robust Models [60.25668525218051]
We propose a novel understanding of adversarial robustness and apply it on more tasks including domain adaption and robustness boosting. Experimental evaluations demonstrate the rationality and superiority of our proposed clustering strategy.
arXiv Detail & Related papers (2021-11-25T05:51:03Z)
Accurate and Robust Feature Importance Estimation under Distribution Shifts [49.58991359544005]
PRoFILE is a novel feature importance estimation method. We show significant improvements over state-of-the-art approaches, both in terms of fidelity and robustness.
arXiv Detail & Related papers (2020-09-30T05:29:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.