Related papers: Evaluating Line-level Localization Ability of Learning-based Code Vulnerability Detection Models

Evaluating Line-level Localization Ability of Learning-based Code Vulnerability Detection Models

URL: http://arxiv.org/abs/2510.11202v1
Date: Mon, 13 Oct 2025 09:34:40 GMT
Title: Evaluating Line-level Localization Ability of Learning-based Code Vulnerability Detection Models
Authors: Marco Pintore, Giorgio Piras, Angelo Sotgiu, Maura Pintor, Battista Biggio,
Abstract summary: We propose an explainability-based evaluation procedure for vulnerability detectors.<n>Our approach, defined as Detection Alignment (DA), quantifies the agreement between the input source code lines.<n>We show how the predictions of such models are consistently biased by non-vulnerable lines.
Score: 9.543689542888599
License: http://creativecommons.org/licenses/by/4.0/
Abstract: To address the extremely concerning problem of software vulnerability, system security is often entrusted to Machine Learning (ML) algorithms. Despite their now established detection capabilities, such models are limited by design to flagging the entire input source code function as vulnerable, rather than precisely localizing the concerned code lines. However, the detection granularity is crucial to support human operators during software development, ensuring that such predictions reflect the true code semantics to help debug, evaluate, and fix the detected vulnerabilities. To address this issue, recent work made progress toward improving the detector's localization ability, thus narrowing down the vulnerability detection "window" and providing more fine-grained predictions. Such approaches, however, implicitly disregard the presence of spurious correlations and biases in the data, which often predominantly influence the performance of ML algorithms. In this work, we investigate how detectors comply with this requirement by proposing an explainability-based evaluation procedure. Our approach, defined as Detection Alignment (DA), quantifies the agreement between the input source code lines that most influence the prediction and the actual localization of the vulnerability as per the ground truth. Through DA, which is model-agnostic and adaptable to different detection tasks, not limited to our use case, we analyze multiple learning-based vulnerability detectors and datasets. As a result, we show how the predictions of such models are consistently biased by non-vulnerable lines, ultimately highlighting the high impact of biases and spurious correlations. The code is available at https://github.com/pralab/vuln-localization-eval.

Related papers

Adapting Language Balance in Code-Switching Speech [60.296574524609575]
Large foundational models still struggle against code-switching test cases.<n>We use differentiable surrogates to mitigate context bias during generation.<n>Experiments with Arabic and Chinese-English showed that the models are able to predict the switching places more correctly.
arXiv Detail & Related papers (2025-10-21T15:23:55Z)
VulAgent: Hypothesis-Validation based Multi-Agent Vulnerability Detection [55.957275374847484]
VulAgent is a multi-agent vulnerability detection framework based on hypothesis validation.<n>It implements a semantics-sensitive, multi-view detection pipeline, each aligned to a specific analysis perspective.<n>On average, VulAgent improves overall accuracy by 6.6%, increases the correct identification rate of vulnerable--fixed code pairs by up to 450%, and reduces the false positive rate by about 36%.
arXiv Detail & Related papers (2025-09-15T02:25:38Z)
Lie Detector: Unified Backdoor Detection via Cross-Examination Framework [68.45399098884364]
We propose a unified backdoor detection framework in the semi-honest setting.<n>Our method achieves superior detection performance, improving accuracy by 5.4%, 1.6%, and 11.9% over SoTA baselines.<n> Notably, it is the first to effectively detect backdoors in multimodal large language models.
arXiv Detail & Related papers (2025-03-21T06:12:06Z)
Are We Learning the Right Features? A Framework for Evaluating DL-Based Software Vulnerability Detection Solutions [3.204048014949849]
This paper aims to provide the foundation for properly evaluating the research in this domain.<n>We analyze vulnerability datasets for the syntactic and semantic features of code that contribute to vulnerability.<n>We provide a novel, uniform representation to capture both sets of features, and use this representation to detect the presence of both vulnerability and spurious features in code.
arXiv Detail & Related papers (2025-01-23T00:32:15Z)
Analyzing Adversarial Inputs in Deep Reinforcement Learning [53.3760591018817]
We present a comprehensive analysis of the characterization of adversarial inputs, through the lens of formal verification. We introduce a novel metric, the Adversarial Rate, to classify models based on their susceptibility to such perturbations. Our analysis empirically demonstrates how adversarial inputs can affect the safety of a given DRL system with respect to such perturbations.
arXiv Detail & Related papers (2024-02-07T21:58:40Z)
Uncertainty Awareness of Large Language Models Under Code Distribution Shifts: A Benchmark Study [14.507068647009602]
Large Language Models (LLMs) have been widely employed in programming language analysis to enhance human productivity. Their reliability can be compromised by various code distribution shifts, leading to inconsistent outputs. Probability methods are known to mitigate such impact through uncertainty calibration and estimation.
arXiv Detail & Related papers (2024-01-12T00:00:32Z)
Beyond Fidelity: Explaining Vulnerability Localization of Learning-based Detectors [10.316819421902363]
Vulnerability detectors based on deep learning (DL) models have proven their effectiveness in recent years. The shroud of opacity surrounding the decision-making process of these detectors makes it difficult for security analysts to comprehend. We evaluate the performance of ten explanation approaches for vulnerability detectors based on graph and sequence representations.
arXiv Detail & Related papers (2024-01-05T07:37:35Z)
Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling [69.83976050879318]
In large language models (LLMs), identifying sources of uncertainty is an important step toward improving reliability, trustworthiness, and interpretability. In this paper, we introduce an uncertainty decomposition framework for LLMs, called input clarification ensembling. Our approach generates a set of clarifications for the input, feeds them into an LLM, and ensembles the corresponding predictions.
arXiv Detail & Related papers (2023-11-15T05:58:35Z)
A Call to Reflect on Evaluation Practices for Failure Detection in Image Classification [0.491574468325115]
We present a large-scale empirical study for the first time enabling benchmarking confidence scoring functions. The revelation of a simple softmax response baseline as the overall best performing method underlines the drastic shortcomings of current evaluation.
arXiv Detail & Related papers (2022-11-28T12:25:27Z)
A Review of Uncertainty Calibration in Pretrained Object Detectors [5.440028715314566]
We investigate the uncertainty calibration properties of different pretrained object detection architectures in a multi-class setting. We propose a framework to ensure a fair, unbiased, and repeatable evaluation. We deliver novel insights into why poor detector calibration emerges.
arXiv Detail & Related papers (2022-10-06T14:06:36Z)
Self-Supervised Training with Autoencoders for Visual Anomaly Detection [61.62861063776813]
We focus on a specific use case in anomaly detection where the distribution of normal samples is supported by a lower-dimensional manifold. We adapt a self-supervised learning regime that exploits discriminative information during training but focuses on the submanifold of normal examples. We achieve a new state-of-the-art result on the MVTec AD dataset -- a challenging benchmark for visual anomaly detection in the manufacturing domain.
arXiv Detail & Related papers (2022-06-23T14:16:30Z)
Accurate and Robust Feature Importance Estimation under Distribution Shifts [49.58991359544005]
PRoFILE is a novel feature importance estimation method. We show significant improvements over state-of-the-art approaches, both in terms of fidelity and robustness.
arXiv Detail & Related papers (2020-09-30T05:29:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.