A Large-Scale Collection Of (Non-)Actionable Static Code Analysis Reports
- URL: http://arxiv.org/abs/2511.10323v1
- Date: Fri, 14 Nov 2025 01:45:33 GMT
- Title: A Large-Scale Collection Of (Non-)Actionable Static Code Analysis Reports
- Authors: Dávid Kószó, Tamás Aladics, Rudolf Ferenc, Péter Hegedűs,
- Abstract summary: Static Code Analysis (SCA) tools often generate an overwhelming number of warnings, many of which are non-actionable.<n>This overload of alerts leads to alert fatigue'', a phenomenon where developers become desensitized to warnings, potentially overlooking critical issues and ultimately hindering productivity and code quality.<n>We introduce a novel methodology for collecting and categorizing SCA warnings, effectively distinguishing actionable from non-actionable ones.<n>We generate a large-scale dataset of over 1 million entries of Java source code warnings, named NASCAR: (Non-)Actionable Static Code Analysis Reports.
- Score: 0.05599792629509228
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Static Code Analysis (SCA) tools, while invaluable for identifying potential coding problems, functional bugs, or vulnerabilities, often generate an overwhelming number of warnings, many of which are non-actionable. This overload of alerts leads to ``alert fatigue'', a phenomenon where developers become desensitized to warnings, potentially overlooking critical issues and ultimately hindering productivity and code quality. Analyzing these warnings and training machine learning models to identify and filter them requires substantial datasets, which are currently scarce, particularly for Java. This scarcity impedes efforts to improve the accuracy and usability of SCA tools and mitigate the effects of alert fatigue. In this paper, we address this gap by introducing a novel methodology for collecting and categorizing SCA warnings, effectively distinguishing actionable from non-actionable ones. We further leverage this methodology to generate a large-scale dataset of over 1 million entries of Java source code warnings, named NASCAR: (Non-)Actionable Static Code Analysis Reports. To facilitate follow-up research in this domain, we make both the dataset and the tools used to generate it publicly available.
Related papers
- Multi-Agent Taint Specification Extraction for Vulnerability Detection [49.27772068704498]
Static Application Security Testing (SAST) tools using taint analysis are widely viewed as providing higher-quality vulnerability detection results.<n>We present SemTaint, a multi-agent system that strategically combines the semantic understanding of Large Language Models (LLMs) with traditional static program analysis.<n>We integrate SemTaint with CodeQL, a state-of-the-art SAST tool, and demonstrate its effectiveness by detecting 106 of 162 vulnerabilities previously undetectable by CodeQL.
arXiv Detail & Related papers (2026-01-15T21:31:51Z) - The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search [58.8834056209347]
Large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safety guardrails to elicit harmful outputs.<n>We introduce the Correlated Knowledge Attack Agent (CKA-Agent), a dynamic framework that reframes jailbreaking as an adaptive, tree-structured exploration of the target model's knowledge base.
arXiv Detail & Related papers (2025-12-01T07:05:23Z) - From Model to Breach: Towards Actionable LLM-Generated Vulnerabilities Reporting [43.57360781012506]
We show that even the latest open-weight models are vulnerable in the earliest reported vulnerability scenarios.<n>We introduce a new severity metric that reflects the risk posed by an LLM-generated vulnerability.<n>To encourage the mitigation of the most serious and prevalent vulnerabilities, we use PE to define the Model Exposure (ME) score.
arXiv Detail & Related papers (2025-11-06T16:52:27Z) - Weakly Supervised Vulnerability Localization via Multiple Instance Learning [46.980136742826836]
We propose a novel approach called WAVES for WeAkly supervised Vulnerability localization via multiplE inStance learning.<n>WAVES has the capability to determine whether a function is vulnerable (i.e., vulnerability detection) and pinpoint the vulnerable statements.<n>Our approach achieves comparable performance in vulnerability detection and state-of-the-art performance in statement-level vulnerability localization.
arXiv Detail & Related papers (2025-09-14T15:11:39Z) - Trace Gadgets: Minimizing Code Context for Machine Learning-Based Vulnerability Prediction [8.056137513320065]
This work introduces Trace Gadgets, a novel code representation that minimizes code context by removing non-related code.<n>As input for ML models, Trace Gadgets provide a minimal but complete context, thereby improving the detection performance.<n>Our results show that state-of-the-art machine learning models perform best when using Trace Gadgets compared to previous code representations.
arXiv Detail & Related papers (2025-04-18T13:13:39Z) - Vulnerability Detection with Code Language Models: How Far Are We? [40.455600722638906]
PrimeVul is a new dataset for training and evaluating code LMs for vulnerability detection.
It incorporates a novel set of data labeling techniques that achieve comparable label accuracy to human-verified benchmarks.
It also implements a rigorous data de-duplication and chronological data splitting strategy to mitigate data leakage issues.
arXiv Detail & Related papers (2024-03-27T14:34:29Z) - FineWAVE: Fine-Grained Warning Verification of Bugs for Automated Static Analysis Tools [18.927121513404924]
Automated Static Analysis Tools (ASATs) have evolved over time to assist in detecting bugs.
Previous research efforts have explored learning-based methods to validate the reported warnings.
We propose FineWAVE, a learning-based approach that verifies bug-sensitive warnings at a fine-grained granularity.
arXiv Detail & Related papers (2024-03-24T06:21:35Z) - Robust Tiny Object Detection in Aerial Images amidst Label Noise [50.257696872021164]
This study addresses the issue of tiny object detection under noisy label supervision.
We propose a DeNoising Tiny Object Detector (DN-TOD), which incorporates a Class-aware Label Correction scheme.
Our method can be seamlessly integrated into both one-stage and two-stage object detection pipelines.
arXiv Detail & Related papers (2024-01-16T02:14:33Z) - Tracking the Evolution of Static Code Warnings: the State-of-the-Art and
a Better Approach [18.350023994564904]
Static bug detection tools help developers detect problems in the code, including bad programming practices and potential defects.
Recent efforts to integrate static bug detectors in modern software development, such as in code review and continuous integration, are shown to better motivate developers to fix the reported warnings on the fly.
arXiv Detail & Related papers (2022-10-06T03:02:32Z) - Annotation Error Detection: Analyzing the Past and Present for a More
Coherent Future [63.99570204416711]
We reimplement 18 methods for detecting potential annotation errors and evaluate them on 9 English datasets.
We define a uniform evaluation setup including a new formalization of the annotation error detection task.
We release our datasets and implementations in an easy-to-use and open source software package.
arXiv Detail & Related papers (2022-06-05T22:31:45Z) - How to Find Actionable Static Analysis Warnings [28.866251060033537]
We show that effective predictors of such warnings can be created by methods that adjust the decision boundary.
For eight open-source Java projects (CASSANDRA, JMETER, COMMONS, LUCENE-SOLR, ANT, TOMCAT, DERBY) we achieve perfect test results on 4/8 datasets.
arXiv Detail & Related papers (2022-05-21T04:47:02Z) - Learning to Reduce False Positives in Analytic Bug Detectors [12.733531603080674]
We propose a Transformer-based learning approach to identify false positive bug warnings.
We demonstrate that our models can improve the precision of static analysis by 17.5%.
arXiv Detail & Related papers (2022-03-08T04:26:26Z) - Software Vulnerability Detection via Deep Learning over Disaggregated
Code Graph Representation [57.92972327649165]
This work explores a deep learning approach to automatically learn the insecure patterns from code corpora.
Because code naturally admits graph structures with parsing, we develop a novel graph neural network (GNN) to exploit both the semantic context and structural regularity of a program.
arXiv Detail & Related papers (2021-09-07T21:24:36Z) - Assessing Validity of Static Analysis Warnings using Ensemble Learning [4.05739885420409]
Static Analysis (SA) tools are used to identify potential weaknesses in code and fix them in advance, while the code is being developed.
These rules-based static analysis tools generally report a lot of false warnings along with the actual ones.
We propose a Machine Learning (ML)-based learning process that uses source codes, historic commit data, and classifier-ensembles to prioritize the True warnings.
arXiv Detail & Related papers (2021-04-21T19:39:20Z) - D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using
Differential Analysis [55.15995704119158]
We propose D2A, a differential analysis based approach to label issues reported by static analysis tools.
We use D2A to generate a large labeled dataset to train models for vulnerability identification.
arXiv Detail & Related papers (2021-02-16T07:46:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.