An Empirical Study of False Negatives and Positives of Static Code Analyzers From the Perspective of Historical Issues
- URL: http://arxiv.org/abs/2408.13855v1
- Date: Sun, 25 Aug 2024 14:57:59 GMT
- Title: An Empirical Study of False Negatives and Positives of Static Code Analyzers From the Perspective of Historical Issues
- Authors: Han Cui, Menglei Xie, Ting Su, Chengyu Zhang, Shin Hwei Tan
- Abstract summary: We conduct the first systematic study on a broad range of 350 historical issues of false negatives (FNs) and false positives (FPs) from three popular static code analyzers.
This strategy successfully found 14 new issues of FNs/FPs, 11 of which have been confirmed and 9 of which have already been fixed by the developers.
- Score: 6.463945330904755
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Static code analyzers are widely used to help find program flaws. In practice, however, their effectiveness and usability are affected by the problems of false negatives (FNs) and false positives (FPs). This paper investigates the FNs and FPs of such analyzers from a new perspective, i.e., by examining the historical issues of FNs and FPs reported by maintainers, users, and researchers in the analyzers' issue repositories -- each of these issues manifested as an FN or FP of an analyzer in the past and has already been confirmed and fixed by the analyzer's developers. To this end, we conduct the first systematic study on a broad range of 350 historical issues of FNs/FPs from three popular static code analyzers (i.e., PMD, SpotBugs, and SonarQube). All these issues have been confirmed and fixed by the developers. We investigated the issues' root causes and the characteristics of the corresponding issue-triggering programs, which reveals several interesting new findings and implications for mitigating FNs and FPs. Furthermore, guided by some findings of our study, we designed a metamorphic testing strategy to find FNs and FPs. This strategy successfully found 14 new issues of FNs/FPs, 11 of which have been confirmed and 9 of which have already been fixed by the developers. Our further manual investigation of the studied analyzers revealed one rule specification issue and four additional FNs/FPs due to weaknesses of the implemented static analysis. We have made all the artifacts (datasets and tools) publicly available at https://zenodo.org/doi/10.5281/zenodo.11525129.
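To make the metamorphic idea concrete, the sketch below pairs a seed Java program containing an obvious null dereference with a behaviour-preserving variant. This is only an illustrative assumption of how such a pair could look; the class and method names and the specific rewrite are hypothetical and are not the transformations actually used in the paper. The metamorphic oracle is simply that an analyzer's verdict should not change between the two equivalent versions.

```java
// Hypothetical metamorphic pair for probing a static analyzer.
// Oracle: an analyzer that reports a null-dereference warning on seed()
// should report the same warning on the equivalent variant(); any
// disagreement is a candidate false negative or false positive.
public class MetamorphicPair {

    // Seed program: s is guaranteed to be null on the dereference path.
    static int seed(boolean flag) {
        String s = null;
        if (flag) {
            s = "hello";
        }
        if (!flag) {
            return s.length();   // NullPointerException when flag == false
        }
        return 0;
    }

    // Variant: the same logic expressed with a conditional expression.
    // The rewrite preserves behaviour, so any null-dereference finding
    // should be preserved as well.
    static int variant(boolean flag) {
        String s = flag ? "hello" : null;
        if (!flag) {
            return s.length();   // same guaranteed NullPointerException
        }
        return 0;
    }

    public static void main(String[] args) {
        // Executing either method with 'false' demonstrates the real defect.
        try {
            seed(false);
        } catch (NullPointerException e) {
            System.out.println("seed: NPE as expected");
        }
        try {
            variant(false);
        } catch (NullPointerException e) {
            System.out.println("variant: NPE as expected");
        }
    }
}
```

In such a setup, each version would be analyzed by PMD, SpotBugs, or SonarQube and the sets of reported rule identifiers compared; a rule that fires on only one of the two equivalent versions points to a candidate FN or FP worth reporting to the analyzer's developers.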
Related papers
- Eliminating Position Bias of Language Models: A Mechanistic Approach [119.34143323054143]
Position bias has proven to be a prevalent issue of modern language models (LMs).
Our mechanistic analysis attributes the position bias to two components employed in nearly all state-of-the-art LMs: causal attention and relative positional encodings.
By eliminating position bias, models achieve better performance and reliability in downstream tasks, including LM-as-a-judge, retrieval-augmented QA, molecule generation, and math reasoning.
arXiv Detail & Related papers (2024-07-01T09:06:57Z) - Understanding and Detecting Annotation-Induced Faults of Static Analyzers [4.824956210843882]
This paper presents the first comprehensive study of annotation-induced faults (AIF).
We analyzed 246 issues in six popular open-source static analyzers (i.e., PMD, SpotBugs, CheckStyle, Infer, SonarQube, and Soot).
arXiv Detail & Related papers (2024-02-22T08:09:01Z) - How Dataflow Diagrams Impact Software Security Analysis: an Empirical Experiment [5.6169596483204085]
We present the findings of an empirical experiment conducted to investigate DFDs’ impact on the performance of analysts in a security analysis setting.
We found that the participants performed significantly better in answering the analysis tasks correctly in the model-supported condition.
We identified three open challenges of using DFDs for security analysis based on the insights gained in the experiment.
arXiv Detail & Related papers (2024-01-09T09:22:35Z) - E&V: Prompting Large Language Models to Perform Static Analysis by Pseudo-code Execution and Verification [7.745665775992235]
Large Language Models (LLMs) offer new capabilities for software engineering tasks.
LLMs simulate the execution of pseudo-code, effectively conducting static analysis encoded in the pseudo-code with minimal human effort.
E&V includes a verification process for pseudo-code execution without needing an external oracle.
arXiv Detail & Related papers (2023-12-13T19:31:00Z) - The Hitchhiker's Guide to Program Analysis: A Journey with Large Language Models [18.026567399243]
Large Language Models (LLMs) offer a promising alternative to static analysis.
In this paper, we take a deep dive into the open space of LLM-assisted static analysis.
We develop LLift, a fully automated framework that interfaces with both a static analysis tool and an LLM.
arXiv Detail & Related papers (2023-08-01T02:57:43Z) - Pre-trained Embeddings for Entity Resolution: An Experimental Analysis [Experiment, Analysis & Benchmark] [65.11858854040544]
We perform a thorough experimental analysis of 12 popular language models over 17 established benchmark datasets.
First, we assess their vectorization overhead for converting all input entities into dense embedding vectors.
Second, we investigate their blocking performance, performing a detailed scalability analysis, and comparing them with the state-of-the-art deep learning-based blocking method.
Third, we conclude with their relative performance for both supervised and unsupervised matching.
arXiv Detail & Related papers (2023-04-24T08:53:54Z) - Consistency Analysis of ChatGPT [65.268245109828]
This paper investigates the trustworthiness of ChatGPT and GPT-4 regarding logically consistent behaviour.
Our findings suggest that while both models appear to show an enhanced language understanding and reasoning ability, they still frequently fall short of generating logically consistent predictions.
arXiv Detail & Related papers (2023-03-11T01:19:01Z) - An Empirical Study on Bug Severity Estimation using Source Code Metrics and Static Analysis [0.8621608193534838]
We study 3,358 buggy methods with different severity labels from 19 Java open-source projects.
Results show that code metrics are useful in predicting buggy code, but they cannot estimate the severity level of the bugs.
Our categorization shows that Security bugs have high severity in most cases while Edge/Boundary faults have low severity.
arXiv Detail & Related papers (2022-06-26T17:07:23Z) - Competency Problems: On Finding and Removing Artifacts in Language Data [50.09608320112584]
We argue that for complex language understanding tasks, all simple feature correlations are spurious.
We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account.
arXiv Detail & Related papers (2021-04-17T21:34:10Z) - Sentiment Analysis Based on Deep Learning: A Comparative Study [69.09570726777817]
The study of public opinion can provide us with valuable information.
The efficiency and accuracy of sentiment analysis are being hindered by the challenges encountered in natural language processing.
This paper reviews the latest studies that have employed deep learning to solve sentiment analysis problems.
arXiv Detail & Related papers (2020-06-05T16:28:10Z) - The Curse of Performance Instability in Analysis Datasets: Consequences, Source, and Suggestions [93.62888099134028]
We find that the performance of state-of-the-art models on Natural Language Inference (NLI) and Reading Comprehension (RC) analysis/stress sets can be highly unstable.
This raises three questions: (1) How will the instability affect the reliability of the conclusions drawn based on these analysis sets?
We give both theoretical explanations and empirical evidence regarding the source of the instability.
arXiv Detail & Related papers (2020-04-28T15:41:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.