SkipAnalyzer: A Tool for Static Code Analysis with Large Language Models
- URL: http://arxiv.org/abs/2310.18532v2
- Date: Mon, 18 Dec 2023 02:23:19 GMT
- Title: SkipAnalyzer: A Tool for Static Code Analysis with Large Language Models
- Authors: Mohammad Mahdi Mohajer, Reem Aleithan, Nima Shiri Harzevili, Moshi
Wei, Alvine Boaye Belle, Hung Viet Pham, Song Wang
- Abstract summary: SkipAnalyzer is a large language model (LLM)-powered tool for static code analysis.
As a proof-of-concept, SkipAnalyzer is built on ChatGPT, which has exhibited outstanding performance in various software engineering tasks.
- Score: 12.21559364043576
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We introduce SkipAnalyzer, a large language model (LLM)-powered tool for
static code analysis. SkipAnalyzer has three components: 1) an LLM-based static
bug detector that scans source code and reports specific types of bugs, 2) an
LLM-based false-positive filter that can identify false-positive bugs in the
results of static bug detectors (e.g., the result of step 1) to improve
detection accuracy, and 3) an LLM-based patch generator that can generate
patches for the detected bugs above. As a proof-of-concept, SkipAnalyzer is
built on ChatGPT, which has exhibited outstanding performance in various
software engineering tasks. To evaluate SkipAnalyzer, we focus on two typical
and critical bug types targeted by static bug detection, namely Null
Dereference and Resource Leak. We employ Infer to collect instances of these
two bug types from 10 open-source projects, yielding an experimental dataset of
222 Null Dereference bugs and 46 Resource Leak bugs. Our study demonstrates
that SkipAnalyzer achieves strong performance on all three static analysis
tasks: bug detection, false-positive warning removal, and bug repair. In
static bug detection, SkipAnalyzer achieves accuracy values of up to 68.37% for
detecting Null Dereference bugs and 76.95% for detecting Resource Leak bugs,
improving the precision of the current leading bug detector, Infer, by 12.86%
and 43.13%, respectively. For removing false-positive warnings, SkipAnalyzer
can reach a precision of up to 93.88% for Null Dereference bugs and 63.33% for
Resource Leak bugs. Additionally, SkipAnalyzer surpasses state-of-the-art
false-positive warning removal tools. Furthermore, in bug repair, SkipAnalyzer
can generate syntactically correct patches to fix its detected bugs with a
success rate of up to 97.30%.
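The abstract describes a three-stage pipeline (LLM-based bug detection, LLM-based false-positive filtering, and LLM-based patch generation) driven by warnings gathered with Infer. No implementation details are given above, so the sketch below is only an illustration of that workflow under stated assumptions: `ask_llm` is a hypothetical placeholder for whatever ChatGPT client is used, the prompt wording is invented, and the `report.json` field names (`bug_type`, `file`, `line`) reflect Infer's usual output format rather than anything stated in the abstract.

```python
import json
from pathlib import Path

# Hypothetical placeholder for an LLM chat client (e.g. ChatGPT); not part of the paper.
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an actual LLM client here")

# The two bug types studied in the paper, as Infer names them.
TARGET_BUGS = {"NULL_DEREFERENCE", "RESOURCE_LEAK"}

def load_infer_warnings(report_path: str) -> list[dict]:
    """Read Infer's report.json and keep only the target bug types."""
    warnings = json.loads(Path(report_path).read_text())
    return [w for w in warnings if w.get("bug_type") in TARGET_BUGS]

def detect(source: str, bug_type: str) -> bool:
    """Stage 1: LLM-based static bug detector over a code snippet."""
    answer = ask_llm(
        f"Does the following Java method contain a {bug_type} bug? "
        f"Answer YES or NO.\n\n{source}"
    )
    return answer.strip().upper().startswith("YES")

def is_false_positive(source: str, warning: dict) -> bool:
    """Stage 2: LLM-based false-positive filter applied to a detector warning."""
    answer = ask_llm(
        f"A static analyzer reported a {warning['bug_type']} at line {warning['line']}. "
        f"Is this warning a false positive? Answer YES or NO.\n\n{source}"
    )
    return answer.strip().upper().startswith("YES")

def generate_patch(source: str, warning: dict) -> str:
    """Stage 3: LLM-based patch generator for a confirmed bug."""
    return ask_llm(
        f"Fix the {warning['bug_type']} bug in the following Java method and "
        f"return only the patched code.\n\n{source}"
    )

def run_pipeline(report_path: str, read_source) -> list[str]:
    """Chain the three stages over Infer's warnings; return candidate patches."""
    patches = []
    for warning in load_infer_warnings(report_path):
        source = read_source(warning["file"])
        if detect(source, warning["bug_type"]) and not is_false_positive(source, warning):
            patches.append(generate_patch(source, warning))
    return patches
```

Infer itself is typically invoked as `infer run -- <build command>`, which writes its warnings to `infer-out/report.json`; a call such as `run_pipeline("infer-out/report.json", lambda f: Path(f).read_text())` would then chain the three stages over the Null Dereference and Resource Leak warnings.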
Related papers
- Leveraging Stack Traces for Spectrum-based Fault Localization in the Absence of Failing Tests [44.13331329339185]
We introduce a new approach, SBEST, that integrates stack trace data with test coverage to enhance fault localization.
Our approach shows a significant improvement, increasing Mean Average Precision (MAP) by 32.22% and Mean Reciprocal Rank (MRR) by 17.43% over traditional stack trace ranking methods.
arXiv Detail & Related papers (2024-05-01T15:15:52Z)
- GPT-HateCheck: Can LLMs Write Better Functional Tests for Hate Speech Detection? [50.53312866647302]
HateCheck is a suite for testing fine-grained model functionalities on synthesized data.
We propose GPT-HateCheck, a framework to generate more diverse and realistic functional tests from scratch.
Crowd-sourced annotation demonstrates that the generated test cases are of high quality.
arXiv Detail & Related papers (2024-02-23T10:02:01Z)
- DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a benchmark for evaluating the debugging capability of Large Language Models (LLMs).
It covers four major bug categories and 18 minor types in C++, Java, and Python.
We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z)
- Auto-labelling of Bug Report using Natural Language Processing [0.0]
Rule- and query-based solutions recommend a long list of potentially similar bug reports with no clear ranking.
In this paper, we propose a solution that combines several NLP techniques.
It uses a custom data transformer, a deep neural network, and a non-generalizing machine learning method to retrieve existing identical bug reports.
arXiv Detail & Related papers (2022-12-13T02:32:42Z)
- Infrared: A Meta Bug Detector [10.541969253100815]
We propose a new approach, called meta bug detection, which offers three crucial advantages over existing learning-based bug detectors.
Our evaluation shows our meta bug detector (MBD) is effective in catching a variety of bugs including null pointer dereference, array index out-of-bound, file handle leak, and even data races in concurrent programs.
arXiv Detail & Related papers (2022-09-18T09:08:51Z)
- An Empirical Study on Bug Severity Estimation using Source Code Metrics and Static Analysis [0.8621608193534838]
We study 3,358 buggy methods with different severity labels from 19 Java open-source projects.
Results show that code metrics are useful in predicting buggy code, but they cannot estimate the severity level of the bugs.
Our categorization shows that Security bugs have high severity in most cases while Edge/Boundary faults have low severity.
arXiv Detail & Related papers (2022-06-26T17:07:23Z)
- Learning to Reduce False Positives in Analytic Bug Detectors [12.733531603080674]
We propose a Transformer-based learning approach to identify false positive bug warnings.
We demonstrate that our models can improve the precision of static analysis by 17.5%.
arXiv Detail & Related papers (2022-03-08T04:26:26Z)
- D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis [55.15995704119158]
We propose D2A, a differential analysis based approach to label issues reported by static analysis tools.
We use D2A to generate a large labeled dataset to train models for vulnerability identification.
arXiv Detail & Related papers (2021-02-16T07:46:53Z)
- Beyond Accuracy: Behavioral Testing of NLP models with CheckList [66.42971817954806]
CheckList is a task-agnostic methodology for testing NLP models.
CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation.
In a user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.
arXiv Detail & Related papers (2020-05-08T15:48:31Z)
- TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [80.38130122127882]
TACRED is one of the largest and most widely used crowdsourced datasets in Relation Extraction (RE).
In this paper, we investigate the questions: Have we reached a performance ceiling or is there still room for improvement?
We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled.
arXiv Detail & Related papers (2020-04-30T15:07:37Z)