D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using
Differential Analysis
- URL: http://arxiv.org/abs/2102.07995v1
- Date: Tue, 16 Feb 2021 07:46:53 GMT
- Title: D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using
Differential Analysis
- Authors: Yunhui Zheng, Saurabh Pujar, Burn Lewis, Luca Buratti, Edward Epstein,
Bo Yang, Jim Laredo, Alessandro Morari, Zhong Su
- Abstract summary: We propose D2A, a differential analysis based approach to label issues reported by static analysis tools.
We use D2A to generate a large labeled dataset to train models for vulnerability identification.
- Score: 55.15995704119158
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Static analysis tools are widely used for vulnerability detection as they
understand programs with complex behavior and millions of lines of code.
Despite their popularity, static analysis tools are known to generate an excess
of false positives. The recent ability of Machine Learning models to understand
programming languages opens new possibilities when applied to static analysis.
However, existing datasets to train models for vulnerability identification
suffer from multiple limitations such as limited bug context, limited size, and
synthetic and unrealistic source code. We propose D2A, a differential analysis
based approach to label issues reported by static analysis tools. The D2A
dataset is built by analyzing version pairs from multiple open source projects.
From each project, we select bug-fixing commits and run static analysis on
the versions before and after such commits. If some issues detected in a
before-commit version disappear in the corresponding after-commit version, they
are very likely to be real bugs that got fixed by the commit. We use D2A to
generate a large labeled dataset to train models for vulnerability
identification. We show that the dataset can be used to build a classifier to
identify possible false alarms among the issues reported by static analysis,
hence helping developers prioritize and investigate potential true positives
first.
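The differential labeling heuristic described above is mechanical enough to sketch in code. The following is a minimal, illustrative sketch, not the authors' implementation: the `Issue` representation, the naive matching key, and `run_static_analyzer` are placeholder assumptions (the paper builds on the Infer static analyzer and matches and deduplicates issues more carefully than this).

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Issue:
    """A single warning from a static analyzer (simplified representation)."""
    bug_type: str    # e.g. "BUFFER_OVERRUN"
    file: str        # file the warning points at
    procedure: str   # enclosing function; more stable across commits than a line number


def run_static_analyzer(checkout_dir: str) -> Set[Issue]:
    """Placeholder: analyze one checked-out version and parse its report.
    This stub is an assumption; D2A uses the Infer analyzer at this step."""
    raise NotImplementedError("hook up a real static analyzer here")


def label_before_commit_issues(before_dir: str, after_dir: str) -> List[Tuple[Issue, int]]:
    """Label issues reported on the version *before* a bug-fixing commit:
    1 (likely real bug) if the issue disappears after the fix,
    0 (likely false positive) if the same issue is still reported."""
    before_issues = run_static_analyzer(before_dir)
    after_issues = run_static_analyzer(after_dir)

    labeled = []
    for issue in before_issues:
        disappeared = issue not in after_issues  # naive match on (bug_type, file, procedure)
        labeled.append((issue, 1 if disappeared else 0))
    return labeled
```

Issues labeled 1 then serve as (noisy) positive examples for the vulnerability classifier mentioned in the abstract, while issues that persist across the fix supply likely false alarms.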
Related papers
- Bayesian Detector Combination for Object Detection with Crowdsourced Annotations [49.43709660948812]
Acquiring fine-grained object detection annotations in unconstrained images is time-consuming, expensive, and prone to noise.
We propose a novel Bayesian Detector Combination (BDC) framework to more effectively train object detectors with noisy crowdsourced annotations.
BDC is model-agnostic, requires no prior knowledge of the annotators' skill level, and seamlessly integrates with existing object detection models.
arXiv Detail & Related papers (2024-07-10T18:00:54Z)
- The Hitchhiker's Guide to Program Analysis: A Journey with Large Language Models [18.026567399243]
Large Language Models (LLMs) offer a promising alternative to static analysis.
In this paper, we take a deep dive into the open space of LLM-assisted static analysis.
We develop LLift, a fully automated framework that interfaces with both a static analysis tool and an LLM.
arXiv Detail & Related papers (2023-08-01T02:57:43Z)
- Cross Version Defect Prediction with Class Dependency Embeddings [17.110933073074584]
We use the Class Dependency Network (CDN) as another predictor for defects, combined with static code metrics.
Our approach uses network embedding techniques to leverage CDN information without having to build the metrics manually.
arXiv Detail & Related papers (2022-12-29T18:24:39Z)
- GLENet: Boosting 3D Object Detectors with Generative Label Uncertainty Estimation [70.75100533512021]
In this paper, we formulate the label uncertainty problem as the diversity of potentially plausible bounding boxes of objects.
We propose GLENet, a generative framework adapted from conditional variational autoencoders, to model the one-to-many relationship between a typical 3D object and its potential ground-truth bounding boxes with latent variables.
The label uncertainty generated by GLENet is a plug-and-play module and can be conveniently integrated into existing deep 3D detectors.
arXiv Detail & Related papers (2022-07-06T06:26:17Z)
- Learning to Reduce False Positives in Analytic Bug Detectors [12.733531603080674]
We propose a Transformer-based learning approach to identify false positive bug warnings.
We demonstrate that our models can improve the precision of static analysis by 17.5%.
arXiv Detail & Related papers (2022-03-08T04:26:26Z)
- Detecting Security Fixes in Open-Source Repositories using Static Code Analyzers [8.716427214870459]
We study the extent to which the output of off-the-shelf static code analyzers can be used as a source of features to represent commits in Machine Learning (ML) applications.
We investigate how such features can be used to construct embeddings and train ML models to automatically identify source code commits that contain vulnerability fixes.
We find that the combination of our method with commit2vec represents a tangible improvement over the state of the art in the automatic identification of commits that fix vulnerabilities.
arXiv Detail & Related papers (2021-05-07T15:57:17Z)
- Assessing Validity of Static Analysis Warnings using Ensemble Learning [4.05739885420409]
Static Analysis (SA) tools are used to identify potential weaknesses in code and fix them in advance, while the code is being developed.
These rule-based static analysis tools generally report many false warnings alongside the actual ones.
We propose a Machine Learning (ML)-based process that uses source code, historic commit data, and classifier ensembles to prioritize the true warnings.
arXiv Detail & Related papers (2021-04-21T19:39:20Z)
- Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation [109.06060143938052]
We propose a "double perturbation" framework to uncover model weaknesses beyond the test dataset.
We apply this framework to study two perturbation-based approaches that are used to analyze models' robustness and counterfactual bias in English.
arXiv Detail & Related papers (2021-04-12T06:57:36Z)
- Robust and Transferable Anomaly Detection in Log Data using Pre-Trained Language Models [59.04636530383049]
Anomalies or failures in large computer systems, such as the cloud, have an impact on a large number of users.
We propose a framework for anomaly detection in log data, a major source of troubleshooting information in such systems.
arXiv Detail & Related papers (2021-02-23T09:17:05Z)
- Stance Detection Benchmark: How Robust Is Your Stance Detection? [65.91772010586605]
Stance Detection (StD) aims to detect an author's stance towards a certain topic or claim.
We introduce a StD benchmark that learns from ten StD datasets of various domains in a multi-dataset learning setting.
Within this benchmark setup, we are able to present new state-of-the-art results on five of the datasets.
arXiv Detail & Related papers (2020-01-06T13:37:51Z)