Detecting Security Fixes in Open-Source Repositories using Static Code
Analyzers
- URL: http://arxiv.org/abs/2105.03346v1
- Date: Fri, 7 May 2021 15:57:17 GMT
- Title: Detecting Security Fixes in Open-Source Repositories using Static Code
Analyzers
- Authors: Therese Fehrer, Roc\'io Cabrera Lozoya, Antonino Sabetta, Dario Di
Nucci, Damian A. Tamburri
- Abstract summary: We study the extent to which the output of off-the-shelf static code analyzers can be used as a source of features to represent commits in Machine Learning (ML) applications.
We investigate how such features can be used to construct embeddings and train ML models to automatically identify source code commits that contain vulnerability fixes.
We find that the combination of our method with commit2vec represents a tangible improvement over the state of the art in the automatic identification of commits that fix vulnerabilities.
- Score: 8.716427214870459
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The sources of reliable, code-level information about vulnerabilities that
affect open-source software (OSS) are scarce, which hinders a broad adoption of
advanced tools that provide code-level detection and assessment of vulnerable
OSS dependencies.
In this paper, we study the extent to which the output of off-the-shelf
static code analyzers can be used as a source of features to represent commits
in Machine Learning (ML) applications. In particular, we investigate how such
features can be used to construct embeddings and train ML models to
automatically identify source code commits that contain vulnerability fixes.
We analyze such embeddings for security-relevant and non-security-relevant
commits, and we show that, although in isolation they are not different in a
statistically significant manner, it is possible to use them to construct a ML
pipeline that achieves results comparable with the state of the art.
We also found that the combination of our method with commit2vec represents a
tangible improvement over the state of the art in the automatic identification
of commits that fix vulnerabilities: the ML models we construct and commit2vec
are complementary, the former being more generally applicable, albeit not as
accurate.
Related papers
- Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses.
Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives.
The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z) - Harnessing Large Language Models for Software Vulnerability Detection: A Comprehensive Benchmarking Study [1.03590082373586]
We propose using large language models (LLMs) to assist in finding vulnerabilities in source code.
The aim is to test multiple state-of-the-art LLMs and identify the best prompting strategies.
We find that LLMs can pinpoint many more issues than traditional static analysis tools, outperforming traditional tools in terms of recall and F1 scores.
arXiv Detail & Related papers (2024-05-24T14:59:19Z) - Software Vulnerability and Functionality Assessment using LLMs [0.8057006406834466]
We investigate whether Large Language Models (LLMs) can aid with code reviews.
Our investigation focuses on two tasks that we argue are fundamental to good reviews.
arXiv Detail & Related papers (2024-03-13T11:29:13Z) - Detectors for Safe and Reliable LLMs: Implementations, Uses, and Limitations [76.19419888353586]
Large language models (LLMs) are susceptible to a variety of risks, from non-faithful output to biased and toxic generations.
We present our efforts to create and deploy a library of detectors: compact and easy-to-build classification models that provide labels for various harms.
arXiv Detail & Related papers (2024-03-09T21:07:16Z) - ReposVul: A Repository-Level High-Quality Vulnerability Dataset [13.90550557801464]
We propose an automated data collection framework and construct the first repository-level high-quality vulnerability dataset named ReposVul.
The proposed framework mainly contains three modules: (1) A vulnerability untangling module, aiming at distinguishing vulnerability-fixing related code changes from tangled patches, in which the Large Language Models (LLMs) and static analysis tools are jointly employed, (2) A multi-granularity dependency extraction module, aiming at capturing the inter-procedural call relationships of vulnerabilities, in which we construct multiple-granularity information for each vulnerability patch, including repository-level, file-level, function-level
arXiv Detail & Related papers (2024-01-24T01:27:48Z) - VELVET: a noVel Ensemble Learning approach to automatically locate
VulnErable sTatements [62.93814803258067]
This paper presents VELVET, a novel ensemble learning approach to locate vulnerable statements in source code.
Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph.
VELVET achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively.
arXiv Detail & Related papers (2021-12-20T22:45:27Z) - Creating Training Sets via Weak Indirect Supervision [66.77795318313372]
Weak Supervision (WS) frameworks synthesize training labels from multiple potentially noisy supervision sources.
We formulate Weak Indirect Supervision (WIS), a new research problem for automatically synthesizing training labels.
We develop a probabilistic modeling approach, PLRM, which uses user-provided label relations to model and leverage indirect supervision sources.
arXiv Detail & Related papers (2021-10-07T14:09:35Z) - RoFL: Attestable Robustness for Secure Federated Learning [59.63865074749391]
Federated Learning allows a large number of clients to train a joint model without the need to share their private data.
To ensure the confidentiality of the client updates, Federated Learning systems employ secure aggregation.
We present RoFL, a secure Federated Learning system that improves robustness against malicious clients.
arXiv Detail & Related papers (2021-07-07T15:42:49Z) - Multi-context Attention Fusion Neural Network for Software Vulnerability
Identification [4.05739885420409]
We propose a deep learning model that learns to detect some of the common categories of security vulnerabilities in source code efficiently.
The model builds an accurate understanding of code semantics with a lot less learnable parameters.
The proposed AI achieves 98.40% F1-score on specific CWEs from the benchmarked NIST SARD dataset.
arXiv Detail & Related papers (2021-04-19T11:50:36Z) - Automated Mapping of Vulnerability Advisories onto their Fix Commits in
Open Source Repositories [7.629717457706326]
We present an approach that combines practical experience and machine-learning (ML)
An advisory record containing key information about a vulnerability is extracted from an advisory.
A subset of candidate fix commits is obtained from the source code repository of the affected project.
arXiv Detail & Related papers (2021-03-24T17:50:35Z) - D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using
Differential Analysis [55.15995704119158]
We propose D2A, a differential analysis based approach to label issues reported by static analysis tools.
We use D2A to generate a large labeled dataset to train models for vulnerability identification.
arXiv Detail & Related papers (2021-02-16T07:46:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.