REEF: A Framework for Collecting Real-World Vulnerabilities and Fixes
- URL: http://arxiv.org/abs/2309.08115v1
- Date: Fri, 15 Sep 2023 02:50:08 GMT
- Title: REEF: A Framework for Collecting Real-World Vulnerabilities and Fixes
- Authors: Chaozheng Wang, Zongjie Li, Yun Peng, Shuzheng Gao, Sirong Chen, Shuai
Wang, Cuiyun Gao, Michael R. Lyu
- Abstract summary: We propose REEF, an automated framework to collect REal-world vulnErabilities and Fixes from open-source repositories.
We develop a multi-language crawler to collect vulnerabilities and their fixes, and design metrics to filter for high-quality vulnerability-fix pairs.
Through extensive experiments, we demonstrate that our approach can collect high-quality vulnerability-fix pairs and generate strong explanations.
- Score: 40.401211102969356
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Software plays a crucial role in our daily lives, and therefore the quality
and security of software systems have become increasingly important. However,
vulnerabilities in software still pose a significant threat, as they can have
serious consequences. Recent advances in automated program repair have sought
to automatically detect and fix bugs using data-driven techniques.
Sophisticated deep learning methods have been applied to this area and have
achieved promising results. However, existing benchmarks for training and
evaluating these techniques remain limited, as they tend to focus on a single
programming language and have relatively small datasets. Moreover, many
benchmarks tend to be outdated and lack diversity, focusing on a specific
codebase. Worse still, the quality of bug explanations in existing datasets is
low, as they typically use imprecise and uninformative commit messages as
explanations.
To address these issues, we propose REEF, an automated framework to collect
REal-world vulnErabilities and Fixes from open-source repositories. We
develop a multi-language crawler to collect vulnerabilities and their fixes,
and design metrics to filter for high-quality vulnerability-fix pairs.
Furthermore, we propose a neural language model-based approach to generate
high-quality vulnerability explanations, which is key to producing informative
fix messages. Through extensive experiments, we demonstrate that our approach
can collect high-quality vulnerability-fix pairs and generate strong
explanations. The dataset we collect contains 4,466 CVEs with 30,987 patches
(spanning 236 CWE types) across 7 programming languages, together with detailed
related information, and surpasses existing benchmarks in scale, coverage, and
quality. Evaluations by human experts further confirm that our framework
produces high-quality vulnerability explanations.
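To make the collection pipeline described in the abstract concrete, the following is a minimal sketch of how CVE advisories that reference GitHub fix commits could be resolved to patches and screened with simple quality metrics. It is illustrative only: the `fix_commit_url` field, the language map, and the filter thresholds are assumptions made for the sketch, not REEF's actual crawler or metrics, and the neural explanation-generation step is omitted.

```python
"""Illustrative sketch of a REEF-style vulnerability-fix collection step.
Assumes CVE records are already paired with GitHub fix-commit URLs and uses
the public GitHub commits API; field names and thresholds are hypothetical."""
import re
import requests

COMMIT_URL = re.compile(r"github\.com/([^/]+)/([^/]+)/commit/([0-9a-f]{7,40})")

# Map file extensions to languages, mirroring the multi-language goal.
EXT_TO_LANG = {".c": "C", ".cpp": "C++", ".java": "Java", ".py": "Python",
               ".js": "JavaScript", ".go": "Go", ".php": "PHP"}

def fetch_fix_commit(commit_url: str) -> dict | None:
    """Resolve a GitHub fix-commit URL to the commit metadata and diff."""
    m = COMMIT_URL.search(commit_url)
    if not m:
        return None
    owner, repo, sha = m.groups()
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/commits/{sha}",
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    return resp.json() if resp.ok else None

def passes_quality_filters(commit: dict) -> bool:
    """Illustrative quality metrics: keep small, code-only, single-language patches."""
    files = commit.get("files", [])
    if not files or len(files) > 5:          # avoid tangled, multi-purpose commits
        return False
    touched_langs = set()
    changed_lines = 0
    for f in files:
        name = f["filename"]
        ext = "." + name.rsplit(".", 1)[-1] if "." in name else ""
        if ext not in EXT_TO_LANG:           # drop docs/config-only changes
            return False
        touched_langs.add(EXT_TO_LANG[ext])
        changed_lines += f.get("additions", 0) + f.get("deletions", 0)
    return changed_lines <= 200 and len(touched_langs) == 1

def collect_pairs(cve_records: list[dict]) -> list[dict]:
    """Build (CVE, fix patch) pairs that survive the filters."""
    pairs = []
    for record in cve_records:               # e.g. {"cve_id": ..., "fix_commit_url": ...}
        commit = fetch_fix_commit(record["fix_commit_url"])
        if commit and passes_quality_filters(commit):
            pairs.append({"cve_id": record["cve_id"],
                          "commit": commit["sha"],
                          "patch": [f.get("patch", "") for f in commit["files"]]})
    return pairs
```

In the actual framework, the filtering metrics and the explanation generator are dedicated components rather than the fixed heuristics shown here; the sketch only conveys the overall crawl-then-filter structure.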
Related papers
- Data Quality Issues in Vulnerability Detection Datasets [1.6114012813668932]
Vulnerability detection is a crucial yet challenging task to identify potential weaknesses in software for cyber security.
Deep learning (DL) has made great progress in automating the detection process.
Many datasets have been created to train DL models for this purpose.
However, these datasets suffer from several issues that lead to low detection accuracy in DL models.
arXiv Detail & Related papers (2024-10-08T13:31:29Z) - Enhancing Pre-Trained Language Models for Vulnerability Detection via Semantic-Preserving Data Augmentation [4.374800396968465]
We propose a data augmentation technique aimed at enhancing the performance of pre-trained language models for vulnerability detection.
Incorporating our augmented dataset when fine-tuning a series of representative pre-trained code models yields up to a 10.1% increase in accuracy and a 23.6% increase in F1.
arXiv Detail & Related papers (2024-09-30T21:44:05Z) - Outside the Comfort Zone: Analysing LLM Capabilities in Software Vulnerability Detection [9.652886240532741]
This paper thoroughly analyses large language models' capabilities in detecting vulnerabilities within source code.
We evaluate the performance of six open-source models that are specifically trained for vulnerability detection against six general-purpose LLMs.
arXiv Detail & Related papers (2024-08-29T10:00:57Z) - AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models [95.09157454599605]
Large Language Models (LLMs) are becoming increasingly powerful, but they still exhibit significant but subtle weaknesses.
Traditional benchmarking approaches cannot thoroughly pinpoint specific model deficiencies.
We introduce a unified framework, AutoDetect, to automatically expose weaknesses in LLMs across various tasks.
arXiv Detail & Related papers (2024-06-24T15:16:45Z) - Vulnerability Detection with Code Language Models: How Far Are We? [40.455600722638906]
PrimeVul is a new dataset for training and evaluating code LMs for vulnerability detection.
It incorporates a novel set of data labeling techniques that achieve comparable label accuracy to human-verified benchmarks.
It also implements a rigorous data de-duplication and chronological data splitting strategy to mitigate data leakage issues.
arXiv Detail & Related papers (2024-03-27T14:34:29Z) - Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities [12.82645410161464]
We evaluate the effectiveness of 16 pre-trained Large Language Models on 5,000 code samples from five diverse security datasets.
Overall, LLMs show modest effectiveness in detecting vulnerabilities, obtaining an average accuracy of 62.8% and an average F1 score of 0.71 across datasets.
We find that advanced prompting strategies involving step-by-step analysis significantly improve the performance of LLMs on real-world datasets in terms of F1 score (by up to 0.18 on average).
arXiv Detail & Related papers (2023-11-16T13:17:20Z) - CodeLMSec Benchmark: Systematically Evaluating and Finding Security
Vulnerabilities in Black-Box Code Language Models [58.27254444280376]
Large language models (LLMs) for automatic code generation have achieved breakthroughs in several programming tasks.
Training data for these models is usually collected from the Internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities.
This unsanitized training data can cause the language models to learn these vulnerabilities and propagate them during the code generation procedure.
arXiv Detail & Related papers (2023-02-08T11:54:07Z) - VELVET: a noVel Ensemble Learning approach to automatically locate
VulnErable sTatements [62.93814803258067]
This paper presents VELVET, a novel ensemble learning approach to locate vulnerable statements in source code.
Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph.
VELVET achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively.
arXiv Detail & Related papers (2021-12-20T22:45:27Z) - Trustworthy AI [75.99046162669997]
Brittleness to minor adversarial changes in the input data, the inability to explain decisions, and bias in the training data are some of the most prominent limitations.
We propose a tutorial on Trustworthy AI to address six critical issues in enhancing user and public trust in AI systems.
arXiv Detail & Related papers (2020-11-02T20:04:18Z) - InfoBERT: Improving Robustness of Language Models from An Information
Theoretic Perspective [84.78604733927887]
Large-scale language models such as BERT have achieved state-of-the-art performance across a wide range of NLP tasks.
Recent studies show that such BERT-based models are vulnerable to textual adversarial attacks.
We propose InfoBERT, a novel learning framework for robust fine-tuning of pre-trained language models.
arXiv Detail & Related papers (2020-10-05T20:49:26Z)