REEF: A Framework for Collecting Real-World Vulnerabilities and Fixes
- URL: http://arxiv.org/abs/2309.08115v1
- Date: Fri, 15 Sep 2023 02:50:08 GMT
- Title: REEF: A Framework for Collecting Real-World Vulnerabilities and Fixes
- Authors: Chaozheng Wang, Zongjie Li, Yun Peng, Shuzheng Gao, Sirong Chen, Shuai
Wang, Cuiyun Gao, Michael R. Lyu
- Abstract summary: We propose an automated collecting framework REEF to collect REal-world vulnErabilities and Fixes from open-source repositories.
We develop a multi-language crawler to collect vulnerabilities and their fixes, and design metrics to filter for high-quality vulnerability-fix pairs.
Through extensive experiments, we demonstrate that our approach can collect high-quality vulnerability-fix pairs and generate strong explanations.
- Score: 40.401211102969356
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Software plays a crucial role in our daily lives, and therefore the quality
and security of software systems have become increasingly important. However,
vulnerabilities in software still pose a significant threat, as they can have
serious consequences. Recent advances in automated program repair have sought
to automatically detect and fix bugs using data-driven techniques.
Sophisticated deep learning methods have been applied to this area and have
achieved promising results. However, existing benchmarks for training and
evaluating these techniques remain limited, as they tend to focus on a single
programming language and have relatively small datasets. Moreover, many
benchmarks tend to be outdated and lack diversity, focusing on a specific
codebase. Worse still, the quality of bug explanations in existing datasets is
low, as they typically use imprecise and uninformative commit messages as
explanations.
To address these issues, we propose an automated collecting framework REEF to
collect REal-world vulnErabilities and Fixes from open-source repositories. We
develop a multi-language crawler to collect vulnerabilities and their fixes,
and design metrics to filter for high-quality vulnerability-fix pairs.
Furthermore, we propose a neural language model-based approach to generate
high-quality vulnerability explanations, which is key to producing informative
fix messages. Through extensive experiments, we demonstrate that our approach
can collect high-quality vulnerability-fix pairs and generate strong
explanations. The dataset we collect contains 4,466 CVEs with 30,987 patches
(including 236 CWE) across 7 programming languages with detailed related
information, which is superior to existing benchmarks in scale, coverage, and
quality. Evaluations by human experts further confirm that our framework
produces high-quality vulnerability explanations.
Related papers
- AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models [95.09157454599605]
Large Language Models (LLMs) are becoming increasingly powerful, but they still exhibit significant but subtle weaknesses.
Traditional benchmarking approaches cannot thoroughly pinpoint specific model deficiencies.
We introduce a unified framework, AutoDetect, to automatically expose weaknesses in LLMs across various tasks.
arXiv Detail & Related papers (2024-06-24T15:16:45Z) - VulZoo: A Comprehensive Vulnerability Intelligence Dataset [12.229092589037808]
VulZoo is a comprehensive vulnerability intelligence dataset that covers 17 popular vulnerability information sources.
We make VulZoo publicly available and maintain it with incremental updates to facilitate future research.
arXiv Detail & Related papers (2024-06-24T06:39:07Z) - Vulnerability Detection with Code Language Models: How Far Are We? [40.455600722638906]
PrimeVul is a new dataset for training and evaluating code LMs for vulnerability detection.
It incorporates a novel set of data labeling techniques that achieve comparable label accuracy to human-verified benchmarks.
It also implements a rigorous data de-duplication and chronological data splitting strategy to mitigate data leakage issues.
arXiv Detail & Related papers (2024-03-27T14:34:29Z) - Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities [12.82645410161464]
Large Language Models (LLMs) have demonstrated remarkable performance on code-related tasks.
We evaluate whether pre-trained LLMs can detect security vulnerabilities and address the limitations of existing tools.
arXiv Detail & Related papers (2023-11-16T13:17:20Z) - DefectHunter: A Novel LLM-Driven Boosted-Conformer-based Code Vulnerability Detection Mechanism [3.9377491512285157]
DefectHunter is an innovative model for vulnerability identification that employs the Conformer mechanism.
This mechanism fuses self-attention with convolutional networks to capture both local, position-wise features and global, content-based interactions.
arXiv Detail & Related papers (2023-09-27T00:10:29Z) - CodeLMSec Benchmark: Systematically Evaluating and Finding Security
Vulnerabilities in Black-Box Code Language Models [58.27254444280376]
Large language models (LLMs) for automatic code generation have achieved breakthroughs in several programming tasks.
Training data for these models is usually collected from the Internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities.
This unsanitized training data can cause the language models to learn these vulnerabilities and propagate them during the code generation procedure.
arXiv Detail & Related papers (2023-02-08T11:54:07Z) - VELVET: a noVel Ensemble Learning approach to automatically locate
VulnErable sTatements [62.93814803258067]
This paper presents VELVET, a novel ensemble learning approach to locate vulnerable statements in source code.
Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph.
VELVET achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively.
arXiv Detail & Related papers (2021-12-20T22:45:27Z) - V2W-BERT: A Framework for Effective Hierarchical Multiclass
Classification of Software Vulnerabilities [7.906207218788341]
We present a novel Transformer-based learning framework (V2W-BERT) in this paper.
By using ideas from natural language processing, link prediction and transfer learning, our method outperforms previous approaches.
We achieve up to 97% prediction accuracy for randomly partitioned data and up to 94% prediction accuracy in temporally partitioned data.
arXiv Detail & Related papers (2021-02-23T05:16:57Z) - Trustworthy AI [75.99046162669997]
Brittleness to minor adversarial changes in the input data, ability to explain the decisions, address the bias in their training data, are some of the most prominent limitations.
We propose the tutorial on Trustworthy AI to address six critical issues in enhancing user and public trust in AI systems.
arXiv Detail & Related papers (2020-11-02T20:04:18Z) - InfoBERT: Improving Robustness of Language Models from An Information
Theoretic Perspective [84.78604733927887]
Large-scale language models such as BERT have achieved state-of-the-art performance across a wide range of NLP tasks.
Recent studies show that such BERT-based models are vulnerable facing the threats of textual adversarial attacks.
We propose InfoBERT, a novel learning framework for robust fine-tuning of pre-trained language models.
arXiv Detail & Related papers (2020-10-05T20:49:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.