Related papers: ReposVul: A Repository-Level High-Quality Vulnerability Dataset

ReposVul: A Repository-Level High-Quality Vulnerability Dataset

URL: http://arxiv.org/abs/2401.13169v2
Date: Thu, 8 Feb 2024 05:06:47 GMT
Title: ReposVul: A Repository-Level High-Quality Vulnerability Dataset
Authors: Xinchen Wang, Ruida Hu, Cuiyun Gao, Xin-Cheng Wen, Yujia Chen and Qing Liao
Abstract summary: We propose an automated data collection framework and construct the first repository-level high-quality vulnerability dataset named ReposVul. The proposed framework mainly contains three modules: (1) A vulnerability untangling module, aiming at distinguishing vulnerability-fixing related code changes from tangled patches, in which the Large Language Models (LLMs) and static analysis tools are jointly employed, (2) A multi-granularity dependency extraction module, aiming at capturing the inter-procedural call relationships of vulnerabilities, in which we construct multiple-granularity information for each vulnerability patch, including repository-level, file-level, function-level
Score: 13.90550557801464
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Open-Source Software (OSS) vulnerabilities bring great challenges to the software security and pose potential risks to our society. Enormous efforts have been devoted into automated vulnerability detection, among which deep learning (DL)-based approaches have proven to be the most effective. However, the current labeled data present the following limitations: (1) Tangled Patches: Developers may submit code changes unrelated to vulnerability fixes within patches, leading to tangled patches. (2) Lacking Inter-procedural Vulnerabilities: The existing vulnerability datasets typically contain function-level and file-level vulnerabilities, ignoring the relations between functions, thus rendering the approaches unable to detect the inter-procedural vulnerabilities. (3) Outdated Patches: The existing datasets usually contain outdated patches, which may bias the model during training. To address the above limitations, in this paper, we propose an automated data collection framework and construct the first repository-level high-quality vulnerability dataset named ReposVul. The proposed framework mainly contains three modules: (1) A vulnerability untangling module, aiming at distinguishing vulnerability-fixing related code changes from tangled patches, in which the Large Language Models (LLMs) and static analysis tools are jointly employed. (2) A multi-granularity dependency extraction module, aiming at capturing the inter-procedural call relationships of vulnerabilities, in which we construct multiple-granularity information for each vulnerability patch, including repository-level, file-level, function-level, and line-level. (3) A trace-based filtering module, aiming at filtering the outdated patches, which leverages the file path trace-based filter and commit time trace-based filter to construct an up-to-date dataset.

Related papers

Improving the Context Length and Efficiency of Code Retrieval for Tracing Security Vulnerability Fixes [1.3606495556399092]
A critical task in vulnerability management is tracing the patches that fix a vulnerability. Previous work has shown that the patch information is often missing in vulnerability databases. We propose SITPatchTracer, a scalable full-repo full-context retrieval system.
arXiv Detail & Related papers (2025-03-29T01:53:07Z)
Lie Detector: Unified Backdoor Detection via Cross-Examination Framework [68.45399098884364]
We propose a unified backdoor detection framework in the semi-honest setting. Our method achieves superior detection performance, improving accuracy by 5.4%, 1.6%, and 11.9% over SoTA baselines. Notably, it is the first to effectively detect backdoors in multimodal large language models.
arXiv Detail & Related papers (2025-03-21T06:12:06Z)
ReF Decompile: Relabeling and Function Call Enhanced Decompile [50.86228893636785]
The goal of decompilation is to convert compiled low-level code (e.g., assembly code) back into high-level programming languages. This task supports various reverse engineering applications, such as vulnerability identification, malware analysis, and legacy software migration.
arXiv Detail & Related papers (2025-02-17T12:38:57Z)
VulRG: Multi-Level Explainable Vulnerability Patch Ranking for Complex Systems Using Graphs [20.407534993667607]
This work introduces a graph-based framework for vulnerability patch prioritization. It integrates diverse data sources and metrics into a universally applicable model. refined risk metrics enable detailed assessments at the component, asset, and system levels.
arXiv Detail & Related papers (2025-02-16T14:21:52Z)
Repository-Level Graph Representation Learning for Enhanced Security Patch Detection [22.039868029497942]
This paper proposes a Repository-level Security Patch Detection framework named RepoSPD. RepoSPD comprises three key components: 1) a repository-level graph construction, RepoCPG, which represents software patches by merging pre-patch and post-patch source code at the repository level; 2) a structure-aware patch representation, which fuses the graph and sequence branch and aims at comprehending the relationship among multiple code changes; and 3) progressive learning, which facilitates the model in balancing semantic and structural information.
arXiv Detail & Related papers (2024-12-11T03:29:56Z)
Learning Graph-based Patch Representations for Identifying and Assessing Silent Vulnerability Fixes [5.983725940750908]
Software projects are dependent on many third-party libraries, therefore high-risk vulnerabilities can propagate through the dependency chain to downstream projects. Silent vulnerability fixes cause downstream software to be unaware of urgent security issues in a timely manner, posing a security risk to the software. We propose GRAPE, a GRAph-based Patch rEpresentation that aims to provide a unified framework for getting vulnerability fix patches representation.
arXiv Detail & Related papers (2024-09-13T03:23:11Z)
LLM-Enhanced Static Analysis for Precise Identification of Vulnerable OSS Versions [12.706661324384319]
Open-source software (OSS) has experienced a surge in popularity, attributed to its collaborative development model and cost-effective nature. The adoption of specific software versions in development projects may introduce security risks when these versions bring along vulnerabilities. Current methods of identifying vulnerable versions typically analyze and trace the code involved in vulnerability patches using static analysis with pre-defined rules. This paper presents Vercation, an approach designed to identify vulnerable versions of OSS written in C/C++.
arXiv Detail & Related papers (2024-08-14T06:43:06Z)
PriRoAgg: Achieving Robust Model Aggregation with Minimum Privacy Leakage for Federated Learning [49.916365792036636]
Federated learning (FL) has recently gained significant momentum due to its potential to leverage large-scale distributed user data. The transmitted model updates can potentially leak sensitive user information, and the lack of central control of the local training process leaves the global model susceptible to malicious manipulations on model updates. We develop a general framework PriRoAgg, utilizing Lagrange coded computing and distributed zero-knowledge proof, to execute a wide range of robust aggregation algorithms while satisfying aggregated privacy.
arXiv Detail & Related papers (2024-07-12T03:18:08Z)
VulEval: Towards Repository-Level Evaluation of Software Vulnerability Detection [14.312197590230994]
repository-level evaluation system named textbfVulEval aims at evaluating the detection performance of inter- and intra-procedural vulnerabilities simultaneously. VulEval consists of a large-scale dataset, with a total of 4,196 CVE entries, 232,239 functions, and corresponding 4,699 repository-level source code in C/C++ programming languages.
arXiv Detail & Related papers (2024-04-24T02:16:11Z)
Just-in-Time Detection of Silent Security Patches [7.840762542485285]
Security patches can be em silent, i.e., they do not always come with comprehensive advisories such as CVEs. This lack of transparency leaves users oblivious to available security updates, providing ample opportunity for attackers to exploit unpatched vulnerabilities. We propose to leverage large language models (LLMs) to augment patch information with generated code change explanations.
arXiv Detail & Related papers (2023-12-02T22:53:26Z)
REEF: A Framework for Collecting Real-World Vulnerabilities and Fixes [40.401211102969356]
We propose an automated collecting framework REEF to collect REal-world vulnErabilities and Fixes from open-source repositories. We develop a multi-language crawler to collect vulnerabilities and their fixes, and design metrics to filter for high-quality vulnerability-fix pairs. Through extensive experiments, we demonstrate that our approach can collect high-quality vulnerability-fix pairs and generate strong explanations.
arXiv Detail & Related papers (2023-09-15T02:50:08Z)
DeepfakeBench: A Comprehensive Benchmark of Deepfake Detection [55.70982767084996]
A critical yet frequently overlooked challenge in the field of deepfake detection is the lack of a standardized, unified, comprehensive benchmark. We present the first comprehensive benchmark for deepfake detection, called DeepfakeBench, which offers three key contributions. DeepfakeBench contains 15 state-of-the-art detection methods, 9CL datasets, a series of deepfake detection evaluation protocols and analysis tools, as well as comprehensive evaluations.
arXiv Detail & Related papers (2023-07-04T01:34:41Z)
CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models [58.27254444280376]
Large language models (LLMs) for automatic code generation have achieved breakthroughs in several programming tasks. Training data for these models is usually collected from the Internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities. This unsanitized training data can cause the language models to learn these vulnerabilities and propagate them during the code generation procedure.
arXiv Detail & Related papers (2023-02-08T11:54:07Z)
Reference Twice: A Simple and Unified Baseline for Few-Shot Instance Segmentation [103.90033029330527]
Few-Shot Instance (FSIS) requires detecting and segmenting novel classes with limited support examples. We introduce a unified framework, Reference Twice (RefT), to exploit the relationship between support and query features for FSIS.
arXiv Detail & Related papers (2023-01-03T15:33:48Z)
Defensive Patches for Robust Recognition in the Physical World [111.46724655123813]
Data-end defense improves robustness by operations on input data instead of modifying models. Previous data-end defenses show low generalization against diverse noises and weak transferability across multiple models. We propose a defensive patch generation framework to address these problems by helping models better exploit these features.
arXiv Detail & Related papers (2022-04-13T07:34:51Z)
VELVET: a noVel Ensemble Learning approach to automatically locate VulnErable sTatements [62.93814803258067]
This paper presents VELVET, a novel ensemble learning approach to locate vulnerable statements in source code. Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph. VELVET achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively.
arXiv Detail & Related papers (2021-12-20T22:45:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.