Improving the Context Length and Efficiency of Code Retrieval for Tracing Security Vulnerability Fixes
- URL: http://arxiv.org/abs/2503.22935v1
- Date: Sat, 29 Mar 2025 01:53:07 GMT
- Title: Improving the Context Length and Efficiency of Code Retrieval for Tracing Security Vulnerability Fixes
- Authors: Xueqing Liu, Jiangrui Zheng, Guanqun Yang, Siyan Wen, Qiushi Liu,
- Abstract summary: A critical task in vulnerability management is tracing the patches that fix a vulnerability. Previous work has shown that the patch information is often missing in vulnerability databases. We propose SITPatchTracer, a scalable full-repo full-context retrieval system.
- Score: 1.3606495556399092
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, the rapid increase in security vulnerabilities has made them much harder to manage. One critical task in vulnerability management is tracing the patches that fix a vulnerability. By accurately tracing the patching commits, security stakeholders can precisely identify affected software components, determine vulnerable and fixed versions, and assess severity, which facilitates rapid deployment of mitigations. However, previous work has shown that the patch information is often missing in vulnerability databases, including both the National Vulnerability Database (NVD) and the GitHub Advisory Database, which increases the risk of delayed mitigation, incorrect vulnerability assessment, and potential exploits. Although existing work has proposed several approaches for patch tracing, these approaches face two major challenges: (1) the lack of scalability to the full-repository level, and (2) the lack of study on how to model the semantic similarity between the CVE and the full diff code. Upon identifying this gap, we propose SITPatchTracer, a scalable full-repo full-context retrieval system for security vulnerability patch tracing. SITPatchTracer leverages ElasticSearch, learning-to-rank, and a hierarchical embedding approach based on GritLM, a top-ranked LLM for text embedding with unlimited context length and fast inference speed. The evaluation of SITPatchTracer shows that it achieves a high recall on both evaluated datasets. SITPatchTracer's recall outperforms not only several existing works (PatchFinder, PatchScout, VFCFinder) but also Voyage, the SOTA commercial code embedding API, by 13% and 28%.
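The retrieval idea in the abstract (a cheap full-repository candidate search followed by similarity scoring over the whole diff) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the paper's implementation: token overlap stands in for the ElasticSearch stage, term-count cosine similarity stands in for GritLM embeddings, max-pooling over fixed-size line chunks stands in for the hierarchical embedding, and the learning-to-rank step is omitted. The names `score_commit`, `rank_commits`, and `prefilter_k`, as well as the toy commits, are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk_lines(diff, lines_per_chunk=2):
    """Split a diff into fixed-size line chunks so long diffs are never truncated."""
    lines = diff.splitlines()
    return ["\n".join(lines[i:i + lines_per_chunk])
            for i in range(0, len(lines), lines_per_chunk)]


def score_commit(cve_text, diff, lines_per_chunk=2):
    """Score a commit by its best-matching chunk (max-pooling is an assumption)."""
    query = Counter(tokenize(cve_text))
    chunks = chunk_lines(diff, lines_per_chunk)
    return max((cosine(query, Counter(tokenize(c))) for c in chunks), default=0.0)


def rank_commits(cve_text, commits, prefilter_k=2):
    """Stage 1: keep the prefilter_k commits sharing the most tokens with the CVE text.
    Stage 2: re-rank those candidates by chunk-level similarity."""
    query_tokens = set(tokenize(cve_text))

    def overlap(commit_id):
        return len(query_tokens & set(tokenize(commits[commit_id])))

    candidates = sorted(commits, key=overlap, reverse=True)[:prefilter_k]
    return sorted(candidates,
                  key=lambda cid: score_commit(cve_text, commits[cid]),
                  reverse=True)


# Toy repository history: commit id -> diff text (hypothetical data).
commits = {
    "a1": "update readme\nfix typo in docs",
    "b2": "bump version\nrefactor logging\nfix buffer overflow in parse_header\nadd bounds check",
    "c3": "add parse benchmark\ntune header cache",
}
ranking = rank_commits("buffer overflow in parse_header", commits)
print(ranking)
```

Scoring per chunk rather than per whole diff means a relevant hunk buried in a long commit (here, the second chunk of `b2`) is not diluted by unrelated changes in the same commit.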
Related papers
- How Robust Are Router-LLMs? Analysis of the Fragility of LLM Routing Capabilities [62.474732677086855]
Large language model (LLM) routing has emerged as a crucial strategy for balancing computational costs with performance.
We propose the DSC benchmark: Diverse, Simple, and Categorized, an evaluation framework that categorizes router performance across a broad spectrum of query types.
arXiv Detail & Related papers (2025-03-20T19:52:30Z) - CommitShield: Tracking Vulnerability Introduction and Fix in Version Control Systems [15.037460085046806]
CommitShield is a tool for detecting vulnerabilities in code commits.
It combines the code analysis capabilities of static analysis tools with the natural language and code understanding capabilities of large language models.
We show that CommitShield improves recall by 76%-87% over state-of-the-art methods in the vulnerability fix detection task.
arXiv Detail & Related papers (2025-01-07T08:52:55Z) - Improving Discovery of Known Software Vulnerability For Enhanced Cybersecurity [0.0]
Vulnerability detection relies on standardized identifiers such as Common Platform Enumeration (CPE) strings.
Non-standardized CPE strings issued by software vendors create a significant challenge.
Inconsistent naming conventions and versioning practices lead to mismatches when querying databases.
arXiv Detail & Related papers (2024-12-21T12:43:52Z) - Learning Graph-based Patch Representations for Identifying and Assessing Silent Vulnerability Fixes [5.983725940750908]
Software projects are dependent on many third-party libraries, therefore high-risk vulnerabilities can propagate through the dependency chain to downstream projects.
Silent vulnerability fixes leave downstream software unaware of urgent security issues for too long, posing a security risk to that software.
We propose GRAPE, a GRAph-based Patch rEpresentation that aims to provide a unified framework for getting vulnerability fix patches representation.
arXiv Detail & Related papers (2024-09-13T03:23:11Z) - The Impact of SBOM Generators on Vulnerability Assessment in Python: A Comparison and a Novel Approach [56.4040698609393]
Software Bill of Materials (SBOM) has been promoted as a tool to increase transparency and verifiability in software composition.
Current SBOM generation tools often suffer from inaccuracies in identifying components and dependencies.
We propose PIP-sbom, a novel pip-inspired solution that addresses their shortcomings.
arXiv Detail & Related papers (2024-09-10T10:12:37Z) - Vulnerability Detection with Code Language Models: How Far Are We? [40.455600722638906]
PrimeVul is a new dataset for training and evaluating code LMs for vulnerability detection.
It incorporates a novel set of data labeling techniques that achieve comparable label accuracy to human-verified benchmarks.
It also implements a rigorous data de-duplication and chronological data splitting strategy to mitigate data leakage issues.
arXiv Detail & Related papers (2024-03-27T14:34:29Z) - Profile of Vulnerability Remediations in Dependencies Using Graph Analysis [40.35284812745255]
This research introduces graph analysis methods and a modified Graph Attention Convolutional Neural Network (GAT) model.
We analyze control flow graphs to profile breaking changes in applications occurring from dependency upgrades intended to remediate vulnerabilities.
Results demonstrate the effectiveness of the enhanced GAT model in offering nuanced insights into the relational dynamics of code vulnerabilities.
arXiv Detail & Related papers (2024-03-08T02:01:47Z) - ReposVul: A Repository-Level High-Quality Vulnerability Dataset [13.90550557801464]
We propose an automated data collection framework and construct the first repository-level high-quality vulnerability dataset named ReposVul.
The proposed framework mainly contains three modules: (1) A vulnerability untangling module, aiming at distinguishing vulnerability-fixing related code changes from tangled patches, in which the Large Language Models (LLMs) and static analysis tools are jointly employed, (2) A multi-granularity dependency extraction module, aiming at capturing the inter-procedural call relationships of vulnerabilities, in which we construct multiple-granularity information for each vulnerability patch, including repository-level, file-level, function-level
arXiv Detail & Related papers (2024-01-24T01:27:48Z) - SliceLocator: Locating Vulnerable Statements with Graph-based Detectors [33.395068754566935]
SliceLocator identifies the most relevant taint flow by selecting the highest-weighted flow path from all potential vulnerability-triggering statements.
We demonstrate that SliceLocator consistently performs well on four state-of-the-art GNN-based vulnerability detectors.
arXiv Detail & Related papers (2024-01-05T10:15:04Z) - Just-in-Time Detection of Silent Security Patches [7.840762542485285]
Security patches can be silent, i.e., they do not always come with comprehensive advisories such as CVEs.
This lack of transparency leaves users oblivious to available security updates, providing ample opportunity for attackers to exploit unpatched vulnerabilities.
We propose to leverage large language models (LLMs) to augment patch information with generated code change explanations.
arXiv Detail & Related papers (2023-12-02T22:53:26Z) - CompVPD: Iteratively Identifying Vulnerability Patches Based on Human Validation Results with a Precise Context [16.69634193308039]
It is challenging to apply security patches to open source software in a timely manner because notifications of patches are often incomplete and delayed.
We propose a multi-granularity slicing algorithm and an adaptive-expanding algorithm to accurately identify code related to the patches.
We empirically compare CompVPD with four state-of-the-art/practice (SOTA) approaches in identifying vulnerability patches.
arXiv Detail & Related papers (2023-10-04T02:08:18Z) - REEF: A Framework for Collecting Real-World Vulnerabilities and Fixes [40.401211102969356]
We propose an automated collecting framework REEF to collect REal-world vulnErabilities and Fixes from open-source repositories.
We develop a multi-language crawler to collect vulnerabilities and their fixes, and design metrics to filter for high-quality vulnerability-fix pairs.
Through extensive experiments, we demonstrate that our approach can collect high-quality vulnerability-fix pairs and generate strong explanations.
arXiv Detail & Related papers (2023-09-15T02:50:08Z) - VELVET: a noVel Ensemble Learning approach to automatically locate VulnErable sTatements [62.93814803258067]
This paper presents VELVET, a novel ensemble learning approach to locate vulnerable statements in source code.
Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph.
VELVET achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively.
arXiv Detail & Related papers (2021-12-20T22:45:27Z)