Related papers: Evaluating Pre-Trained Models for Multi-Language Vulnerability Patching

Evaluating Pre-Trained Models for Multi-Language Vulnerability Patching

URL: http://arxiv.org/abs/2501.07339v1
Date: Mon, 13 Jan 2025 13:51:05 GMT
Title: Evaluating Pre-Trained Models for Multi-Language Vulnerability Patching
Authors: Zanis Ali Khan, Aayush Garg, Yuejun Guo, Qiang Tang,
Abstract summary: This paper investigates the potential of pre-trained language models, CodeBERT and CodeT5, for automated vulnerability patching.<n>We evaluate these models on their accuracy, computational efficiency, and how the length of vulnerable code patches impacts performance.
Score: 3.220818227251765
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Software vulnerabilities pose critical security risks, demanding prompt and effective mitigation strategies. While advancements in Automated Program Repair (APR) have primarily targeted general software bugs, the domain of vulnerability patching, which is a security-critical subset of APR, remains underexplored. This paper investigates the potential of pre-trained language models, CodeBERT and CodeT5, for automated vulnerability patching across diverse datasets and five programming languages. We evaluate these models on their accuracy, computational efficiency, and how the length of vulnerable code patches impacts performance. Our findings reveal promising accuracy levels, particularly for CodeT5 on datasets with complex vulnerability patterns, while CodeBERT demonstrates strengths in handling fragmented or context-limited datasets. CodeT5 further showcases superior efficiency, making it well-suited for large-scale applications. However, both models face challenges in maintaining performance as patch length increases, highlighting the complexity of addressing extended in program repair specifically aimed at fixing vulnerabilities. This study benchmarks model performance, highlights key limitations, and offers insights to improve automated vulnerability patching for practical security applications.

Related papers

Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security [63.41350337821108]
We propose Secure Tug-of-War (SecTOW) to enhance the security of multimodal large language models (MLLMs)<n>SecTOW consists of two modules: a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO)<n>We show that SecTOW significantly improves security while preserving general performance.
arXiv Detail & Related papers (2025-07-29T17:39:48Z)
White-Basilisk: A Hybrid Model for Code Vulnerability Detection [50.49233187721795]
We introduce White-Basilisk, a novel approach to vulnerability detection that demonstrates superior performance.<n>White-Basilisk achieves results in vulnerability detection tasks with a parameter count of only 200M.<n>This research establishes new benchmarks in code security and provides empirical evidence that compact, efficiently designed models can outperform larger counterparts in specialized tasks.
arXiv Detail & Related papers (2025-07-11T12:39:25Z)
Boosting Vulnerability Detection of LLMs via Curriculum Preference Optimization with Synthetic Reasoning Data [22.557961978833386]
We propose a novel framework for large language models (LLMs) that excels at mining vulnerability patterns.<n>Specifically, we construct forward and backward reasoning processes for vulnerability and corresponding fixed code, ensuring the synthesis of high-quality reasoning data.<n>We show that ReVD sets new state-of-the-art for LLM-based software vulnerability detection, e.g., 12.24%-22.77% improvement in the accuracy.
arXiv Detail & Related papers (2025-06-09T03:25:23Z)
A Multi-Dataset Evaluation of Models for Automated Vulnerability Repair [2.7674959824386858]
This study investigates pre-trained language models, CodeBERT and CodeT5, for automated vulnerability patching across six datasets and four languages.<n>We evaluate their accuracy and generalization to unknown vulnerabilities.<n>Results show that while both models face challenges with fragmented or sparse context, CodeBERT performs comparatively better in such scenarios, whereas CodeT5 excels in capturing complex vulnerability patterns.
arXiv Detail & Related papers (2025-06-05T13:00:19Z)
CyberGym: Evaluating AI Agents' Cybersecurity Capabilities with Real-World Vulnerabilities at Scale [46.76144797837242]
Large language model (LLM) agents are becoming increasingly skilled at handling cybersecurity tasks autonomously.<n>Existing benchmarks fall short, often failing to capture real-world scenarios or being limited in scope.<n>We introduce CyberGym, a large-scale and high-quality cybersecurity evaluation framework featuring 1,507 real-world vulnerabilities.
arXiv Detail & Related papers (2025-06-03T07:35:14Z)
Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute scaling framework that leverages increased inference-time instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. We demonstrate our textbf32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z)
Interactive Agents to Overcome Ambiguity in Software Engineering [61.40183840499932]
AI agents are increasingly being deployed to automate tasks, often based on ambiguous and underspecified user instructions. Making unwarranted assumptions and failing to ask clarifying questions can lead to suboptimal outcomes. We study the ability of LLM agents to handle ambiguous instructions in interactive code generation settings by evaluating proprietary and open-weight models on their performance.
arXiv Detail & Related papers (2025-02-18T17:12:26Z)
Breaking Focus: Contextual Distraction Curse in Large Language Models [68.4534308805202]
We investigate a critical vulnerability in Large Language Models (LLMs) This phenomenon arises when models fail to maintain consistent performance on questions modified with semantically coherent but irrelevant context. We propose an efficient tree-based search methodology to automatically generate CDV examples.
arXiv Detail & Related papers (2025-02-03T18:43:36Z)
Beyond the Surface: An NLP-based Methodology to Automatically Estimate CVE Relevance for CAPEC Attack Patterns [42.63501759921809]
We propose a methodology leveraging Natural Language Processing (NLP) to associate Common Vulnerabilities and Exposure (CAPEC) vulnerabilities with Common Attack Patternion and Classification (CAPEC) attack patterns.<n> Experimental evaluations demonstrate superior performance compared to state-of-the-art models.
arXiv Detail & Related papers (2025-01-13T08:39:52Z)
CRepair: CVAE-based Automatic Vulnerability Repair Technology [1.147605955490786]
Software vulnerabilities pose significant threats to the integrity, security, and reliability of modern software and its application data. To address the challenges of vulnerability repair, researchers have proposed various solutions, with learning-based automatic vulnerability repair techniques gaining widespread attention. This paper proposes CRepair, a CVAE-based automatic vulnerability repair technology aimed at fixing security vulnerabilities in system code.
arXiv Detail & Related papers (2024-11-08T12:55:04Z)
Outside the Comfort Zone: Analysing LLM Capabilities in Software Vulnerability Detection [9.652886240532741]
This paper thoroughly analyses large language models' capabilities in detecting vulnerabilities within source code. We evaluate the performance of six open-source models that are specifically trained for vulnerability detection against six general-purpose LLMs.
arXiv Detail & Related papers (2024-08-29T10:00:57Z)
Vulnerability Detection with Code Language Models: How Far Are We? [40.455600722638906]
PrimeVul is a new dataset for training and evaluating code LMs for vulnerability detection. It incorporates a novel set of data labeling techniques that achieve comparable label accuracy to human-verified benchmarks. It also implements a rigorous data de-duplication and chronological data splitting strategy to mitigate data leakage issues.
arXiv Detail & Related papers (2024-03-27T14:34:29Z)
Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities [12.82645410161464]
We evaluate the effectiveness of 16 pre-trained Large Language Models on 5,000 code samples from five diverse security datasets. Overall, LLMs show modest effectiveness in detecting vulnerabilities, obtaining an average accuracy of 62.8% and F1 score of 0.71 across datasets. We find that advanced prompting strategies that involve step-by-step analysis significantly improve performance of LLMs on real-world datasets in terms of F1 score (by upto 0.18 on average)
arXiv Detail & Related papers (2023-11-16T13:17:20Z)
REEF: A Framework for Collecting Real-World Vulnerabilities and Fixes [40.401211102969356]
We propose an automated collecting framework REEF to collect REal-world vulnErabilities and Fixes from open-source repositories. We develop a multi-language crawler to collect vulnerabilities and their fixes, and design metrics to filter for high-quality vulnerability-fix pairs. Through extensive experiments, we demonstrate that our approach can collect high-quality vulnerability-fix pairs and generate strong explanations.
arXiv Detail & Related papers (2023-09-15T02:50:08Z)
OutCenTR: A novel semi-supervised framework for predicting exploits of vulnerabilities in high-dimensional datasets [0.0]
We make use of outlier detection techniques to predict vulnerabilities that are likely to be exploited. We propose a dimensionality reduction technique, OutCenTR, that enhances the baseline outlier detection models. The results of our experiments show on average a 5-fold improvement of F1 score in comparison with state-of-the-art dimensionality reduction techniques.
arXiv Detail & Related papers (2023-04-03T00:34:41Z)
CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models [58.27254444280376]
Large language models (LLMs) for automatic code generation have achieved breakthroughs in several programming tasks. Training data for these models is usually collected from the Internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities. This unsanitized training data can cause the language models to learn these vulnerabilities and propagate them during the code generation procedure.
arXiv Detail & Related papers (2023-02-08T11:54:07Z)
VELVET: a noVel Ensemble Learning approach to automatically locate VulnErable sTatements [62.93814803258067]
This paper presents VELVET, a novel ensemble learning approach to locate vulnerable statements in source code. Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph. VELVET achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively.
arXiv Detail & Related papers (2021-12-20T22:45:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.