Related papers: Enhancing Pre-Trained Language Models for Vulnerability Detection via Semantic-Preserving Data Augmentation

Enhancing Pre-Trained Language Models for Vulnerability Detection via Semantic-Preserving Data Augmentation

URL: http://arxiv.org/abs/2410.00249v2
Date: Thu, 3 Oct 2024 00:37:10 GMT
Title: Enhancing Pre-Trained Language Models for Vulnerability Detection via Semantic-Preserving Data Augmentation
Authors: Weiliang Qi, Jiahao Cao, Darsh Poddar, Sophia Li, Xinda Wang,
Abstract summary: We propose a data augmentation technique aimed at enhancing the performance of pre-trained language models for vulnerability detection. By incorporating our augmented dataset in fine-tuning a series of representative code pre-trained models, up to 10.1% increase in accuracy and 23.6% increase in F1 can be achieved.
Score: 4.374800396968465
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: With the rapid development and widespread use of advanced network systems, software vulnerabilities pose a significant threat to secure communications and networking. Learning-based vulnerability detection systems, particularly those leveraging pre-trained language models, have demonstrated significant potential in promptly identifying vulnerabilities in communication networks and reducing the risk of exploitation. However, the shortage of accurately labeled vulnerability datasets hinders further progress in this field. Failing to represent real-world vulnerability data variety and preserve vulnerability semantics, existing augmentation approaches provide limited or even counterproductive contributions to model training. In this paper, we propose a data augmentation technique aimed at enhancing the performance of pre-trained language models for vulnerability detection. Given the vulnerability dataset, our method performs natural semantic-preserving program transformation to generate a large volume of new samples with enriched data diversity and variety. By incorporating our augmented dataset in fine-tuning a series of representative code pre-trained models (i.e., CodeBERT, GraphCodeBERT, UnixCoder, and PDBERT), up to 10.1% increase in accuracy and 23.6% increase in F1 can be achieved in the vulnerability detection task. Comparison results also show that our proposed method can substantially outperform other prominent vulnerability augmentation approaches.

Related papers

Lie Detector: Unified Backdoor Detection via Cross-Examination Framework [68.45399098884364]
We propose a unified backdoor detection framework in the semi-honest setting. Our method achieves superior detection performance, improving accuracy by 5.4%, 1.6%, and 11.9% over SoTA baselines. Notably, it is the first to effectively detect backdoors in multimodal large language models.
arXiv Detail & Related papers (2025-03-21T06:12:06Z)
In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models [97.82118821263825]
Text-to-image (T2I) models have shown remarkable progress, but their potential to generate harmful content remains a critical concern in the ML community. We propose ICER, a novel red-teaming framework that generates interpretable and semantic meaningful problematic prompts. Our work provides crucial insights for developing more robust safety mechanisms in T2I systems.
arXiv Detail & Related papers (2024-11-25T04:17:24Z)
CRepair: CVAE-based Automatic Vulnerability Repair Technology [1.147605955490786]
Software vulnerabilities pose significant threats to the integrity, security, and reliability of modern software and its application data. To address the challenges of vulnerability repair, researchers have proposed various solutions, with learning-based automatic vulnerability repair techniques gaining widespread attention. This paper proposes CRepair, a CVAE-based automatic vulnerability repair technology aimed at fixing security vulnerabilities in system code.
arXiv Detail & Related papers (2024-11-08T12:55:04Z)
DFEPT: Data Flow Embedding for Enhancing Pre-Trained Model Based Vulnerability Detection [7.802093464108404]
We propose a data flow embedding technique to enhance the performance of pre-trained models in vulnerability detection tasks. Specifically, we parse data flow graphs from function-level source code, and use the data type of the variable as the node characteristics of the DFG. Our research shows that DFEPT can provide effective vulnerability semantic information to pre-trained models, achieving an accuracy of 64.97% on the Devign dataset and an F1-Score of 47.9% on the Reveal dataset.
arXiv Detail & Related papers (2024-10-24T07:05:07Z)
Adversarial Robustification via Text-to-Image Diffusion Models [56.37291240867549]
Adrial robustness has been conventionally believed as a challenging property to encode for neural networks. We develop a scalable and model-agnostic solution to achieve adversarial robustness without using any data.
arXiv Detail & Related papers (2024-07-26T10:49:14Z)
Security Vulnerability Detection with Multitask Self-Instructed Fine-Tuning of Large Language Models [8.167614500821223]
We introduce MSIVD, multitask self-instructed fine-tuning for vulnerability detection, inspired by chain-of-thought prompting and LLM self-instruction. Our experiments demonstrate that MSIVD achieves superior performance, outperforming the highest LLM-based vulnerability detector baseline (LineVul) with a F1 score of 0.92 on the BigVul dataset, and 0.48 on the PreciseBugs dataset.
arXiv Detail & Related papers (2024-06-09T19:18:05Z)
Enhancing Code Vulnerability Detection via Vulnerability-Preserving Data Augmentation [29.72520866016839]
Source code vulnerability detection aims to identify inherent vulnerabilities to safeguard software systems from potential attacks. Many prior studies overlook diverse vulnerability characteristics, simplifying the problem into a binary (0-1) classification task. FGVulDet employs multiple classifiers to discern characteristics of various vulnerability types and combines their outputs to identify the specific type of vulnerability. FGVulDet is trained on a large-scale dataset from GitHub, encompassing five different types of vulnerabilities.
arXiv Detail & Related papers (2024-04-15T09:10:52Z)
FaultGuard: A Generative Approach to Resilient Fault Prediction in Smart Electrical Grids [53.2306792009435]
FaultGuard is the first framework for fault type and zone classification resilient to adversarial attacks. We propose a low-complexity fault prediction model and an online adversarial training technique to enhance robustness. Our model outclasses the state-of-the-art for resilient fault prediction benchmarking, with an accuracy of up to 0.958.
arXiv Detail & Related papers (2024-03-26T08:51:23Z)
Enhancing Multiple Reliability Measures via Nuisance-extended Information Bottleneck [77.37409441129995]
In practical scenarios where training data is limited, many predictive signals in the data can be rather from some biases in data acquisition. We consider an adversarial threat model under a mutual information constraint to cover a wider class of perturbations in training. We propose an autoencoder-based training to implement the objective, as well as practical encoder designs to facilitate the proposed hybrid discriminative-generative training.
arXiv Detail & Related papers (2023-03-24T16:03:21Z)
Improving robustness of jet tagging algorithms with adversarial training [56.79800815519762]
We investigate the vulnerability of flavor tagging algorithms via application of adversarial attacks. We present an adversarial training strategy that mitigates the impact of such simulated attacks.
arXiv Detail & Related papers (2022-03-25T19:57:19Z)
VELVET: a noVel Ensemble Learning approach to automatically locate VulnErable sTatements [62.93814803258067]
This paper presents VELVET, a novel ensemble learning approach to locate vulnerable statements in source code. Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph. VELVET achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively.
arXiv Detail & Related papers (2021-12-20T22:45:27Z)
V2W-BERT: A Framework for Effective Hierarchical Multiclass Classification of Software Vulnerabilities [7.906207218788341]
We present a novel Transformer-based learning framework (V2W-BERT) in this paper. By using ideas from natural language processing, link prediction and transfer learning, our method outperforms previous approaches. We achieve up to 97% prediction accuracy for randomly partitioned data and up to 94% prediction accuracy in temporally partitioned data.
arXiv Detail & Related papers (2021-02-23T05:16:57Z)

This list is automatically generated from the titles and abstracts of the papers in this site.