An Unbiased Transformer Source Code Learning with Semantic Vulnerability
Graph
- URL: http://arxiv.org/abs/2304.11072v1
- Date: Mon, 17 Apr 2023 20:54:14 GMT
- Title: An Unbiased Transformer Source Code Learning with Semantic Vulnerability
Graph
- Authors: Nafis Tanveer Islam, Gonzalo De La Torre Parra, Dylan Manuel, Elias
Bou-Harb, Peyman Najafirad
- Abstract summary: Current vulnerability screening techniques are ineffective at identifying novel vulnerabilities or providing developers with code vulnerability and classification.
To address these issues, we propose a joint multitasked unbiased vulnerability classifier comprising a transformer "RoBERTa" and graph convolution neural network (GCN)
We present a training process utilizing a semantic vulnerability graph (SVG) representation from source code, created by integrating edges from a sequential flow, control flow, and data flow, as well as a novel flow dubbed Poacher Flow (PF)
- Score: 3.3598755777055374
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Over the years, open-source software systems have become prey to threat
actors. Even as open-source communities act quickly to patch the breach, code
vulnerability screening should be an integral part of agile software
development from the beginning. Unfortunately, current vulnerability screening
techniques are ineffective at identifying novel vulnerabilities or providing
developers with code vulnerability and classification. Furthermore, the
datasets used for vulnerability learning often exhibit distribution shifts from
the real-world testing distribution due to novel attack strategies deployed by
adversaries and as a result, the machine learning model's performance may be
hindered or biased. To address these issues, we propose a joint interpolated
multitasked unbiased vulnerability classifier comprising a transformer
"RoBERTa" and graph convolution neural network (GCN). We present a training
process utilizing a semantic vulnerability graph (SVG) representation from
source code, created by integrating edges from a sequential flow, control flow,
and data flow, as well as a novel flow dubbed Poacher Flow (PF). Poacher flow
edges reduce the gap between dynamic and static program analysis and handle
complex long-range dependencies. Moreover, our approach reduces biases of
classifiers regarding unbalanced datasets by integrating Focal Loss objective
function along with SVG. Remarkably, experimental results show that our
classifier outperforms state-of-the-art results on vulnerability detection with
fewer false negatives and false positives. After testing our model across
multiple datasets, it shows an improvement of at least 2.41% and 18.75% in the
best-case scenario. Evaluations using N-day program samples demonstrate that
our proposed approach achieves a 93% accuracy and was able to detect 4,
zero-day vulnerabilities from popular GitHub repositories.
Related papers
- A Combined Feature Embedding Tools for Multi-Class Software Defect and Identification [2.2020053359163305]
We present CodeGraphNet, an experimental method that combines GraphCodeBERT and Graph Convolutional Network approaches.
This method captures intricate relation- ships between features, providing for more exact identification and separation of vulnerabilities.
The DeepTree model, which is a hybrid of a Decision Tree and a Neural Network, outperforms state-of-the-art approaches.
arXiv Detail & Related papers (2024-11-26T17:33:02Z) - DFEPT: Data Flow Embedding for Enhancing Pre-Trained Model Based Vulnerability Detection [7.802093464108404]
We propose a data flow embedding technique to enhance the performance of pre-trained models in vulnerability detection tasks.
Specifically, we parse data flow graphs from function-level source code, and use the data type of the variable as the node characteristics of the DFG.
Our research shows that DFEPT can provide effective vulnerability semantic information to pre-trained models, achieving an accuracy of 64.97% on the Devign dataset and an F1-Score of 47.9% on the Reveal dataset.
arXiv Detail & Related papers (2024-10-24T07:05:07Z) - Enhancing Code Vulnerability Detection via Vulnerability-Preserving Data Augmentation [29.72520866016839]
Source code vulnerability detection aims to identify inherent vulnerabilities to safeguard software systems from potential attacks.
Many prior studies overlook diverse vulnerability characteristics, simplifying the problem into a binary (0-1) classification task.
FGVulDet employs multiple classifiers to discern characteristics of various vulnerability types and combines their outputs to identify the specific type of vulnerability.
FGVulDet is trained on a large-scale dataset from GitHub, encompassing five different types of vulnerabilities.
arXiv Detail & Related papers (2024-04-15T09:10:52Z) - Client-side Gradient Inversion Against Federated Learning from Poisoning [59.74484221875662]
Federated Learning (FL) enables distributed participants to train a global model without sharing data directly to a central server.
Recent studies have revealed that FL is vulnerable to gradient inversion attack (GIA), which aims to reconstruct the original training samples.
We propose Client-side poisoning Gradient Inversion (CGI), which is a novel attack method that can be launched from clients.
arXiv Detail & Related papers (2023-09-14T03:48:27Z) - CONVERT:Contrastive Graph Clustering with Reliable Augmentation [110.46658439733106]
We propose a novel CONtrastiVe Graph ClustEring network with Reliable AugmenTation (CONVERT)
In our method, the data augmentations are processed by the proposed reversible perturb-recover network.
To further guarantee the reliability of semantics, a novel semantic loss is presented to constrain the network.
arXiv Detail & Related papers (2023-08-17T13:07:09Z) - Enhancing Multiple Reliability Measures via Nuisance-extended
Information Bottleneck [77.37409441129995]
In practical scenarios where training data is limited, many predictive signals in the data can be rather from some biases in data acquisition.
We consider an adversarial threat model under a mutual information constraint to cover a wider class of perturbations in training.
We propose an autoencoder-based training to implement the objective, as well as practical encoder designs to facilitate the proposed hybrid discriminative-generative training.
arXiv Detail & Related papers (2023-03-24T16:03:21Z) - DCDetector: An IoT terminal vulnerability mining system based on
distributed deep ensemble learning under source code representation [2.561778620560749]
The goal of the research is to intelligently detect vulnerabilities in source codes of high-level languages such as C/C++.
This enables us to propose a code representation of sensitive sentence-related slices of source code, and to detect vulnerabilities by designing a distributed deep ensemble learning model.
Experiments show that this method can reduce the false positive rate of traditional static analysis and improve the performance and accuracy of machine learning.
arXiv Detail & Related papers (2022-11-29T14:19:14Z) - VELVET: a noVel Ensemble Learning approach to automatically locate
VulnErable sTatements [62.93814803258067]
This paper presents VELVET, a novel ensemble learning approach to locate vulnerable statements in source code.
Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph.
VELVET achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively.
arXiv Detail & Related papers (2021-12-20T22:45:27Z) - Software Vulnerability Detection via Deep Learning over Disaggregated
Code Graph Representation [57.92972327649165]
This work explores a deep learning approach to automatically learn the insecure patterns from code corpora.
Because code naturally admits graph structures with parsing, we develop a novel graph neural network (GNN) to exploit both the semantic context and structural regularity of a program.
arXiv Detail & Related papers (2021-09-07T21:24:36Z) - Information Obfuscation of Graph Neural Networks [96.8421624921384]
We study the problem of protecting sensitive attributes by information obfuscation when learning with graph structured data.
We propose a framework to locally filter out pre-determined sensitive attributes via adversarial training with the total variation and the Wasserstein distance.
arXiv Detail & Related papers (2020-09-28T17:55:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.