Learning to map source code to software vulnerability using
code-as-a-graph
- URL: http://arxiv.org/abs/2006.08614v1
- Date: Mon, 15 Jun 2020 16:05:27 GMT
- Title: Learning to map source code to software vulnerability using
code-as-a-graph
- Authors: Sahil Suneja, Yunhui Zheng, Yufan Zhuang, Jim Laredo, Alessandro
Morari
- Abstract summary: We explore the applicability of Graph Neural Networks in learning the nuances of source code from a security perspective.
We show that a code-as-graph encoding is more meaningful for vulnerability detection than existing code-as-photo and linear sequence encoding approaches.
- Score: 67.62847721118142
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We explore the applicability of Graph Neural Networks in learning the nuances
of source code from a security perspective. Specifically, whether signatures of
vulnerabilities in source code can be learned from its graph representation, in
terms of relationships between nodes and edges. We create a pipeline we call
AI4VA, which first encodes a sample source code into a Code Property Graph. The
extracted graph is then vectorized in a manner which preserves its semantic
information. A Gated Graph Neural Network is then trained using several such
graphs to automatically extract templates differentiating the graph of a
vulnerable sample from a healthy one. Our model outperforms static analyzers,
classic machine learning, as well as CNN and RNN-based deep learning models on
two of the three datasets we experiment with. We thus show that a code-as-graph
encoding is more meaningful for vulnerability detection than existing
code-as-photo and linear sequence encoding approaches. (Submitted Oct 2019,
Paper #28, ICST)
Related papers
- Source Code Vulnerability Detection: Combining Code Language Models and Code Property Graphs [5.953617559607503]
Vul-LMGNN is a unified model that combines pre-trained code language models with code property graphs.
Vul-LMGNN constructs a code property graph that integrates various code attributes into a unified graph structure.
To effectively retain dependency information among various attributes, we introduce a gated code Graph Neural Network.
arXiv Detail & Related papers (2024-04-23T03:48:18Z) - Rethinking Tokenizer and Decoder in Masked Graph Modeling for Molecules [81.05116895430375]
Masked graph modeling excels in the self-supervised representation learning of molecular graphs.
We show that a subgraph-level tokenizer and a sufficiently expressive decoder with remask decoding have a large impact on the encoder's representation learning.
We propose a novel MGM method SimSGT, featuring a Simple GNN-based Tokenizer (SGT) and an effective decoding strategy.
arXiv Detail & Related papers (2023-10-23T09:40:30Z) - Sequential Graph Neural Networks for Source Code Vulnerability
Identification [5.582101184758527]
We present a properly curated C/C++ source code vulnerability dataset to aid in developing models.
We also propose a learning framework based on graph neural networks, denoted SEquential Graph Neural Network (SEGNN) for learning a large number of code semantic representations.
Our evaluations on two datasets and four baseline methods in a graph classification setting demonstrate state-of-the-art results.
arXiv Detail & Related papers (2023-05-23T17:25:51Z) - ReGVD: Revisiting Graph Neural Networks for Vulnerability Detection [20.65271290295621]
We propose ReGVD, a graph network-based model for vulnerability detection.
In particular, ReGVD views a given source code as a flat sequence of tokens.
We obtain the highest accuracy on the real-world benchmark dataset from CodeXGLUE for vulnerability detection.
arXiv Detail & Related papers (2021-10-14T12:44:38Z) - Software Vulnerability Detection via Deep Learning over Disaggregated
Code Graph Representation [57.92972327649165]
This work explores a deep learning approach to automatically learn the insecure patterns from code corpora.
Because code naturally admits graph structures with parsing, we develop a novel graph neural network (GNN) to exploit both the semantic context and structural regularity of a program.
arXiv Detail & Related papers (2021-09-07T21:24:36Z) - deGraphCS: Embedding Variable-based Flow Graph for Neural Code Search [15.19181807445119]
We propose a learnable deep Graph for Code Search (called deGraphCS) to transfer source code into variable-based flow graphs.
We collect a large-scale dataset from GitHub containing 41,152 code snippets written in C language.
arXiv Detail & Related papers (2021-03-24T06:57:44Z) - Heuristic Semi-Supervised Learning for Graph Generation Inspired by
Electoral College [80.67842220664231]
We propose a novel pre-processing technique, namely ELectoral COllege (ELCO), which automatically expands new nodes and edges to refine the label similarity within a dense subgraph.
In all setups tested, our method boosts the average score of base models by a large margin of 4.7 points, as well as consistently outperforms the state-of-the-art.
arXiv Detail & Related papers (2020-06-10T14:48:48Z) - Auto-decoding Graphs [91.3755431537592]
The generative model is an auto-decoder that learns to synthesize graphs from latent codes.
Graphs are synthesized using self-attention modules that are trained to identify likely connectivity patterns.
arXiv Detail & Related papers (2020-06-04T14:23:01Z) - Improved Code Summarization via a Graph Neural Network [96.03715569092523]
In general, source code summarization techniques use the source code as input and outputs a natural language description.
We present an approach that uses a graph-based neural architecture that better matches the default structure of the AST to generate these summaries.
arXiv Detail & Related papers (2020-04-06T17:36:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.