Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree
- URL: http://arxiv.org/abs/2002.08653v1
- Date: Thu, 20 Feb 2020 10:18:37 GMT
- Title: Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree
- Authors: Wenhan Wang, Ge Li, Bo Ma, Xin Xia, Zhi Jin
- Abstract summary: We build a graph representation of programs called flow-augmented abstract syntax tree (FA-AST)
We apply two different types of graph neural networks on FA-AST to measure the similarity of code pairs.
Our approach outperforms the state-of-the-art approaches on both Google Code Jam and BigCloneBench tasks.
- Score: 30.484662671342935
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code clones are pairs of code fragments that are semantically
similar and may be syntactically similar or different. Detecting code clones
can help reduce the cost of software maintenance and prevent bugs. Numerous
approaches to detecting code clones have been proposed, but most of them focus on
detecting syntactic clones and do not work well on semantic clones with
different syntactic features. To detect semantic clones, researchers have tried
to adopt deep learning for code clone detection to automatically learn latent
semantic features from data. In particular, to leverage grammar information,
several approaches used abstract syntax trees (AST) as input and achieved
significant progress on code clone benchmarks in various programming languages.
However, these AST-based approaches still cannot fully leverage the structural
information of code fragments, especially semantic information such as control
flow and data flow. To leverage control and data flow information, in this
paper, we build a graph representation of programs called flow-augmented
abstract syntax tree (FA-AST). We construct FA-AST by augmenting original ASTs
with explicit control and data flow edges. Then we apply two different types of
graph neural networks (GNN) on FA-AST to measure the similarity of code pairs.
To the best of our knowledge, we are the first to apply graph neural networks
to the domain of code clone detection.
We apply our FA-AST and graph neural networks on two Java datasets: Google
Code Jam and BigCloneBench. Our approach outperforms the state-of-the-art
approaches on both Google Code Jam and BigCloneBench tasks.
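The FA-AST idea described above (an AST augmented with explicit control-flow and data-flow edges, embedded by a GNN and compared pairwise) can be sketched in miniature. This is a hedged illustration, not the authors' implementation: the node names, edge types ("ast", "cfg", "dfg"), the averaging message-passing scheme, and the toy feature vectors are all illustrative assumptions.

```python
# Minimal sketch of FA-AST-style clone scoring (illustrative, not the paper's model).
# Each node maps to an initial feature vector; edges are (src, dst, type) triples
# where type is "ast" (parent-child), "cfg" (control flow), or "dfg" (data flow).

def message_pass(features, edges, rounds=2):
    """Average-neighbor message passing: each round, a node's vector
    becomes the mean of itself and the targets of its outgoing edges."""
    for _ in range(rounds):
        new = {}
        for node, vec in features.items():
            neigh = [features[dst] for (src, dst, _t) in edges if src == node]
            stack = [vec] + neigh
            new[node] = [sum(v[i] for v in stack) / len(stack)
                         for i in range(len(vec))]
        features = new
    return features

def graph_embedding(features):
    """Mean-pool node vectors into a single graph-level vector."""
    vecs = list(features.values())
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(len(vecs[0]))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

# Two tiny "programs": AST edges plus one control-flow and one data-flow edge.
g1_feats = {"fn": [1.0, 0.0], "if": [0.0, 1.0], "ret": [0.5, 0.5]}
g1_edges = [("fn", "if", "ast"), ("if", "ret", "ast"),
            ("if", "ret", "cfg"), ("fn", "ret", "dfg")]
g2_feats = {"fn": [1.0, 0.0], "if": [0.0, 1.0], "ret": [0.4, 0.6]}
g2_edges = list(g1_edges)

e1 = graph_embedding(message_pass(g1_feats, g1_edges))
e2 = graph_embedding(message_pass(g2_feats, g2_edges))
score = cosine(e1, e2)  # high for structurally near-identical graphs
```

In the paper's actual setting the node features come from AST node types, the GNNs are trained (the authors use two GNN variants), and similarity is learned rather than raw cosine over hand-set vectors; this sketch only shows the data-flow of the approach.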
Related papers
- CC2Vec: Combining Typed Tokens with Contrastive Learning for Effective Code Clone Detection [20.729032739935132]
CC2Vec is a novel code encoding method designed to swiftly identify simple code clones.
We evaluate CC2Vec on two widely used datasets (i.e., BigCloneBench and Google Code Jam)
arXiv Detail & Related papers (2024-05-01T10:18:31Z)
- Source Code Vulnerability Detection: Combining Code Language Models and Code Property Graphs [5.953617559607503]
Vul-LMGNN is a unified model that combines pre-trained code language models with code property graphs.
Vul-LMGNN constructs a code property graph that integrates various code attributes into a unified graph structure.
To effectively retain dependency information among various attributes, we introduce a gated code Graph Neural Network.
arXiv Detail & Related papers (2024-04-23T03:48:18Z)
- Source Code is a Graph, Not a Sequence: A Cross-Lingual Perspective on Code Clone Detection [3.3298891718069648]
We show that graph-based methods are more suitable for code clone detection than sequence-based methods.
We show that CodeGraph outperforms CodeBERT on both data-sets, especially on cross-lingual code clones.
arXiv Detail & Related papers (2023-12-27T09:30:31Z)
- CONCORD: Clone-aware Contrastive Learning for Source Code [64.51161487524436]
Self-supervised pre-training has gained traction for learning generic code representations valuable for many downstream SE tasks.
We argue that it is also essential to factor in how developers code day-to-day for general-purpose representation learning.
In particular, we propose CONCORD, a self-supervised, contrastive learning strategy to place benign clones closer in the representation space while moving deviants further apart.
arXiv Detail & Related papers (2023-06-05T20:39:08Z)
- ASTRO: An AST-Assisted Approach for Generalizable Neural Clone Detection [12.794933981621941]
Most neural clone detection methods do not generalize beyond the scope of clones that appear in the training dataset.
We present an Abstract Syntax Tree (AST) assisted approach for generalizable neural clone detection, or ASTRO.
Our experimental results show that ASTRO improves state-of-the-art neural clone detection approaches in both recall and F-1 scores.
arXiv Detail & Related papers (2022-08-17T04:50:51Z)
- Evaluation of Contrastive Learning with Various Code Representations for Code Clone Detection [3.699097874146491]
We evaluate contrastive learning for detecting semantic clones of code snippets.
We use CodeTransformator to create a dataset that mimics plagiarised code based on competitive programming solutions.
The results of our evaluation show that the proposed models perform differently across tasks; however, the graph-based models generally outperform the others.
arXiv Detail & Related papers (2022-06-17T12:25:44Z)
- GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code.
We use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables.
We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
arXiv Detail & Related papers (2020-09-17T15:25:56Z)
- Contrastive Code Representation Learning [95.86686147053958]
We show that the popular reconstruction-based BERT model is sensitive to source code edits, even when the edits preserve semantics.
We propose ContraCode: a contrastive pre-training task that learns code functionality, not form.
arXiv Detail & Related papers (2020-07-09T17:59:06Z)
- Learning to map source code to software vulnerability using code-as-a-graph [67.62847721118142]
We explore the applicability of Graph Neural Networks in learning the nuances of source code from a security perspective.
We show that a code-as-graph encoding is more meaningful for vulnerability detection than existing code-as-photo and linear sequence encoding approaches.
arXiv Detail & Related papers (2020-06-15T16:05:27Z)
- Improved Code Summarization via a Graph Neural Network [96.03715569092523]
In general, source code summarization techniques take source code as input and output a natural language description.
We present an approach that uses a graph-based neural architecture that better matches the default structure of the AST to generate these summaries.
arXiv Detail & Related papers (2020-04-06T17:36:42Z)
- Auto-Encoding Twin-Bottleneck Hashing [141.5378966676885]
This paper proposes an efficient and adaptive code-driven graph.
It is updated by decoding in the context of an auto-encoder.
Experiments on benchmarked datasets clearly show the superiority of our framework over the state-of-the-art hashing methods.
arXiv Detail & Related papers (2020-02-27T05:58:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.