Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree
- URL: http://arxiv.org/abs/2002.08653v1
- Date: Thu, 20 Feb 2020 10:18:37 GMT
- Title: Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree
- Authors: Wenhan Wang, Ge Li, Bo Ma, Xin Xia, Zhi Jin
- Abstract summary: We build a graph representation of programs called flow-augmented abstract syntax tree (FA-AST)
We apply two different types of graph neural networks on FA-AST to measure the similarity of code pairs.
Our approach outperforms the state-of-the-art approaches on both Google Code Jam and BigCloneBench tasks.
- Score: 30.484662671342935
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code clones are pairs of code fragments that are semantically
similar and may be syntactically similar or different. Detecting code clones
can help reduce the cost of software maintenance and prevent bugs. Numerous
approaches to detecting code clones have been proposed, but most of them focus on
detecting syntactic clones and do not work well on semantic clones with
different syntactic features. To detect semantic clones, researchers have tried
to adopt deep learning for code clone detection to automatically learn latent
semantic features from data. In particular, to leverage grammar information,
several approaches used abstract syntax trees (AST) as input and achieved
significant progress on code clone benchmarks in various programming languages.
However, these AST-based approaches still cannot fully leverage the structural
information of code fragments, especially semantic information such as control
flow and data flow. To leverage control and data flow information, in this
paper, we build a graph representation of programs called flow-augmented
abstract syntax tree (FA-AST). We construct FA-AST by augmenting original ASTs
with explicit control and data flow edges. Then we apply two different types of
graph neural networks (GNN) on FA-AST to measure the similarity of code pairs.
To the best of our knowledge, we are the first to apply graph neural networks
to the domain of code clone detection.
We apply our FA-AST and graph neural networks on two Java datasets: Google
Code Jam and BigCloneBench. Our approach outperforms the state-of-the-art
approaches on both Google Code Jam and BigCloneBench tasks.
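The FA-AST idea described above (an AST augmented with explicit control-flow and data-flow edges, embedded by a GNN and compared pairwise) can be sketched in miniature. This is a hedged illustration, not the authors' implementation: the node names, edge types ("ast", "cfg", "dfg"), the averaging message-passing scheme, and the toy feature vectors are all illustrative assumptions.

```python
# Minimal sketch of FA-AST-style clone scoring (illustrative, not the paper's model).
# Each node maps to an initial feature vector; edges are (src, dst, type) triples
# where type is "ast" (parent-child), "cfg" (control flow), or "dfg" (data flow).

def message_pass(features, edges, rounds=2):
    """Average-neighbor message passing: each round, a node's vector
    becomes the mean of itself and the targets of its outgoing edges."""
    for _ in range(rounds):
        new = {}
        for node, vec in features.items():
            neigh = [features[dst] for (src, dst, _t) in edges if src == node]
            stack = [vec] + neigh
            new[node] = [sum(v[i] for v in stack) / len(stack)
                         for i in range(len(vec))]
        features = new
    return features

def graph_embedding(features):
    """Mean-pool node vectors into a single graph-level vector."""
    vecs = list(features.values())
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(len(vecs[0]))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

# Two tiny "programs": AST edges plus one control-flow and one data-flow edge.
g1_feats = {"fn": [1.0, 0.0], "if": [0.0, 1.0], "ret": [0.5, 0.5]}
g1_edges = [("fn", "if", "ast"), ("if", "ret", "ast"),
            ("if", "ret", "cfg"), ("fn", "ret", "dfg")]
g2_feats = {"fn": [1.0, 0.0], "if": [0.0, 1.0], "ret": [0.4, 0.6]}
g2_edges = list(g1_edges)

e1 = graph_embedding(message_pass(g1_feats, g1_edges))
e2 = graph_embedding(message_pass(g2_feats, g2_edges))
score = cosine(e1, e2)  # high for structurally near-identical graphs
```

In the paper's actual setting the node features come from AST node types, the GNNs are trained (the authors use two GNN variants), and similarity is learned rather than raw cosine over hand-set vectors; this sketch only shows the data-flow of the approach.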
Related papers
- CC2Vec: Combining Typed Tokens with Contrastive Learning for Effective Code Clone Detection [20.729032739935132]
CC2Vec is a novel code encoding method designed to swiftly identify simple code clones.
We evaluate CC2Vec on two widely used datasets (i.e., BigCloneBench and Google Code Jam)
arXiv Detail & Related papers (2024-05-01T10:18:31Z)
- Source Code Vulnerability Detection: Combining Code Language Models and Code Property Graphs [5.953617559607503]
Vul-LMGNN is a unified model that combines pre-trained code language models with code property graphs.
Vul-LMGNN constructs a code property graph that integrates various code attributes into a unified graph structure.
To effectively retain dependency information among various attributes, we introduce a gated code Graph Neural Network.
arXiv Detail & Related papers (2024-04-23T03:48:18Z)
- Source Code is a Graph, Not a Sequence: A Cross-Lingual Perspective on Code Clone Detection [3.3298891718069648]
We show that graph-based methods are more suitable for code clone detection than sequence-based methods.
We show that CodeGraph outperforms CodeBERT on both data-sets, especially on cross-lingual code clones.
arXiv Detail & Related papers (2023-12-27T09:30:31Z)
- CONCORD: Clone-aware Contrastive Learning for Source Code [64.51161487524436]
Self-supervised pre-training has gained traction for learning generic code representations valuable for many downstream SE tasks.
We argue that it is also essential to factor in how developers code day-to-day for general-purpose representation learning.
In particular, we propose CONCORD, a self-supervised, contrastive learning strategy to place benign clones closer in the representation space while moving deviants further apart.
arXiv Detail & Related papers (2023-06-05T20:39:08Z)
- ASTRO: An AST-Assisted Approach for Generalizable Neural Clone Detection [12.794933981621941]
Most neural clone detection methods do not generalize beyond the scope of clones that appear in the training dataset.
We present an Abstract Syntax Tree (AST) assisted approach for generalizable neural clone detection, or ASTRO.
Our experimental results show that ASTRO improves state-of-the-art neural clone detection approaches in both recall and F-1 scores.
arXiv Detail & Related papers (2022-08-17T04:50:51Z)
- Evaluation of Contrastive Learning with Various Code Representations for Code Clone Detection [3.699097874146491]
We evaluate contrastive learning for detecting semantic clones of code snippets.
We use CodeTransformator to create a dataset that mimics plagiarised code based on competitive programming solutions.
The results of our evaluation show that the proposed models perform differently across tasks; however, the graph-based models generally outperform the others.
arXiv Detail & Related papers (2022-06-17T12:25:44Z)
- GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code.
We use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables.
We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
arXiv Detail & Related papers (2020-09-17T15:25:56Z)
- Contrastive Code Representation Learning [95.86686147053958]
We show that the popular reconstruction-based BERT model is sensitive to source code edits, even when the edits preserve semantics.
We propose ContraCode: a contrastive pre-training task that learns code functionality, not form.
arXiv Detail & Related papers (2020-07-09T17:59:06Z)
- Learning to map source code to software vulnerability using code-as-a-graph [67.62847721118142]
We explore the applicability of Graph Neural Networks in learning the nuances of source code from a security perspective.
We show that a code-as-graph encoding is more meaningful for vulnerability detection than existing code-as-photo and linear sequence encoding approaches.
arXiv Detail & Related papers (2020-06-15T16:05:27Z)
- Improved Code Summarization via a Graph Neural Network [96.03715569092523]
In general, source code summarization techniques take source code as input and output a natural language description.
We present an approach that uses a graph-based neural architecture that better matches the default structure of the AST to generate these summaries.
arXiv Detail & Related papers (2020-04-06T17:36:42Z)
- Auto-Encoding Twin-Bottleneck Hashing [141.5378966676885]
This paper proposes an efficient and adaptive code-driven graph.
It is updated by decoding in the context of an auto-encoder.
Experiments on benchmarked datasets clearly show the superiority of our framework over the state-of-the-art hashing methods.
arXiv Detail & Related papers (2020-02-27T05:58:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.