Source Code is a Graph, Not a Sequence: A Cross-Lingual Perspective on Code Clone Detection
- URL: http://arxiv.org/abs/2312.16488v1
- Date: Wed, 27 Dec 2023 09:30:31 GMT
- Title: Source Code is a Graph, Not a Sequence: A Cross-Lingual Perspective on Code Clone Detection
- Authors: Mohammed Ataaur Rahaman, Julia Ive
- Abstract summary: We show that graph-based methods are more suitable for code clone detection than sequence-based methods.
We show that CodeGraph outperforms CodeBERT on both data-sets, especially on cross-lingual code clones.
- Score: 3.3298891718069648
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Source code clone detection is the task of finding code fragments that have
the same or similar functionality, but may differ in syntax or structure. This
task is important for software maintenance, reuse, and quality assurance (Roy
et al. 2009). However, code clone detection is challenging, as source code can
be written in different languages, domains, and styles. In this paper, we argue
that source code is inherently a graph, not a sequence, and that graph-based
methods are more suitable for code clone detection than sequence-based methods.
We compare the performance of two state-of-the-art models: CodeBERT (Feng et
al. 2020), a sequence-based model, and CodeGraph (Yu et al. 2023), a
graph-based model, on two benchmark datasets: BCB (Svajlenko et al. 2014) and
PoolC (PoolC, n.d.). We show that CodeGraph outperforms CodeBERT on both
datasets, especially on cross-lingual code clones. To the best of our
knowledge, this is the first work to demonstrate the superiority of graph-based
methods over sequence-based methods on cross-lingual code clone detection.
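To make the sequence-versus-graph contrast concrete, the sketch below (an editorial illustration, not code from the paper) views the same snippet both ways using Python's standard tokenize and ast modules: a sequence model such as CodeBERT consumes the flat token stream, while a graph model such as CodeGraph can exploit the tree's edge structure.

```python
import ast
import io
import tokenize

SOURCE = "def add(a, b):\n    return a + b\n"

# Sequence view: the flat token stream a sequence-based model consumes.
tokens = [
    tok.string
    for tok in tokenize.generate_tokens(io.StringIO(SOURCE).readline)
    if tok.string.strip()
]
print("sequence:", tokens)
# sequence: ['def', 'add', '(', 'a', ',', 'b', ')', ':', 'return', 'a', '+', 'b']

# Graph view: parent-child AST edges a graph-based model can exploit.
tree = ast.parse(SOURCE)
edges = [
    (type(parent).__name__, type(child).__name__)
    for parent in ast.walk(tree)
    for child in ast.iter_child_nodes(parent)
]
print("graph edges:", edges)
# e.g. ('Module', 'FunctionDef'), ('FunctionDef', 'arguments'), ('BinOp', 'Add'), ...
```

A clone of add that renames its variables yields a different token sequence but an isomorphic edge structure, which is the intuition behind the paper's claim.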
Related papers
- SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code
Summarization [51.67317895094664]
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects.
We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
arXiv Detail & Related papers (2024-01-26T09:23:27Z) - CONCORD: Clone-aware Contrastive Learning for Source Code [64.51161487524436]
- CONCORD: Clone-aware Contrastive Learning for Source Code [64.51161487524436]
Self-supervised pre-training has gained traction for learning generic code representations valuable for many downstream SE tasks.
We argue that it is also essential to factor in how developers code day-to-day for general-purpose representation learning.
In particular, we propose CONCORD, a self-supervised, contrastive learning strategy to place benign clones closer in the representation space while moving deviants further apart.
arXiv Detail & Related papers (2023-06-05T20:39:08Z) - Evaluation of Contrastive Learning with Various Code Representations for
- Evaluation of Contrastive Learning with Various Code Representations for Code Clone Detection [3.699097874146491]
We evaluate contrastive learning for detecting semantic clones of code snippets.
We use CodeTransformator to create a dataset that mimics plagiarised code based on competitive programming solutions.
Our evaluation shows that the proposed models perform unevenly across tasks; however, the graph-based models generally outperform the others.
arXiv Detail & Related papers (2022-06-17T12:25:44Z) - Enhancing Semantic Code Search with Multimodal Contrastive Learning and
Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z) - deGraphCS: Embedding Variable-based Flow Graph for Neural Code Search [15.19181807445119]
We propose a learnable deep Graph for Code Search (called deGraphCS) to transfer source code into variable-based flow graphs.
We collect a large-scale dataset from GitHub containing 41,152 code snippets written in C language.
arXiv Detail & Related papers (2021-03-24T06:57:44Z) - GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code.
We use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables.
We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
arXiv Detail & Related papers (2020-09-17T15:25:56Z) - Learning to map source code to software vulnerability using
- Learning to map source code to software vulnerability using code-as-a-graph [67.62847721118142]
We explore the applicability of Graph Neural Networks in learning the nuances of source code from a security perspective.
We show that a code-as-graph encoding is more meaningful for vulnerability detection than existing code-as-photo and linear sequence encoding approaches.
arXiv Detail & Related papers (2020-06-15T16:05:27Z) - Auto-Encoding Twin-Bottleneck Hashing [141.5378966676885]
This paper proposes an efficient and adaptive code-driven graph that is updated by decoding in the context of an auto-encoder.
Experiments on benchmarked datasets clearly show the superiority of our framework over the state-of-the-art hashing methods.
arXiv Detail & Related papers (2020-02-27T05:58:12Z) - Detecting Code Clones with Graph Neural Networkand Flow-Augmented
Abstract Syntax Tree [30.484662671342935]
We build a graph representation of programs called the flow-augmented abstract syntax tree (FA-AST).
We apply two different types of graph neural networks on FA-AST to measure the similarity of code pairs.
Our approach outperforms the state-of-the-art approaches on both Google Code Jam and BigCloneBench tasks.
arXiv Detail & Related papers (2020-02-20T10:18:37Z)