Gitor: Scalable Code Clone Detection by Building Global Sample Graph
- URL: http://arxiv.org/abs/2311.08778v2
- Date: Sat, 18 Nov 2023 13:46:07 GMT
- Title: Gitor: Scalable Code Clone Detection by Building Global Sample Graph
- Authors: Junjie Shan, Shihan Dou, Yueming Wu, Hairu Wu, Yang Liu
- Abstract summary: We propose Gitor to capture the underlying connections among different code samples.
Gitor has higher accuracy in terms of code clone detection and excellent execution time for inputs of various sizes.
- Score: 11.041017540277558
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code clone detection is about finding out similar code fragments, which has
drawn much attention in software engineering since it is important for software
maintenance and evolution. Researchers have proposed many techniques and tools
for source code clone detection, but current detection methods concentrate on
analyzing or processing code samples individually without exploring the
underlying connections among code samples. In this paper, we propose Gitor to
capture the underlying connections among different code samples. Specifically,
given a source code database, we first tokenize all code samples to extract the
pre-defined individual information. After obtaining all samples individual
information, we leverage them to build a large global sample graph where each
node is a code sample or a type of individual information. Then we apply a node
embedding technique on the global sample graph to extract all the samples
vector representations. After collecting all code samples vectors, we can
simply compare the similarity between any two samples to detect possible clone
pairs. More importantly, since the obtained vector of a sample is from a global
sample graph, we can combine it with its own code features to improve the code
clone detection performance. To demonstrate the effectiveness of Gitor, we
evaluate it on a widely used dataset namely BigCloneBench. Our experimental
results show that Gitor has higher accuracy in terms of code clone detection
and excellent execution time for inputs of various sizes compared to existing
state-of-the-art tools. Moreover, we also evaluate the combination of Gitor
with other traditional vector-based clone detection methods, the results show
that the use of Gitor enables them detect more code clones with higher F1.
Related papers
- Advanced Detection of Source Code Clones via an Ensemble of Unsupervised Similarity Measures [0.0]
This research introduces a novel ensemble learning approach for code similarity assessment.
The key idea is that the strengths of a diverse set of similarity measures can complement each other and mitigate individual weaknesses.
arXiv Detail & Related papers (2024-05-03T13:42:49Z) - CC2Vec: Combining Typed Tokens with Contrastive Learning for Effective Code Clone Detection [20.729032739935132]
CC2Vec is a novel code encoding method designed to swiftly identify simple code clones.
We evaluate CC2Vec on two widely used datasets (i.e., BigCloneBench and Google Code Jam)
arXiv Detail & Related papers (2024-05-01T10:18:31Z) - Source Code is a Graph, Not a Sequence: A Cross-Lingual Perspective on
Code Clone Detection [3.3298891718069648]
We show that graph-based methods are more suitable for code clone detection than sequence-based methods.
We show that CodeGraph outperforms CodeBERT on both data-sets, especially on cross-lingual code clones.
arXiv Detail & Related papers (2023-12-27T09:30:31Z) - RepoCoder: Repository-Level Code Completion Through Iterative Retrieval
and Generation [96.75695811963242]
RepoCoder is a framework to streamline the repository-level code completion process.
It incorporates a similarity-based retriever and a pre-trained code language model.
It consistently outperforms the vanilla retrieval-augmented code completion approach.
arXiv Detail & Related papers (2023-03-22T13:54:46Z) - Evaluation of Contrastive Learning with Various Code Representations for
Code Clone Detection [3.699097874146491]
We evaluate contrastive learning for detecting semantic clones of code snippets.
We use CodeTransformator to create a dataset that mimics plagiarised code based on competitive programming solutions.
The results of our evaluation show that proposed models perform diversely in each task, however the performance of the graph-based models is generally above the others.
arXiv Detail & Related papers (2022-06-17T12:25:44Z) - Trustable Co-label Learning from Multiple Noisy Annotators [68.59187658490804]
Supervised deep learning depends on massive accurately annotated examples.
A typical alternative is learning from multiple noisy annotators.
This paper proposes a data-efficient approach, called emphTrustable Co-label Learning (TCL)
arXiv Detail & Related papers (2022-03-08T16:57:00Z) - Cherry-Picking Gradients: Learning Low-Rank Embeddings of Visual Data
via Differentiable Cross-Approximation [53.95297550117153]
We propose an end-to-end trainable framework that processes large-scale visual data tensors by looking emphat a fraction of their entries only.
The proposed approach is particularly useful for large-scale multidimensional grid data, and for tasks that require context over a large receptive field.
arXiv Detail & Related papers (2021-05-29T08:39:57Z) - Deep ensembles based on Stochastic Activation Selection for Polyp
Segmentation [82.61182037130406]
This work deals with medical image segmentation and in particular with accurate polyp detection and segmentation during colonoscopy examinations.
Basic architecture in image segmentation consists of an encoder and a decoder.
We compare some variant of the DeepLab architecture obtained by varying the decoder backbone.
arXiv Detail & Related papers (2021-04-02T02:07:37Z) - Learning to map source code to software vulnerability using
code-as-a-graph [67.62847721118142]
We explore the applicability of Graph Neural Networks in learning the nuances of source code from a security perspective.
We show that a code-as-graph encoding is more meaningful for vulnerability detection than existing code-as-photo and linear sequence encoding approaches.
arXiv Detail & Related papers (2020-06-15T16:05:27Z) - Auto-Encoding Twin-Bottleneck Hashing [141.5378966676885]
This paper proposes an efficient and adaptive code-driven graph.
It is updated by decoding in the context of an auto-encoder.
Experiments on benchmarked datasets clearly show the superiority of our framework over the state-of-the-art hashing methods.
arXiv Detail & Related papers (2020-02-27T05:58:12Z) - Detecting Code Clones with Graph Neural Networkand Flow-Augmented
Abstract Syntax Tree [30.484662671342935]
We build a graph representation of programs called flow-augmented abstract syntax tree (FA-AST)
We apply two different types of graph neural networks on FA-AST to measure the similarity of code pairs.
Our approach outperforms the state-of-the-art approaches on both Google Code Jam and BigCloneBench tasks.
arXiv Detail & Related papers (2020-02-20T10:18:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.