Code Clone Detection based on Event Embedding and Event Dependency
- URL: http://arxiv.org/abs/2111.14183v1
- Date: Sun, 28 Nov 2021 15:50:15 GMT
- Title: Code Clone Detection based on Event Embedding and Event Dependency
- Authors: Cheng Huang, Hui Zhou, Chunyang Ye, Bingzhuo Li
- Abstract summary: We propose a code clone detection method based on semantic similarity.
By treating code as a series of interdependent events that occur continuously, we design a model, named EDAM, to encode code semantic information.
Experimental results show that our EDAM model is superior to state-of-the-art open-source models for code clone detection.
- Score: 7.652540019496754
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The code clone detection method based on semantic similarity has important
value in software engineering tasks (e.g., software evolution, software reuse).
Traditional code clone detection technologies pay more attention to the
similarity of code at the syntax level, and less attention to the semantic
similarity of the code. As a result, candidate codes similar in semantics are
ignored. To address this issue, we propose a code clone detection method based
on semantic similarity. By treating code as a series of interdependent events
that occur continuously, we design a model, named EDAM, to encode code semantic
information based on event embedding and event dependency. The EDAM model uses
the event embedding method to model the execution characteristics of program
statements and the data dependence information between all statements. In this
way, we can embed the program semantic information into a vector and use the
vector to detect codes similar in semantics. Experimental results show that the
performance of our EDAM model is superior to state-of-the-art open-source
models for code clone detection.
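The abstract's pipeline (embed program semantics into a vector, then compare vectors to detect semantic clones) can be sketched as below. This is an illustrative stand-in, not the actual EDAM implementation: EDAM's encoder is learned from event embeddings and data-dependency information, whereas here a toy bag-of-tokens encoder plays that role, and the names `embed`, `is_clone`, and the 0.8 threshold are hypothetical.

```python
import math
import re
from collections import Counter

def embed(code, vocab):
    """Toy stand-in for a learned encoder: a bag-of-tokens count vector.
    EDAM would instead derive this vector from event embeddings and
    data-dependency information between statements."""
    counts = Counter(re.findall(r"\w+", code))
    return [counts[t] for t in vocab]

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def is_clone(code_a, code_b, threshold=0.8):
    """Flag a pair as a clone when the embedded vectors are close enough."""
    vocab = sorted(set(re.findall(r"\w+", code_a + " " + code_b)))
    return cosine(embed(code_a, vocab), embed(code_b, vocab)) >= threshold

a = "total = 0\nfor x in xs:\n    total += x"
b = "total = 0\nfor x in xs:\n    total = total + x"
print(is_clone(a, b))
```

With a learned encoder, the same cosine-and-threshold comparison step would surface candidates that are syntactically different but semantically similar.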
Related papers
- Advanced Detection of Source Code Clones via an Ensemble of Unsupervised Similarity Measures [0.0]
This research introduces a novel ensemble learning approach for code similarity assessment.
The key idea is that the strengths of a diverse set of similarity measures can complement each other and mitigate individual weaknesses.
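A minimal sketch of that ensemble idea, assuming two simple unsupervised measures (token-set Jaccard and difflib's character-level ratio) as stand-ins for the paper's actual measure set; `ensemble_score` is a hypothetical name:

```python
import re
from difflib import SequenceMatcher

def jaccard(a, b):
    """Token-set overlap: robust to reordering, blind to structure."""
    ta, tb = set(re.findall(r"\w+", a)), set(re.findall(r"\w+", b))
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def seq_similarity(a, b):
    """Character-level similarity: sensitive to ordering and spelling."""
    return SequenceMatcher(None, a, b).ratio()

def ensemble_score(a, b, measures=(jaccard, seq_similarity)):
    """Average several measures so the strengths of one can
    offset the weaknesses of another."""
    return sum(m(a, b) for m in measures) / len(measures)
```

A real ensemble would likely weight or learn to combine the measures rather than average them uniformly; the uniform mean is only the simplest combination rule.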
arXiv Detail & Related papers (2024-05-03T13:42:49Z)
- CC2Vec: Combining Typed Tokens with Contrastive Learning for Effective Code Clone Detection [20.729032739935132]
CC2Vec is a novel code encoding method designed to swiftly identify simple code clones.
We evaluate CC2Vec on two widely used datasets (i.e., BigCloneBench and Google Code Jam)
arXiv Detail & Related papers (2024-05-01T10:18:31Z)
- SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization [51.67317895094664]
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects.
We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
arXiv Detail & Related papers (2024-01-26T09:23:27Z)
- Source Code Clone Detection Using Unsupervised Similarity Measures [0.0]
This work presents a comparative analysis of unsupervised similarity measures for source code clone detection.
The goal is to overview the current state-of-the-art techniques, their strengths, and weaknesses.
arXiv Detail & Related papers (2024-01-18T10:56:27Z)
- CONCORD: Clone-aware Contrastive Learning for Source Code [64.51161487524436]
Self-supervised pre-training has gained traction for learning generic code representations valuable for many downstream SE tasks.
We argue that it is also essential to factor in how developers code day-to-day for general-purpose representation learning.
In particular, we propose CONCORD, a self-supervised, contrastive learning strategy to place benign clones closer in the representation space while moving deviants further apart.
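The "closer in the representation space / further apart" objective can be sketched as a margin-based contrastive loss over representation vectors. This is an illustrative stand-in, not CONCORD's actual loss, and `margin=0.5` is an arbitrary assumption:

```python
import math

def cosine(u, v):
    """Cosine similarity; assumes nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def contrastive_loss(anchor, other, is_clone_pair, margin=0.5):
    """Pull benign clone pairs toward similarity 1; push deviant
    (non-clone) pairs below the margin, with zero loss once a
    deviant pair's similarity drops under `margin`."""
    sim = cosine(anchor, other)
    if is_clone_pair:
        return 1.0 - sim             # clones: penalize any gap from 1
    return max(0.0, sim - margin)    # deviants: penalize similarity above margin
```

In training, such a loss would be minimized over batches of clone and non-clone pairs produced by an encoder; here it only illustrates the geometry of the objective.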
arXiv Detail & Related papers (2023-06-05T20:39:08Z)
- Probing Semantic Grounding in Language Models of Code with Representational Similarity Analysis [0.11470070927586018]
We propose using Representational Similarity Analysis to probe the semantic grounding in language models of code.
We probe representations from the CodeBERT model for semantic grounding by using the data from the IBM CodeNet dataset.
Our experiments with semantic perturbations in code reveal that CodeBERT is able to robustly distinguish between semantically correct and incorrect code.
arXiv Detail & Related papers (2022-07-15T19:04:43Z)
- Evaluation of Contrastive Learning with Various Code Representations for Code Clone Detection [3.699097874146491]
We evaluate contrastive learning for detecting semantic clones of code snippets.
We use CodeTransformator to create a dataset that mimics plagiarised code based on competitive programming solutions.
The results of our evaluation show that the proposed models perform differently on each task; however, the graph-based models generally outperform the others.
arXiv Detail & Related papers (2022-06-17T12:25:44Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- COSEA: Convolutional Code Search with Layer-wise Attention [90.35777733464354]
We propose a new deep learning architecture, COSEA, which leverages convolutional neural networks with layer-wise attention to capture the code's intrinsic structural logic.
COSEA can achieve significant improvements over state-of-the-art methods on code search tasks.
arXiv Detail & Related papers (2020-10-19T13:53:38Z)
- Semantic Clone Detection via Probabilistic Software Modeling [69.43451204725324]
This article contributes a semantic clone detection approach that detects clones that have 0% syntactic similarity.
We present SCD-PSM as a stable and precise solution to semantic clone detection.
arXiv Detail & Related papers (2020-08-11T17:54:20Z)
- A Transformer-based Approach for Source Code Summarization [86.08359401867577]
We learn code representation for summarization by modeling the pairwise relationship between code tokens.
We show that, despite its simplicity, the approach outperforms the state-of-the-art techniques by a significant margin.
arXiv Detail & Related papers (2020-05-01T23:29:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.