CC2Vec: Combining Typed Tokens with Contrastive Learning for Effective Code Clone Detection
- URL: http://arxiv.org/abs/2405.00428v1
- Date: Wed, 1 May 2024 10:18:31 GMT
- Title: CC2Vec: Combining Typed Tokens with Contrastive Learning for Effective Code Clone Detection
- Authors: Shihan Dou, Yueming Wu, Haoxiang Jia, Yuhao Zhou, Yan Liu, Yang Liu,
- Abstract summary: CC2Vec is a novel code encoding method designed to swiftly identify simple code clones.
We evaluate CC2Vec on two widely used datasets (i.e., BigCloneBench and Google Code Jam).
- Score: 20.729032739935132
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the development of the open source community, code is often copied, spread, and evolved across multiple software systems, which brings uncertainty and risk (e.g., bug propagation and copyright infringement). It is therefore important to conduct code clone detection to discover similar code pairs. Many approaches have been proposed to detect code clones, and token-based tools can scale to big code. However, due to the lack of program details, they cannot handle more complicated code clones, i.e., semantic code clones. In this paper, we introduce CC2Vec, a novel code encoding method designed to swiftly identify simple code clones while also enhancing the capability for semantic code clone detection. To retain the program details between tokens, CC2Vec divides them into different categories (i.e., typed tokens) according to their syntactic types and then applies two self-attention layers to encode them. To resist changes in the code structure of semantic code clones, CC2Vec performs contrastive learning to reduce the differences introduced by different code implementations. We evaluate CC2Vec on two widely used datasets (i.e., BigCloneBench and Google Code Jam), and the results show that our method can effectively detect simple code clones. In addition, CC2Vec not only attains performance comparable to widely used semantic code clone detection systems such as ASTNN, SCDetector, and FCCA with simple fine-tuning, but also significantly surpasses these methods in detection efficiency.
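The abstract describes two main ingredients: encoding per-category ("typed") tokens with self-attention, and a contrastive objective over clone pairs. The sketch below is a minimal illustration of that recipe, not the authors' implementation; the category split, layer sizes, pooling, and the InfoNCE-style loss are assumptions made for illustration.

```python
# Minimal sketch of the recipe in the abstract: embed tokens per syntactic
# category ("typed tokens"), encode each category with a small self-attention
# stack, and train with a contrastive loss over clone pairs.
# The category split, layer sizes, pooling, and loss form are illustrative
# assumptions, not the CC2Vec implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

CATEGORIES = ["keyword", "identifier", "operator", "literal"]  # assumed split

class TypedTokenEncoder(nn.Module):
    def __init__(self, vocab_size=10000, dim=128, heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # two self-attention layers per category, mirroring the
        # "two self-attention layers" mentioned in the abstract
        self.encoders = nn.ModuleDict({
            c: nn.TransformerEncoder(
                nn.TransformerEncoderLayer(dim, heads, batch_first=True),
                num_layers=2,
            )
            for c in CATEGORIES
        })
        self.proj = nn.Linear(dim * len(CATEGORIES), dim)

    def forward(self, tokens_by_category):
        # tokens_by_category: dict mapping category -> LongTensor [batch, seq_len]
        pooled = []
        for c in CATEGORIES:
            h = self.encoders[c](self.embed(tokens_by_category[c]))
            pooled.append(h.mean(dim=1))  # average-pool each category
        return F.normalize(self.proj(torch.cat(pooled, dim=-1)), dim=-1)

def clone_contrastive_loss(anchor, positive, temperature=0.1):
    # InfoNCE-style objective: each code vector should be closest to the
    # vector of its clone and farther from the other fragments in the batch.
    logits = anchor @ positive.t() / temperature  # [batch, batch] similarities
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)
```

At inference time, two fragments could be flagged as a clone pair when the cosine similarity of their vectors exceeds a threshold tuned on a validation split; this thresholding step is likewise an assumption of the sketch.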
Related papers
- Rethinking Code Refinement: Learning to Judge Code Efficiency [60.04718679054704]
Large Language Models (LLMs) have demonstrated impressive capabilities in understanding and generating code.
We propose a novel method based on a code language model that is trained to judge the relative efficiency of two different code versions.
We validate our method on multiple programming languages with multiple refinement steps, demonstrating that the proposed method can effectively distinguish between more and less efficient versions of code.
arXiv Detail & Related papers (2024-10-29T06:17:37Z)
- Uncovering LLM-Generated Code: A Zero-Shot Synthetic Code Detector via Code Rewriting [78.48355455324688]
We propose a novel zero-shot synthetic code detector based on the similarity between the code and its rewritten variants.
Our results demonstrate a notable enhancement over existing synthetic content detectors designed for general texts.
arXiv Detail & Related papers (2024-05-25T08:57:28Z)
- SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization [51.67317895094664]
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects.
We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
arXiv Detail & Related papers (2024-01-26T09:23:27Z)
- AdaCCD: Adaptive Semantic Contrasts Discovery Based Cross Lingual Adaptation for Code Clone Detection [69.79627042058048]
AdaCCD is a novel cross-lingual adaptation method that can detect cloned codes in a new language without annotations in that language.
We evaluate the cross-lingual adaptation results of AdaCCD by constructing a multilingual code clone detection benchmark consisting of 5 programming languages.
arXiv Detail & Related papers (2023-11-13T12:20:48Z)
- Towards Understanding the Capability of Large Language Models on Code Clone Detection: A Survey [40.99060616674878]
Large language models (LLMs) possess diverse code-related knowledge, making them versatile for various software engineering challenges.
This paper provides the first comprehensive evaluation of LLMs for clone detection, covering different clone types, languages, and prompts.
We find advanced LLMs excel in detecting complex semantic clones, surpassing existing methods.
arXiv Detail & Related papers (2023-08-02T14:56:01Z)
- CONCORD: Clone-aware Contrastive Learning for Source Code [64.51161487524436]
Self-supervised pre-training has gained traction for learning generic code representations valuable for many downstream SE tasks.
We argue that it is also essential to factor in how developers code day-to-day for general-purpose representation learning.
In particular, we propose CONCORD, a self-supervised, contrastive learning strategy to place benign clones closer in the representation space while moving deviants further apart.
arXiv Detail & Related papers (2023-06-05T20:39:08Z)
- Generalizability of Code Clone Detection on CodeBERT [0.0]
Transformer networks such as CodeBERT already achieve outstanding results for code clone detection in benchmark datasets.
By evaluating two different subsets of Java code clones from BigCloneBench, we show that the generalizability of CodeBERT decreases.
arXiv Detail & Related papers (2022-08-26T11:24:20Z)
- Revisiting Code Search in a Two-Stage Paradigm [67.02322603435628]
TOSS is a two-stage fusion code search framework.
It first uses IR-based and bi-encoder models to efficiently recall a small number of top-k code candidates.
It then uses fine-grained cross-encoders for finer ranking; a generic sketch of this retrieve-then-rerank pattern follows this list.
arXiv Detail & Related papers (2022-08-24T02:34:27Z)
- Evaluation of Contrastive Learning with Various Code Representations for Code Clone Detection [3.699097874146491]
We evaluate contrastive learning for detecting semantic clones of code snippets.
We use CodeTransformator to create a dataset that mimics plagiarised code based on competitive programming solutions.
Our evaluation shows that the proposed models perform differently across tasks; however, the graph-based models generally outperform the others.
arXiv Detail & Related papers (2022-06-17T12:25:44Z)
- Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree [30.484662671342935]
We build a graph representation of programs called flow-augmented abstract syntax tree (FA-AST).
We apply two different types of graph neural networks on FA-AST to measure the similarity of code pairs.
Our approach outperforms the state-of-the-art approaches on both Google Code Jam and BigCloneBench tasks.
arXiv Detail & Related papers (2020-02-20T10:18:37Z)
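As noted in the "Revisiting Code Search in a Two-Stage Paradigm" entry above, a two-stage pipeline first recalls candidates with a cheap scorer and then re-ranks them with a more precise one. The sketch below illustrates that general retrieve-then-rerank pattern under assumed interfaces; the function names and scorers are placeholders rather than the actual TOSS components.

```python
# Generic retrieve-then-rerank sketch for the two-stage code search pattern.
# The scoring callables are placeholders (e.g., BM25 or a bi-encoder for
# recall, a cross-encoder for re-ranking), not the actual TOSS models.
from typing import Callable, List, Tuple

def two_stage_search(
    query: str,
    corpus: List[str],
    recall_score: Callable[[str, str], float],   # fast, approximate scorer
    rerank_score: Callable[[str, str], float],   # slow, fine-grained scorer
    k: int = 100,
    n: int = 10,
) -> List[Tuple[str, float]]:
    # Stage 1: score every snippet with the cheap model and keep the top-k.
    candidates = sorted(corpus, key=lambda c: recall_score(query, c), reverse=True)[:k]
    # Stage 2: re-score only those candidates with the expensive model.
    reranked = sorted(
        ((c, rerank_score(query, c)) for c in candidates),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return reranked[:n]
```

The design point this illustrates is the cost split: the expensive pairwise scorer only ever sees the k recalled candidates rather than the whole corpus.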