Evaluation of Contrastive Learning with Various Code Representations for
Code Clone Detection
- URL: http://arxiv.org/abs/2206.08726v1
- Date: Fri, 17 Jun 2022 12:25:44 GMT
- Authors: Maksim Zubkov, Egor Spirin, Egor Bogomolov, Timofey Bryksin
- Abstract summary: We evaluate contrastive learning for detecting semantic clones of code snippets.
We use CodeTransformator to create a dataset that mimics plagiarised code based on competitive programming solutions.
The results of our evaluation show that the proposed models perform differently across the two tasks; however, the graph-based models generally outperform the others.
- Score: 3.699097874146491
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code clones are pairs of code snippets that implement similar functionality.
Clone detection is a fundamental branch of automatic source code comprehension,
having many applications in refactoring recommendation, plagiarism detection,
and code summarization. A particularly interesting case of clone detection is
the detection of semantic clones, i.e., code snippets that have the same
functionality but significantly differ in implementation. A promising approach
to detecting semantic clones is contrastive learning (CL), a machine learning
paradigm popular in computer vision but not yet commonly adopted for code
processing.
Our work aims to evaluate the most popular CL algorithms combined with three
source code representations on two tasks. The first task is code clone
detection, which we evaluate on the POJ-104 dataset containing implementations
of 104 algorithms. The second task is plagiarism detection. To evaluate the
models on this task, we introduce CodeTransformator, a tool for transforming
source code. We use it to create a dataset that mimics plagiarised code based
on competitive programming solutions. We trained nine models for both tasks and
compared them with six existing approaches, including traditional tools and
modern pre-trained neural models. The results of our evaluation show that
proposed models perform differently on each task; however, the performance of the
graph-based models is generally above the others. Among CL algorithms, SimCLR
and SwAV lead to better results, while MoCo is the most robust approach. Our
code and trained models are available at
https://doi.org/10.5281/zenodo.6360627, https://doi.org/10.5281/zenodo.5596345.
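For readers unfamiliar with the CL objectives compared above, the SimCLR-style NT-Xent loss can be sketched in a few lines. This is a minimal NumPy illustration, assuming paired "views" of each snippet (e.g. a snippet and a semantics-preserving transformation of it); the function name and shapes are illustrative, not the paper's implementation.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent (SimCLR-style) contrastive loss over paired embeddings.

    z1[i] and z2[i] are two views of the same code snippet; every other
    row in the batch serves as a negative for that pair.
    """
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)              # (2n, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit-norm -> cosine sim
    sim = z @ z.T / temperature                       # (2n, 2n) similarities
    np.fill_diagonal(sim, -np.inf)                    # exclude self-pairs
    # Row i's positive sits at (i + n) mod 2n: the other view of the same snippet.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()

rng = np.random.default_rng(0)
anchor = rng.normal(size=(4, 8))
positive = anchor + 0.01 * rng.normal(size=(4, 8))  # near-identical views
unrelated = rng.normal(size=(4, 8))                 # unrelated snippets
```

Pairing a snippet with a close transformation of itself should yield a much lower loss than pairing it with unrelated code, which is exactly the pressure that pulls semantic clones together in the embedding space.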
Related papers
- CC2Vec: Combining Typed Tokens with Contrastive Learning for Effective Code Clone Detection [20.729032739935132]
CC2Vec is a novel code encoding method designed to swiftly identify simple code clones.
We evaluate CC2Vec on two widely used datasets (i.e., BigCloneBench and Google Code Jam)
arXiv Detail & Related papers (2024-05-01T10:18:31Z)
- Does Your Neural Code Completion Model Use My Code? A Membership Inference Approach [66.51005288743153]
We investigate the legal and ethical issues of current neural code completion models.
We tailor a membership inference approach (termed CodeMI) that was originally crafted for classification tasks.
We evaluate the effectiveness of this adapted approach across a diverse array of neural code completion models.
arXiv Detail & Related papers (2024-04-22T15:54:53Z)
- Source Code is a Graph, Not a Sequence: A Cross-Lingual Perspective on Code Clone Detection [3.3298891718069648]
We show that graph-based methods are more suitable for code clone detection than sequence-based methods.
We show that CodeGraph outperforms CodeBERT on both datasets, especially on cross-lingual code clones.
arXiv Detail & Related papers (2023-12-27T09:30:31Z)
- Zero-Shot Detection of Machine-Generated Codes [83.0342513054389]
This work proposes a training-free approach for detecting LLM-generated code.
We find that existing training-based or zero-shot text detectors are ineffective at detecting code.
Our method exhibits robustness against revision attacks and generalizes well to Java code.
arXiv Detail & Related papers (2023-10-08T10:08:21Z)
- Towards Understanding the Capability of Large Language Models on Code Clone Detection: A Survey [40.99060616674878]
Large language models (LLMs) possess diverse code-related knowledge, making them versatile for various software engineering challenges.
This paper provides the first comprehensive evaluation of LLMs for clone detection, covering different clone types, languages, and prompts.
We find that advanced LLMs excel at detecting complex semantic clones, surpassing existing methods.
arXiv Detail & Related papers (2023-08-02T14:56:01Z)
- CONCORD: Clone-aware Contrastive Learning for Source Code [64.51161487524436]
Self-supervised pre-training has gained traction for learning generic code representations valuable for many downstream SE tasks.
We argue that it is also essential to factor in how developers code day-to-day for general-purpose representation learning.
In particular, we propose CONCORD, a self-supervised, contrastive learning strategy to place benign clones closer in the representation space while moving deviants further apart.
arXiv Detail & Related papers (2023-06-05T20:39:08Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- CodeRetriever: Unimodal and Bimodal Contrastive Learning [128.06072658302165]
We propose the CodeRetriever model, which combines unimodal and bimodal contrastive learning to train function-level code semantic representations.
For unimodal contrastive learning, we design a semantic-guided method to build positive code pairs based on the documentation and function name.
For bimodal contrastive learning, we leverage the documentation and in-line comments of code to build text-code pairs.
arXiv Detail & Related papers (2022-01-26T10:54:30Z)
- Semantic Clone Detection via Probabilistic Software Modeling [69.43451204725324]
This article contributes a semantic clone detection approach that detects clones that have 0% syntactic similarity.
We present SCD-PSM as a stable and precise solution to semantic clone detection.
arXiv Detail & Related papers (2020-08-11T17:54:20Z)
- Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree [30.484662671342935]
We build a graph representation of programs called the flow-augmented abstract syntax tree (FA-AST).
We apply two different types of graph neural networks on FA-AST to measure the similarity of code pairs.
Our approach outperforms the state-of-the-art approaches on both Google Code Jam and BigCloneBench tasks.
arXiv Detail & Related papers (2020-02-20T10:18:37Z)
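The graph-based detectors above share a common pipeline: encode each code graph into a vector with a GNN, then score pairs by embedding similarity. The sketch below uses a toy mean-aggregation encoder and cosine scoring; `embed_graph`, `clone_score`, and the one-hot node features are illustrative assumptions, not the FA-AST model or any paper's actual architecture.

```python
import numpy as np

def embed_graph(features, adj, steps=2):
    """Toy mean-aggregation message passing over a code graph.

    A simplified stand-in for a GNN encoder (hypothetical illustration):
    `features` is an (n_nodes, d) matrix of node embeddings and `adj` a
    symmetric binary adjacency matrix.
    """
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)
    h = features
    for _ in range(steps):
        h = (adj @ h) / deg + h          # aggregate neighbours, keep residual
    g = h.mean(axis=0)                   # mean-pool nodes into a graph vector
    return g / np.linalg.norm(g)

def clone_score(g1, g2):
    """Cosine similarity between two unit-norm graph embeddings."""
    return float(g1 @ g2)

# Three nodes with one-hot "node type" features, two different structures.
feat = np.eye(3)
path = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])  # chain 0-1-2
tri = np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])   # triangle
g_a = embed_graph(feat, path)
g_c = embed_graph(feat, tri)
```

An identical graph pair scores exactly 1, while a structural change (here, an extra edge) lowers the score; a real detector would threshold such scores to flag clone pairs.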
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.