Evaluation of Contrastive Learning with Various Code Representations for
Code Clone Detection
- URL: http://arxiv.org/abs/2206.08726v1
- Date: Fri, 17 Jun 2022 12:25:44 GMT
- Authors: Maksim Zubkov, Egor Spirin, Egor Bogomolov, Timofey Bryksin
- Abstract summary: We evaluate contrastive learning for detecting semantic clones of code snippets.
We use CodeTransformator to create a dataset that mimics plagiarised code based on competitive programming solutions.
The results of our evaluation show that the proposed models perform differently across the two tasks; however, the graph-based models generally outperform the others.
- Score: 3.699097874146491
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code clones are pairs of code snippets that implement similar functionality.
Clone detection is a fundamental branch of automatic source code comprehension,
having many applications in refactoring recommendation, plagiarism detection,
and code summarization. A particularly interesting case of clone detection is
the detection of semantic clones, i.e., code snippets that have the same
functionality but significantly differ in implementation. A promising approach
to detecting semantic clones is contrastive learning (CL), a machine learning
paradigm popular in computer vision but not yet commonly adopted for code
processing.
Our work aims to evaluate the most popular CL algorithms combined with three
source code representations on two tasks. The first task is code clone
detection, which we evaluate on the POJ-104 dataset containing implementations
of 104 algorithms. The second task is plagiarism detection. To evaluate the
models on this task, we introduce CodeTransformator, a tool for transforming
source code. We use it to create a dataset that mimics plagiarised code based
on competitive programming solutions. We trained nine models for both tasks and
compared them with six existing approaches, including traditional tools and
modern pre-trained neural models. The results of our evaluation show that
proposed models perform differently on each task; however, the performance of the
graph-based models is generally above the others. Among CL algorithms, SimCLR
and SwAV lead to better results, while MoCo is the most robust approach. Our
code and trained models are available at
https://doi.org/10.5281/zenodo.6360627, https://doi.org/10.5281/zenodo.5596345.
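For readers unfamiliar with the CL objectives compared above, the SimCLR-style NT-Xent loss can be sketched in a few lines. This is a minimal NumPy illustration, assuming paired "views" of each snippet (e.g. a snippet and a semantics-preserving transformation of it); the function name and shapes are illustrative, not the paper's implementation.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent (SimCLR-style) contrastive loss over paired embeddings.

    z1[i] and z2[i] are two views of the same code snippet; every other
    row in the batch serves as a negative for that pair.
    """
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)              # (2n, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit-norm -> cosine sim
    sim = z @ z.T / temperature                       # (2n, 2n) similarities
    np.fill_diagonal(sim, -np.inf)                    # exclude self-pairs
    # Row i's positive sits at (i + n) mod 2n: the other view of the same snippet.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()

rng = np.random.default_rng(0)
anchor = rng.normal(size=(4, 8))
positive = anchor + 0.01 * rng.normal(size=(4, 8))  # near-identical views
unrelated = rng.normal(size=(4, 8))                 # unrelated snippets
```

Pairing a snippet with a close transformation of itself should yield a much lower loss than pairing it with unrelated code, which is exactly the pressure that pulls semantic clones together in the embedding space.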
Related papers
- CC2Vec: Combining Typed Tokens with Contrastive Learning for Effective Code Clone Detection [20.729032739935132]
CC2Vec is a novel code encoding method designed to swiftly identify simple code clones.
We evaluate CC2Vec on two widely used datasets (i.e., BigCloneBench and Google Code Jam)
arXiv Detail & Related papers (2024-05-01T10:18:31Z)
- Does Your Neural Code Completion Model Use My Code? A Membership Inference Approach [66.51005288743153]
We investigate the legal and ethical issues of current neural code completion models.
We tailor a membership inference approach (termed CodeMI) that was originally crafted for classification tasks.
We evaluate the effectiveness of this adapted approach across a diverse array of neural code completion models.
arXiv Detail & Related papers (2024-04-22T15:54:53Z)
- Source Code is a Graph, Not a Sequence: A Cross-Lingual Perspective on Code Clone Detection [3.3298891718069648]
We show that graph-based methods are more suitable for code clone detection than sequence-based methods.
We show that CodeGraph outperforms CodeBERT on both datasets, especially on cross-lingual code clones.
arXiv Detail & Related papers (2023-12-27T09:30:31Z)
- Zero-Shot Detection of Machine-Generated Codes [83.0342513054389]
This work proposes a training-free approach for detecting LLM-generated code.
We find that existing training-based or zero-shot text detectors are ineffective at detecting code.
Our method exhibits robustness against revision attacks and generalizes well to Java code.
arXiv Detail & Related papers (2023-10-08T10:08:21Z)
- Towards Understanding the Capability of Large Language Models on Code Clone Detection: A Survey [40.99060616674878]
Large language models (LLMs) possess diverse code-related knowledge, making them versatile for various software engineering challenges.
This paper provides the first comprehensive evaluation of LLMs for clone detection, covering different clone types, languages, and prompts.
We find that advanced LLMs excel at detecting complex semantic clones, surpassing existing methods.
arXiv Detail & Related papers (2023-08-02T14:56:01Z)
- CONCORD: Clone-aware Contrastive Learning for Source Code [64.51161487524436]
Self-supervised pre-training has gained traction for learning generic code representations valuable for many downstream SE tasks.
We argue that it is also essential to factor in how developers code day-to-day for general-purpose representation learning.
In particular, we propose CONCORD, a self-supervised, contrastive learning strategy to place benign clones closer in the representation space while moving deviants further apart.
arXiv Detail & Related papers (2023-06-05T20:39:08Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- CodeRetriever: Unimodal and Bimodal Contrastive Learning [128.06072658302165]
We propose the CodeRetriever model, which combines unimodal and bimodal contrastive learning to train function-level code semantic representations.
For unimodal contrastive learning, we design a semantic-guided method to build positive code pairs based on the documentation and function name.
For bimodal contrastive learning, we leverage the documentation and in-line comments of code to build text-code pairs.
arXiv Detail & Related papers (2022-01-26T10:54:30Z)
- Semantic Clone Detection via Probabilistic Software Modeling [69.43451204725324]
This article contributes a semantic clone detection approach that detects clones that have 0% syntactic similarity.
We present SCD-PSM as a stable and precise solution to semantic clone detection.
arXiv Detail & Related papers (2020-08-11T17:54:20Z)
- Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree [30.484662671342935]
We build a graph representation of programs called the flow-augmented abstract syntax tree (FA-AST).
We apply two different types of graph neural networks on FA-AST to measure the similarity of code pairs.
Our approach outperforms the state-of-the-art approaches on both Google Code Jam and BigCloneBench tasks.
arXiv Detail & Related papers (2020-02-20T10:18:37Z)
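The graph-based detectors above share a common pipeline: encode each code graph into a vector with a GNN, then score pairs by embedding similarity. The sketch below uses a toy mean-aggregation encoder and cosine scoring; `embed_graph`, `clone_score`, and the one-hot node features are illustrative assumptions, not the FA-AST model or any paper's actual architecture.

```python
import numpy as np

def embed_graph(features, adj, steps=2):
    """Toy mean-aggregation message passing over a code graph.

    A simplified stand-in for a GNN encoder (hypothetical illustration):
    `features` is an (n_nodes, d) matrix of node embeddings and `adj` a
    symmetric binary adjacency matrix.
    """
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)
    h = features
    for _ in range(steps):
        h = (adj @ h) / deg + h          # aggregate neighbours, keep residual
    g = h.mean(axis=0)                   # mean-pool nodes into a graph vector
    return g / np.linalg.norm(g)

def clone_score(g1, g2):
    """Cosine similarity between two unit-norm graph embeddings."""
    return float(g1 @ g2)

# Three nodes with one-hot "node type" features, two different structures.
feat = np.eye(3)
path = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])  # chain 0-1-2
tri = np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])   # triangle
g_a = embed_graph(feat, path)
g_c = embed_graph(feat, tri)
```

An identical graph pair scores exactly 1, while a structural change (here, an extra edge) lowers the score; a real detector would threshold such scores to flag clone pairs.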
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.