CCT-Code: Cross-Consistency Training for Multilingual Clone Detection
and Code Search
- URL: http://arxiv.org/abs/2305.11626v1
- Date: Fri, 19 May 2023 12:09:49 GMT
- Title: CCT-Code: Cross-Consistency Training for Multilingual Clone Detection
and Code Search
- Authors: Nikita Sorokin, Dmitry Abulkhanov, Sergey Nikolenko, Valentin Malykh
- Abstract summary: We formulate the multilingual clone detection problem and present XCD, a new benchmark dataset produced from the CodeForces submissions dataset.
We present a novel training procedure, called cross-consistency training (CCT), that we apply to train language models on source code in different programming languages.
The resulting CCT-LM model achieves new state of the art, outperforming existing approaches on the POJ-104 clone detection benchmark with 95.67% MAP and AdvTest code search benchmark with 47.18% MRR.
- Score: 4.192584020959536
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the clone detection and information retrieval problems for source
code, well-known tasks important for any programming language. Although it is
also an important and interesting problem to find code snippets that operate
identically but are written in different programming languages, to the best of
our knowledge multilingual clone detection has not been studied in literature.
In this work, we formulate the multilingual clone detection problem and present
XCD, a new benchmark dataset produced from the CodeForces submissions dataset.
Moreover, we present a novel training procedure, called cross-consistency
training (CCT), that we apply to train language models on source code in
different programming languages. The resulting CCT-LM model, initialized with
GraphCodeBERT and fine-tuned with CCT, achieves new state of the art,
outperforming existing approaches on the POJ-104 clone detection benchmark with
95.67\% MAP and AdvTest code search benchmark with 47.18\% MRR; it also shows
the best results on the newly created multilingual clone detection benchmark
XCD across all programming languages.
Related papers
- Development and Benchmarking of Multilingual Code Clone Detector [2.253851493296371]
multilingual code clone detectors make it easier to add new language support by providing syntax information of the target language only.
We propose a multilingual code block extraction method based on ANTLR generation and implement a multilingual code clone detector (MSCCD)
Compared to ten state-of-the-art detectors, MSCCD performs at an average level while it also supports a significantly larger number of languages.
arXiv Detail & Related papers (2024-09-10T03:08:33Z) - CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution [50.7413285637879]
The CRUXEVAL-X code reasoning benchmark contains 19 programming languages.
It comprises at least 600 subjects for each language, along with 19K content-consistent tests in total.
Even a model trained solely on Python can achieve at most 34.4% Pass@1 in other languages.
arXiv Detail & Related papers (2024-08-23T11:43:00Z) - Large Language Models for cross-language code clone detection [3.5202378300682162]
Cross-lingual code clone detection has gained traction with the software engineering community.
Inspired by the significant advances in machine learning, this paper revisits cross-lingual code clone detection.
arXiv Detail & Related papers (2024-08-08T12:57:14Z) - AdaCCD: Adaptive Semantic Contrasts Discovery Based Cross Lingual
Adaptation for Code Clone Detection [69.79627042058048]
AdaCCD is a novel cross-lingual adaptation method that can detect cloned codes in a new language without annotations in that language.
We evaluate the cross-lingual adaptation results of AdaCCD by constructing a multilingual code clone detection benchmark consisting of 5 programming languages.
arXiv Detail & Related papers (2023-11-13T12:20:48Z) - ZC3: Zero-Shot Cross-Language Code Clone Detection [79.53514630357876]
We propose a novel method named ZC3 for Zero-shot Cross-language Code Clone detection.
ZC3 designs the contrastive snippet prediction to form an isomorphic representation space among different programming languages.
Based on this, ZC3 exploits domain-aware learning and cycle consistency learning to generate representations that are aligned among different languages are diacritical for different types of clones.
arXiv Detail & Related papers (2023-08-26T03:48:10Z) - Evaluation of Contrastive Learning with Various Code Representations for
Code Clone Detection [3.699097874146491]
We evaluate contrastive learning for detecting semantic clones of code snippets.
We use CodeTransformator to create a dataset that mimics plagiarised code based on competitive programming solutions.
The results of our evaluation show that proposed models perform diversely in each task, however the performance of the graph-based models is generally above the others.
arXiv Detail & Related papers (2022-06-17T12:25:44Z) - Zero-Shot Cross-lingual Semantic Parsing [56.95036511882921]
We study cross-lingual semantic parsing as a zero-shot problem without parallel data for 7 test languages.
We propose a multi-task encoder-decoder model to transfer parsing knowledge to additional languages using only English-Logical form paired data.
Our system frames zero-shot parsing as a latent-space alignment problem and finds that pre-trained models can be improved to generate logical forms with minimal cross-lingual transfer penalty.
arXiv Detail & Related papers (2021-04-15T16:08:43Z) - X-FACTR: Multilingual Factual Knowledge Retrieval from Pretrained
Language Models [103.75890012041366]
Language models (LMs) have proven surprisingly successful at capturing factual knowledge.
However, studies on LMs' factual representation ability have almost invariably been performed on English.
We create a benchmark of cloze-style probes for 23 typologically diverse languages.
arXiv Detail & Related papers (2020-10-13T05:29:56Z) - FILTER: An Enhanced Fusion Method for Cross-lingual Language
Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
To tackle this issue, we propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
arXiv Detail & Related papers (2020-09-10T22:42:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.