ZC3: Zero-Shot Cross-Language Code Clone Detection
- URL: http://arxiv.org/abs/2308.13754v2
- Date: Thu, 7 Sep 2023 11:22:59 GMT
- Title: ZC3: Zero-Shot Cross-Language Code Clone Detection
- Authors: Jia Li, Chongyang Tao, Zhi Jin, Fang Liu, Jia Li, Ge Li
- Abstract summary: We propose a novel method named ZC3 for Zero-shot Cross-language Code Clone detection.
ZC3 designs the contrastive snippet prediction to form an isomorphic representation space among different programming languages.
Based on this, ZC3 exploits domain-aware learning and cycle consistency learning to generate representations that are aligned among different languages and are discriminative for different types of clones.
- Score: 79.53514630357876
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Developers introduce code clones to improve programming productivity. Many
existing studies have achieved impressive performance in monolingual code clone
detection. However, during software development, more and more developers write
semantically equivalent programs in different languages to support different
platforms and help developers translate projects from one language to another.
Considering that collecting cross-language parallel data, especially for
low-resource languages, is expensive and time-consuming, how to design an
effective cross-language model that does not rely on any parallel data is a
significant problem. In this paper, we propose a novel method named ZC3 for
Zero-shot Cross-language Code Clone detection. ZC3 designs the contrastive
snippet prediction to form an isomorphic representation space among different
programming languages. Based on this, ZC3 exploits domain-aware learning and
cycle consistency learning to further constrain the model to generate
representations that are aligned among different languages while remaining
discriminative for different types of clones. To evaluate our approach, we conduct
extensive experiments on four representative cross-language clone detection
datasets. Experimental results show that ZC3 outperforms the state-of-the-art
baselines by 67.12%, 51.39%, 14.85%, and 53.01% on the MAP score, respectively.
We further investigate the representational distribution of different languages
and discuss the effectiveness of our method.
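The abstract reports results as MAP (mean average precision) scores. As a minimal sketch of how such a score is typically computed for retrieval-style clone detection, the Python below takes, for each query program, the ranked list of candidates marked as true clone or not. The function names and the input layout are assumptions made for illustration; the abstract does not specify the paper's exact evaluation protocol (for example, any cutoff on the ranked list), so this should not be read as the authors' evaluation code.

```python
from typing import Sequence


def average_precision(ranked_labels: Sequence[bool]) -> float:
    """Average precision for one query.

    ranked_labels[i] is True when the candidate at rank i + 1 is a true
    clone of the query. Assumption: the ranking covers every candidate,
    so all true clones appear somewhere in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_clone in enumerate(ranked_labels, start=1):
        if is_clone:
            hits += 1
            # Precision measured at the rank of each true clone.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(all_ranked_labels: Sequence[Sequence[bool]]) -> float:
    """Mean of the per-query average precision values."""
    if not all_ranked_labels:
        return 0.0
    total = sum(average_precision(labels) for labels in all_ranked_labels)
    return total / len(all_ranked_labels)
```

For instance, a query whose ranked candidates are labelled `[True, False, True]` has precision 1/1 at the first clone and 2/3 at the second, so its average precision is (1 + 2/3) / 2. In practice the ranking would come from sorting candidates by the similarity of their learned representations to the query.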
Related papers
- Development and Benchmarking of Multilingual Code Clone Detector [2.253851493296371]
Multilingual code clone detectors make it easier to add new language support by requiring only syntax information for the target language.
We propose a multilingual code block extraction method based on ANTLR generation and implement a multilingual code clone detector (MSCCD).
Compared with ten state-of-the-art detectors, MSCCD performs at an average level while supporting a significantly larger number of languages.
arXiv Detail & Related papers (2024-09-10T03:08:33Z)
- Large Language Models for cross-language code clone detection [3.5202378300682162]
Cross-lingual code clone detection has gained traction with the software engineering community.
Inspired by the significant advances in machine learning, this paper revisits cross-lingual code clone detection.
arXiv Detail & Related papers (2024-08-08T12:57:14Z)
- The Role of Language Imbalance in Cross-lingual Generalisation: Insights from Cloned Language Experiments [57.273662221547056]
In this study, we investigate an unintuitive novel driver of cross-lingual generalisation: language imbalance.
We observe that the existence of a predominant language during training boosts the performance of less frequent languages.
As we extend our analysis to real languages, we find that infrequent languages still benefit from frequent ones, yet whether language imbalance causes cross-lingual generalisation there remains inconclusive.
arXiv Detail & Related papers (2024-04-11T17:58:05Z)
- AdaCCD: Adaptive Semantic Contrasts Discovery Based Cross Lingual Adaptation for Code Clone Detection [69.79627042058048]
AdaCCD is a novel cross-lingual adaptation method that can detect cloned codes in a new language without annotations in that language.
We evaluate the cross-lingual adaptation results of AdaCCD by constructing a multilingual code clone detection benchmark consisting of 5 programming languages.
arXiv Detail & Related papers (2023-11-13T12:20:48Z)
- Unveiling the potential of large language models in generating semantic and cross-language clones [8.791710193028905]
OpenAI's GPT model has potential in such clone generation as GPT is used for text generation.
In the realm of semantic clones, GPT-3 attains an impressive accuracy of 62.14% and 0.55 BLEU score, achieved through few-shot prompt engineering.
arXiv Detail & Related papers (2023-09-12T17:40:49Z)
- CCT-Code: Cross-Consistency Training for Multilingual Clone Detection and Code Search [4.192584020959536]
We formulate the multilingual clone detection problem and present XCD, a new benchmark dataset produced from the CodeForces submissions dataset.
We present a novel training procedure, called cross-consistency training (CCT), that we apply to train language models on source code in different programming languages.
The resulting CCT-LM model achieves a new state of the art, outperforming existing approaches on the POJ-104 clone detection benchmark with 95.67% MAP and on the AdvTest code search benchmark with 47.18% MRR.
arXiv Detail & Related papers (2023-05-19T12:09:49Z)
- Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X.
These datasets cover over 10 programming languages.
We are able to assess the performance of code generation models in a multi-lingual fashion.
arXiv Detail & Related papers (2022-10-26T17:17:06Z)
- Zero-Shot Cross-lingual Semantic Parsing [56.95036511882921]
We study cross-lingual semantic parsing as a zero-shot problem without parallel data for 7 test languages.
We propose a multi-task encoder-decoder model to transfer parsing knowledge to additional languages using only English-Logical form paired data.
Our system frames zero-shot parsing as a latent-space alignment problem and finds that pre-trained models can be improved to generate logical forms with minimal cross-lingual transfer penalty.
arXiv Detail & Related papers (2021-04-15T16:08:43Z)
- XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization [128.37244072182506]
The Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark evaluates the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks.
We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models.
arXiv Detail & Related papers (2020-03-24T19:09:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.