AdaCCD: Adaptive Semantic Contrasts Discovery Based Cross Lingual
Adaptation for Code Clone Detection
- URL: http://arxiv.org/abs/2311.07277v2
- Date: Wed, 6 Mar 2024 17:46:50 GMT
- Title: AdaCCD: Adaptive Semantic Contrasts Discovery Based Cross Lingual
Adaptation for Code Clone Detection
- Authors: Yangkai Du, Tengfei Ma, Lingfei Wu, Xuhong Zhang, Shouling Ji
- Abstract summary: AdaCCD is a novel cross-lingual adaptation method that can detect code clones in a new language without annotations in that language.
We evaluate the cross-lingual adaptation results of AdaCCD by constructing a multilingual code clone detection benchmark consisting of 5 programming languages.
- Score: 69.79627042058048
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code Clone Detection, which aims to retrieve functionally similar programs
from large code bases, has been attracting increasing attention. Modern
software often involves a diverse range of programming languages. However,
current code clone detection methods are generally limited to only a few
popular programming languages due to insufficient annotated data as well as
their own model design constraints. To address these issues, we present AdaCCD,
a novel cross-lingual adaptation method that can detect code clones in a new
language without annotations in that language. AdaCCD leverages
language-agnostic code representations from pre-trained programming language
models and proposes an Adaptively Refined Contrastive Learning framework to
transfer knowledge from resource-rich languages to resource-poor languages. We
evaluate the cross-lingual adaptation results of AdaCCD by constructing a
multilingual code clone detection benchmark consisting of 5 programming
languages. AdaCCD achieves significant improvements over other baselines and
achieves performance comparable to supervised fine-tuning.
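The retrieval setup the abstract describes, scoring functional similarity between programs via embedding similarity, can be sketched as follows. This is an illustrative stand-in: the toy bag-of-tokens `embed` replaces the pre-trained language-agnostic encoder AdaCCD actually uses, and the 0.5 threshold is an arbitrary assumption, not a value from the paper.

```python
import math
from collections import Counter

def embed(code: str) -> Counter:
    # Toy stand-in for a pre-trained code encoder: a bag-of-tokens
    # vector. AdaCCD would use neural, language-agnostic embeddings.
    tokens = code.replace("(", " ( ").replace(")", " ) ").split()
    return Counter(tokens)

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def is_clone(code_a: str, code_b: str, threshold: float = 0.5) -> bool:
    # Two programs count as clones when their embeddings are close enough.
    return cosine(embed(code_a), embed(code_b)) >= threshold
```

With a real encoder, clone retrieval over a code base reduces to a nearest-neighbour search in this embedding space.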
Related papers
- Development and Benchmarking of Multilingual Code Clone Detector [2.253851493296371]
Multilingual code clone detectors make it easier to add new language support, since only syntax information of the target language is required.
We propose a multilingual code block extraction method based on ANTLR parser generation and implement a multilingual code clone detector (MSCCD).
Compared to ten state-of-the-art detectors, MSCCD performs at an average level while supporting a significantly larger number of languages.
arXiv Detail & Related papers (2024-09-10T03:08:33Z) - Large Language Models for cross-language code clone detection [3.5202378300682162]
Cross-lingual code clone detection has gained traction with the software engineering community.
Inspired by the significant advances in machine learning, this paper revisits cross-lingual code clone detection.
arXiv Detail & Related papers (2024-08-08T12:57:14Z) - DA-Net: A Disentangled and Adaptive Network for Multi-Source
Cross-Lingual Transfer Learning [11.78085199896157]
Multi-Source cross-lingual transfer learning deals with the transfer of task knowledge from multiple labelled source languages to an unlabeled target language under the language shift.
We propose a Disentangled and Adaptive Network (DA-Net) to address these challenges.
arXiv Detail & Related papers (2024-03-07T02:30:46Z) - Language Agnostic Code Embeddings [61.84835551549612]
We focus on the cross-lingual capabilities of code embeddings across different programming languages.
Code embeddings comprise two distinct components: one deeply tied to the nuances and syntax of a specific language, and the other remaining agnostic to these details.
We show that when we isolate and eliminate this language-specific component, we witness significant improvements in downstream code retrieval tasks.
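One simple way to realize this "isolate and eliminate" step is to mean-centre embeddings per language, under the assumption that each language's centroid approximates its language-specific component; the paper's exact isolation method may differ.

```python
def remove_language_component(embs_by_lang):
    """Subtract each language's centroid from its embeddings.

    embs_by_lang maps a language name to a list of embedding vectors
    (lists of floats, all the same dimension). Mean-centring per
    language strips the shared, language-specific direction, leaving
    the language-agnostic component used for cross-lingual retrieval.
    """
    out = {}
    for lang, vectors in embs_by_lang.items():
        dim = len(vectors[0])
        centroid = [sum(v[i] for v in vectors) / len(vectors)
                    for i in range(dim)]
        out[lang] = [[v[i] - centroid[i] for i in range(dim)]
                     for v in vectors]
    return out
```

After centring, embeddings from different languages can be compared directly without the per-language offset dominating the similarity score.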
arXiv Detail & Related papers (2023-10-25T17:34:52Z) - ZC3: Zero-Shot Cross-Language Code Clone Detection [79.53514630357876]
We propose a novel method named ZC3 for Zero-shot Cross-language Code Clone detection.
ZC3 designs the contrastive snippet prediction to form an isomorphic representation space among different programming languages.
Based on this, ZC3 exploits domain-aware learning and cycle consistency learning to generate representations that are aligned among different languages and discriminative for different types of clones.
arXiv Detail & Related papers (2023-08-26T03:48:10Z) - CCT-Code: Cross-Consistency Training for Multilingual Clone Detection
and Code Search [4.192584020959536]
We formulate the multilingual clone detection problem and present XCD, a new benchmark dataset produced from the CodeForces submissions dataset.
We present a novel training procedure, called cross-consistency training (CCT), that we apply to train language models on source code in different programming languages.
The resulting CCT-LM model achieves new state of the art, outperforming existing approaches on the POJ-104 clone detection benchmark with 95.67% MAP and AdvTest code search benchmark with 47.18% MRR.
arXiv Detail & Related papers (2023-05-19T12:09:49Z) - Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X.
These datasets cover over 10 programming languages.
We are able to assess the performance of code generation models in a multi-lingual fashion.
arXiv Detail & Related papers (2022-10-26T17:17:06Z) - Zero-Shot Cross-lingual Semantic Parsing [56.95036511882921]
We study cross-lingual semantic parsing as a zero-shot problem without parallel data for 7 test languages.
We propose a multi-task encoder-decoder model to transfer parsing knowledge to additional languages using only English-Logical form paired data.
Our system frames zero-shot parsing as a latent-space alignment problem and finds that pre-trained models can be improved to generate logical forms with minimal cross-lingual transfer penalty.
arXiv Detail & Related papers (2021-04-15T16:08:43Z) - XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.