Unveiling the potential of large language models in generating semantic
and cross-language clones
- URL: http://arxiv.org/abs/2309.06424v1
- Date: Tue, 12 Sep 2023 17:40:49 GMT
- Title: Unveiling the potential of large language models in generating semantic
and cross-language clones
- Authors: Palash R. Roy, Ajmain I. Alam, Farouq Al-omari, Banani Roy, Chanchal
K. Roy, Kevin A. Schneider
- Abstract summary: OpenAI's GPT model has potential in such clone generation as GPT is used for text generation.
In the realm of semantic clones, GPT-3 attains an impressive accuracy of 62.14% and 0.55 BLEU score, achieved through few-shot prompt engineering.
- Score: 8.791710193028905
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Semantic and Cross-language code clone generation may be useful for code
reuse, code comprehension, refactoring and benchmarking. OpenAI's GPT model has
potential in such clone generation as GPT is used for text generation. When
developers copy/paste codes from Stack Overflow (SO) or within a system, there
might be inconsistent changes leading to unexpected behaviours. Similarly, if
someone possesses a code snippet in a particular programming language but seeks
equivalent functionality in a different language, a semantic cross-language
code clone generation approach could provide valuable assistance. In this study,
using SemanticCloneBench as a vehicle, we evaluated how well the GPT-3 model
could help generate semantic and cross-language clone variants for a given
fragment. We compiled a diverse set of code fragments and assessed GPT-3's
performance in generating code variants. Through extensive experimentation and
analysis, where 9 judges spent 158 hours to validate, we investigate the
model's ability to produce accurate and semantically correct variants. Our
findings shed light on GPT-3's strengths in code generation, offering insights
into the potential applications and challenges of using advanced language
models in software development. Our quantitative analysis yields compelling
results. In the realm of semantic clones, GPT-3 attains an impressive accuracy
of 62.14% and 0.55 BLEU score, achieved through few-shot prompt engineering.
Furthermore, the model shines in transcending linguistic confines, boasting an
exceptional 91.25% accuracy in generating cross-language clones.
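The abstract's evaluation pipeline (few-shot prompting followed by BLEU scoring) can be sketched in a few lines. This is an illustrative reconstruction, not the paper's actual code: the prompt template is a guess, and `bleu` below is a simplified sentence-level BLEU on whitespace tokens without smoothing, standing in for whatever BLEU implementation the authors used to obtain the 0.55 figure.

```python
import math
from collections import Counter

def few_shot_prompt(examples, query):
    """Assemble a few-shot clone-generation prompt from (code, clone) example
    pairs. The template is hypothetical; the paper does not publish its exact
    prompt wording."""
    parts = []
    for src, clone in examples:
        parts.append(f"Code:\n{src}\nSemantic clone:\n{clone}\n")
    parts.append(f"Code:\n{query}\nSemantic clone:\n")
    return "\n".join(parts)

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of 1..max_n n-gram
    precisions on whitespace tokens, with brevity penalty, no smoothing."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((cand_ngrams & ref_ngrams).values())  # clipped matches
        precisions.append(overlap / max(sum(cand_ngrams.values()), 1))
    if min(precisions) == 0:
        return 0.0  # any empty n-gram overlap zeroes the geometric mean
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))  # brevity penalty
    return bp * math.exp(log_avg)
```

In the paper's setup, the prompt produced by `few_shot_prompt` would be sent to GPT-3 and the returned clone compared against a reference with BLEU; an identical candidate and reference score 1.0, and a candidate sharing no tokens with the reference scores 0.0.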
Related papers
- Development and Benchmarking of Multilingual Code Clone Detector [2.253851493296371]
Multilingual code clone detectors make it easier to add new language support by providing syntax information of the target language only.
We propose a multilingual code block extraction method based on ANTLR generation and implement a multilingual code clone detector (MSCCD).
Compared to ten state-of-the-art detectors, MSCCD performs at an average level while it also supports a significantly larger number of languages.
arXiv Detail & Related papers (2024-09-10T03:08:33Z) - Large Language Models for cross-language code clone detection [3.5202378300682162]
Cross-lingual code clone detection has gained traction with the software engineering community.
Inspired by the significant advances in machine learning, this paper revisits cross-lingual code clone detection.
arXiv Detail & Related papers (2024-08-08T12:57:14Z) - Assessing the Code Clone Detection Capability of Large Language Models [0.0]
The evaluation involves testing the models on a variety of code pairs of different clone types and levels of similarity.
Findings indicate that GPT-4 consistently surpasses GPT-3.5 across all clone types.
arXiv Detail & Related papers (2024-07-02T16:20:44Z) - Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs [57.27982780697922]
Large language models have demonstrated exceptional capability in natural language understanding and generation.
However, their generation speed is limited by the inherently sequential nature of their decoding process.
This paper introduces Lexical Unit Decoding, a novel decoding methodology implemented in a data-driven manner.
arXiv Detail & Related papers (2024-05-24T04:35:13Z) - AdaCCD: Adaptive Semantic Contrasts Discovery Based Cross Lingual
Adaptation for Code Clone Detection [69.79627042058048]
AdaCCD is a novel cross-lingual adaptation method that can detect cloned codes in a new language without annotations in that language.
We evaluate the cross-lingual adaptation results of AdaCCD by constructing a multilingual code clone detection benchmark consisting of 5 programming languages.
arXiv Detail & Related papers (2023-11-13T12:20:48Z) - CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model [58.127534002232096]
This paper introduces CodeFuse-13B, an open-sourced pre-trained code LLM.
It is specifically designed for code-related tasks with both English and Chinese prompts.
CodeFuse achieves its effectiveness by utilizing a high quality pre-training dataset.
arXiv Detail & Related papers (2023-10-10T02:38:44Z) - GPTCloneBench: A comprehensive benchmark of semantic clones and
cross-language clones using GPT-3 model and SemanticCloneBench [1.8687918300580921]
We present a comprehensive semantic clone and cross-language clone benchmark, GPTCloneBench by exploiting SemanticCloneBench and OpenAI's GPT-3 model.
From 79,928 clone pairs of GPT-3 output, we created a benchmark with 37,149 true semantic clone pairs, 19,288 false semantic pairs (Type-1/Type-2), and 20,770 cross-language clones across four languages (Java, C, C#, and Python).
arXiv Detail & Related papers (2023-08-26T21:50:34Z) - ZC3: Zero-Shot Cross-Language Code Clone Detection [79.53514630357876]
We propose a novel method named ZC3 for Zero-shot Cross-language Code Clone detection.
ZC3 designs the contrastive snippet prediction to form an isomorphic representation space among different programming languages.
Based on this, ZC3 exploits domain-aware learning and cycle consistency learning to generate representations that are aligned among different languages and diacritical for different types of clones.
arXiv Detail & Related papers (2023-08-26T03:48:10Z) - Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X.
These datasets cover over 10 programming languages.
We are able to assess the performance of code generation models in a multi-lingual fashion.
arXiv Detail & Related papers (2022-10-26T17:17:06Z) - Zero-Shot Cross-lingual Semantic Parsing [56.95036511882921]
We study cross-lingual semantic parsing as a zero-shot problem without parallel data for 7 test languages.
We propose a multi-task encoder-decoder model to transfer parsing knowledge to additional languages using only English-Logical form paired data.
Our system frames zero-shot parsing as a latent-space alignment problem and finds that pre-trained models can be improved to generate logical forms with minimal cross-lingual transfer penalty.
arXiv Detail & Related papers (2021-04-15T16:08:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.