Evaluating few shot and Contrastive learning Methods for Code Clone
Detection
- URL: http://arxiv.org/abs/2204.07501v3
- Date: Thu, 9 Nov 2023 18:30:35 GMT
- Title: Evaluating few shot and Contrastive learning Methods for Code Clone
Detection
- Authors: Mohamad Khajezade, Fatemeh Hendijani Fard and Mohamed S. Shehata
- Abstract summary: Code Clone Detection is a software engineering task that is used for plagiarism detection, code search, and code comprehension.
Deep learning-based models have achieved an F1 score (a metric used to assess classifiers) of $\sim$95\% on the CodeXGLUE benchmark.
No previous study evaluates the generalizability of these models when only a limited amount of annotated data is available.
- Score: 5.1623866691702744
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Context: Code Clone Detection (CCD) is a software engineering task that is
used for plagiarism detection, code search, and code comprehension. Recently,
deep learning-based models have achieved an F1 score (a metric used to assess
classifiers) of $\sim$95\% on the CodeXGLUE benchmark. These models require
large amounts of training data and are mainly fine-tuned on Java or C++
datasets. However, no previous study has evaluated the generalizability of
these models when only a limited amount of annotated data is available.
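For reference, the F1 score is the harmonic mean of precision and recall, so a high F1 requires the model to be both precise and sensitive on the benchmark's clone pairs:

$$ F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} $$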
Objective: The main objective of this research is to assess the ability of
CCD models, as well as few-shot learning algorithms, to generalize to unseen
programming problems and new languages (i.e., the model is not trained on
these problems/languages).
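To make these settings concrete, the sketch below shows one way a few-shot evaluation episode could be assembled so that the held-out problems or languages never appear in training, and only a handful of labelled pairs from them are available for fine-tuning. The record layout and the helper `make_episode` are hypothetical, not the authors' actual data pipeline.

```python
# Hypothetical sketch (not the paper's code): building a few-shot episode for
# the unseen-problem / unseen-language scenarios described in this paper.
import random

def make_episode(pairs, held_out_langs=(), held_out_problems=(), k_shot=8, seed=0):
    """Split labelled clone pairs into meta-training data, a small support set,
    and a test set drawn only from the held-out problems/languages.

    Each element of `pairs` is assumed to look like:
      {"code_a": "...", "code_b": "...", "label": 1,
       "language": "java", "problem": "problem-17"}
    """
    def is_held_out(p):
        return p["language"] in held_out_langs or p["problem"] in held_out_problems

    held_out = [p for p in pairs if is_held_out(p)]   # never used for meta-training
    train = [p for p in pairs if not is_held_out(p)]
    random.Random(seed).shuffle(held_out)
    support = held_out[:k_shot]   # the few labelled samples allowed for fine-tuning
    test = held_out[k_shot:]      # report F1 here to measure generalization
    return train, support, test

# Scenario ii) unseen language: meta-train on Java/C++, adapt on a handful of
# Ruby pairs, evaluate on the remaining Ruby pairs.
# train, support, test = make_episode(all_pairs, held_out_langs={"ruby"}, k_shot=8)
```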
Method: We assess the generalizability of the state-of-the-art models for CCD
in few-shot settings (i.e., only a few samples are available for fine-tuning)
under three scenarios: i) unseen problems, ii) unseen languages, and iii) a
combination of new languages and new problems. We choose three datasets
(BigCloneBench, POJ-104, and CodeNet) and three languages (Java, C++, and
Ruby). Then, we
employ Model-Agnostic Meta-Learning (MAML), where the model learns a
meta-learner capable of extracting transferable knowledge from the training
set so that the model can be fine-tuned using only a few samples. Finally, we
combine contrastive learning with MAML to study whether it can further improve
the results of MAML.
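As a rough, first-order sketch of how this could be wired up for pairwise clone classification, the code below adapts a copy of the meta-model on each task's support set and folds an optional contrastive term into the query loss. The `PairClassifier` module, the learning rates, and the simplified pairwise contrastive loss are assumptions for illustration; the paper's actual models and loss formulation may differ.

```python
# Minimal first-order MAML sketch for few-shot code clone detection.
# `PairClassifier`, the toy embedding inputs, and the pairwise contrastive
# term are illustrative assumptions, not the paper's exact implementation.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairClassifier(nn.Module):
    """Scores a pair of code representations as clone / not-clone."""
    def __init__(self, dim=256, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, 2)

    def embed(self, x):
        return self.encoder(x)

    def forward(self, a, b):
        return self.head(torch.cat([self.embed(a), self.embed(b)], dim=-1))

def pairwise_contrastive_loss(za, zb, labels, temperature=0.1):
    """Pull clone pairs together and push non-clones apart (simplified form)."""
    sim = F.cosine_similarity(za, zb, dim=-1) / temperature
    return F.binary_cross_entropy_with_logits(sim, labels.float())

def inner_adapt(model, support, inner_lr=1e-2, steps=3):
    """Copy the meta-model and fine-tune the copy on the task's support set."""
    fast = copy.deepcopy(model)
    opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
    a, b, y = support
    for _ in range(steps):
        opt.zero_grad()
        F.cross_entropy(fast(a, b), y).backward()
        opt.step()
    return fast

def meta_step(model, meta_opt, tasks, use_contrastive=True):
    """First-order MAML outer update over a batch of tasks (problems/languages)."""
    meta_opt.zero_grad()
    for support, query in tasks:
        fast = inner_adapt(model, support)
        fast.zero_grad()                       # drop leftover support-set gradients
        a, b, y = query
        loss = F.cross_entropy(fast(a, b), y)
        if use_contrastive:
            loss = loss + pairwise_contrastive_loss(fast.embed(a), fast.embed(b), y)
        loss.backward()
        # First-order approximation: apply the query-loss gradients of the
        # adapted copy directly to the meta-parameters.
        for p, fp in zip(model.parameters(), fast.parameters()):
            p.grad = fp.grad.clone() if p.grad is None else p.grad + fp.grad
    meta_opt.step()

# Usage sketch: each task bundles a support and a query set of
# (embedding_a, embedding_b, label) tensors for one problem or language.
# model = PairClassifier()
# meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# meta_step(model, meta_opt, tasks=[(support, query)])
```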
Related papers
- Language Models are Better Bug Detector Through Code-Pair Classification [0.26107298043931204]
In this paper, we propose a code-pair classification task in which both the buggy and non-buggy versions are given to the model, and the model identifies the buggy one.
Experiments indicate that an LLM can often pick out the buggy version from the non-buggy one, and that the code-pair classification task is much easier than being given a single snippet and deciding whether and where a bug exists.
arXiv Detail & Related papers (2023-11-14T07:20:57Z)
- Context-Aware Meta-Learning [52.09326317432577]
We propose a meta-learning algorithm that emulates Large Language Models by learning new visual concepts during inference without fine-tuning.
Our approach exceeds or matches the state-of-the-art algorithm, P>M>F, on 8 out of 11 meta-learning benchmarks.
arXiv Detail & Related papers (2023-10-17T03:35:27Z)
- Large Language Model-Aware In-Context Learning for Code Generation [75.68709482932903]
Large language models (LLMs) have shown impressive in-context learning (ICL) ability in code generation.
We propose a novel learning-based selection approach named LAIL (LLM-Aware In-context Learning) for code generation.
arXiv Detail & Related papers (2023-10-15T06:12:58Z)
- On the Steganographic Capacity of Selected Learning Models [1.0640226829362012]
We consider the question of the steganographic capacity of learning models.
For a wide range of models, we determine the number of low-order bits that can be overwritten.
Of the models tested, the steganographic capacity ranges from 7.04 KB for our LR experiments, to 44.74 MB for InceptionV3.
arXiv Detail & Related papers (2023-08-29T10:41:34Z)
- CodeGen2: Lessons for Training LLMs on Programming and Natural Languages [116.74407069443895]
We unify encoder and decoder-based models into a single prefix-LM.
For learning methods, we explore the claim of a "free lunch" hypothesis.
For data distributions, the effect of a mixture distribution and multi-epoch training of programming and natural languages on model performance is explored.
arXiv Detail & Related papers (2023-05-03T17:55:25Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- An Understanding-Oriented Robust Machine Reading Comprehension Model [12.870425062204035]
We propose an understanding-oriented machine reading comprehension model to address three kinds of robustness issues.
Specifically, we first use a natural language inference module to help the model understand the accurate semantic meanings of input questions.
Third, we propose a multilanguage learning mechanism to address the issue of generalization.
arXiv Detail & Related papers (2022-07-01T03:32:02Z)
- Meta Learning for Code Summarization [10.403206672504664]
We show that three SOTA models for code summarization work well on largely disjoint subsets of a large code-base.
We propose three meta-models that select the best candidate summary for a given code segment.
arXiv Detail & Related papers (2022-01-20T17:23:34Z)
- ALT-MAS: A Data-Efficient Framework for Active Testing of Machine Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model-under-test using a Bayesian neural network (BNN).
arXiv Detail & Related papers (2021-04-11T12:14:04Z)
- RethinkCWS: Is Chinese Word Segmentation a Solved Task? [81.11161697133095]
The performance of Chinese Word Segmentation (CWS) systems has gradually reached a plateau with the rapid development of deep neural networks.
In this paper, we take stock of what we have achieved and rethink what's left in the CWS task.
arXiv Detail & Related papers (2020-11-13T11:07:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.