Generalizability of Code Clone Detection on CodeBERT
- URL: http://arxiv.org/abs/2208.12588v1
- Date: Fri, 26 Aug 2022 11:24:20 GMT
- Title: Generalizability of Code Clone Detection on CodeBERT
- Authors: Tim Sonnekalb, Bernd Gruner, Clemens-Alexander Brust, Patrick Mäder
- Abstract summary: Transformer networks such as CodeBERT already achieve outstanding results for code clone detection in benchmark datasets.
We show that the generalizability of CodeBERT decreases when it is evaluated on two different subsets of Java code clones from BigCloneBench.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer networks such as CodeBERT already achieve outstanding results for
code clone detection in benchmark datasets, so one could assume that this task
has already been solved. However, code clone detection is not a trivial task.
Semantic code clones, in particular, are challenging to detect. We show that
the generalizability of CodeBERT decreases when it is evaluated on two
different subsets of Java code clones from BigCloneBench. We observe a
significant drop in F1 score when the model is evaluated on code snippets and
functionality IDs different from those used for model building.
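The crux of the evaluation is the data split. Below is a minimal sketch of a functionality-ID-disjoint split, assuming BigCloneBench pairs are available as (code_a, code_b, functionality_id, label) tuples; it illustrates the protocol rather than reproducing the authors' exact pipeline:
```python
import random
from collections import defaultdict

def split_by_functionality(pairs, test_ratio=0.2, seed=0):
    """Hold out entire functionality IDs so the test set only contains
    clone classes the model never saw during fine-tuning.

    pairs: iterable of (code_a, code_b, functionality_id, label) tuples.
    """
    rng = random.Random(seed)
    by_func = defaultdict(list)
    for pair in pairs:
        by_func[pair[2]].append(pair)
    func_ids = sorted(by_func)
    rng.shuffle(func_ids)
    n_test = max(1, int(len(func_ids) * test_ratio))
    test = [p for f in func_ids[:n_test] for p in by_func[f]]
    train = [p for f in func_ids[n_test:] for p in by_func[f]]
    return train, test
```
Fine-tuning CodeBERT on `train` and computing F1 on `test` then measures generalization to unseen clone classes rather than recall of snippets seen during model building.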
Related papers
- Assessing the Code Clone Detection Capability of Large Language Models [0.0]
The evaluation involves testing the models on a variety of code pairs of different clone types and levels of similarity.
Findings indicate that GPT-4 consistently surpasses GPT-3.5 across all clone types.
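Such an evaluation boils down to a pairwise probe like the sketch below, where `complete` stands in for whichever chat-model client is used; the prompt wording and answer parsing are illustrative assumptions, not the paper's exact setup:
```python
def judge_clone_pair(code_a: str, code_b: str, complete) -> bool:
    """Ask a chat model whether two fragments are clones.

    complete: callable that sends a prompt to the model and returns its
    text response (e.g. a thin wrapper around an API client).
    """
    prompt = (
        "Do the following two code fragments implement the same "
        "functionality? Answer strictly YES or NO.\n\n"
        f"Fragment 1:\n{code_a}\n\nFragment 2:\n{code_b}\n"
    )
    return complete(prompt).strip().upper().startswith("YES")
```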
arXiv Detail & Related papers (2024-07-02T16:20:44Z)
- CC2Vec: Combining Typed Tokens with Contrastive Learning for Effective Code Clone Detection [20.729032739935132]
CC2Vec is a novel code encoding method designed to swiftly identify simple code clones.
We evaluate CC2Vec on two widely used datasets (i.e., BigCloneBench and Google Code Jam)
arXiv Detail & Related papers (2024-05-01T10:18:31Z)
- SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization [51.67317895094664]
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects.
We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
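The summary suggests that attention over long files is restricted to a local window while identifier tokens stay globally visible; the sketch below encodes that reading as a mask, which is an assumption about the mechanism rather than the paper's actual sparsity pattern:
```python
import numpy as np

def identifier_aware_mask(seq_len: int, identifier_positions, window: int = 64):
    """Boolean attention mask (True = attention allowed): every token sees a
    local window; identifier tokens see, and are seen by, the whole file."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        mask[i, max(0, i - window):min(seq_len, i + window + 1)] = True
    for p in identifier_positions:
        mask[p, :] = True
        mask[:, p] = True
    return mask
```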
arXiv Detail & Related papers (2024-01-26T09:23:27Z)
- Who Made This Copy? An Empirical Analysis of Code Clone Authorship [1.1512593234650217]
We analyzed the authorship of code clones at the line-level granularity for Java files in 153 Apache projects stored on GitHub.
We found that there are a substantial number of clone lines across all projects.
One-third of clone sets are primarily contributed to by multiple leading authors.
arXiv Detail & Related papers (2023-09-03T08:24:32Z)
- CONCORD: Clone-aware Contrastive Learning for Source Code [64.51161487524436]
Self-supervised pre-training has gained traction for learning generic code representations valuable for many downstream SE tasks.
We argue that it is also essential to factor in how developers code day-to-day for general-purpose representation learning.
In particular, we propose CONCORD, a self-supervised, contrastive learning strategy to place benign clones closer in the representation space while moving deviants further apart.
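A minimal InfoNCE-style sketch of that objective; the single-anchor batch layout and temperature are assumptions, and CONCORD's actual loss may differ in detail:
```python
import torch
import torch.nn.functional as F

def clone_contrastive_loss(anchor, positive, negatives, temperature=0.07):
    """Pull a snippet's embedding toward its benign clone (positive) and
    away from deviant variants (negatives).

    anchor, positive: [d] embeddings; negatives: [n, d] embeddings.
    """
    a = F.normalize(anchor, dim=-1)
    pos = F.normalize(positive, dim=-1)
    neg = F.normalize(negatives, dim=-1)
    logits = torch.cat([(a * pos).sum(-1, keepdim=True), neg @ a]) / temperature
    # the positive sits at index 0, so it is the classification target
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```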
arXiv Detail & Related papers (2023-06-05T20:39:08Z)
- ReCode: Robustness Evaluation of Code Generation Models [90.10436771217243]
We propose ReCode, a comprehensive robustness evaluation benchmark for code generation models.
We customize over 30 transformations specifically for code on docstrings, function and variable names, code syntax, and code format.
With human annotators, we verified that over 90% of the perturbed prompts do not alter the semantic meaning of the original prompt.
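As one concrete example in this spirit, the toy rename below is a semantics-preserving perturbation for Python; it is illustrative only, not one of ReCode's transformations, and it ignores scoping and function parameters:
```python
import ast

def rename_variable(source: str, old: str, new: str) -> str:
    """Rename every occurrence of a local name in a Python snippet."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Name) and node.id == old:
            node.id = new
    return ast.unparse(tree)  # ast.unparse requires Python 3.9+

snippet = "def total(xs):\n    acc = 0\n    for x in xs:\n        acc += x\n    return acc"
print(rename_variable(snippet, "acc", "running_sum"))
```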
arXiv Detail & Related papers (2022-12-20T14:11:31Z)
- Evaluation of Contrastive Learning with Various Code Representations for Code Clone Detection [3.699097874146491]
We evaluate contrastive learning for detecting semantic clones of code snippets.
We use CodeTransformator to create a dataset that mimics plagiarised code based on competitive programming solutions.
Our evaluation shows that the proposed models perform differently across tasks; however, the graph-based models generally outperform the others.
arXiv Detail & Related papers (2022-06-17T12:25:44Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- Semantic Clone Detection via Probabilistic Software Modeling [69.43451204725324]
This article contributes a semantic clone detection approach that detects clones that have 0% syntactic similarity.
We present SCD-PSM as a stable and precise solution to semantic clone detection.
arXiv Detail & Related papers (2020-08-11T17:54:20Z)
- Contrastive Code Representation Learning [95.86686147053958]
We show that the popular reconstruction-based BERT model is sensitive to source code edits, even when the edits preserve semantics.
We propose ContraCode: a contrastive pre-training task that learns code functionality, not form.
arXiv Detail & Related papers (2020-07-09T17:59:06Z)
- Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree [30.484662671342935]
We build a graph representation of programs called the flow-augmented abstract syntax tree (FA-AST).
We apply two different types of graph neural networks on FA-AST to measure the similarity of code pairs.
Our approach outperforms the state-of-the-art approaches on both Google Code Jam and BigCloneBench tasks.
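Operationally, the pairwise setup reduces to embedding both FA-ASTs with a shared encoder and thresholding a similarity score; in this sketch the GNN itself is left abstract:
```python
import torch
import torch.nn.functional as F

def clone_score(graph_a, graph_b, encode) -> torch.Tensor:
    """encode: a GNN mapping a flow-augmented AST to a fixed-size embedding.

    Returns a similarity in [-1, 1]; pairs above a tuned threshold are
    reported as clones.
    """
    return F.cosine_similarity(encode(graph_a), encode(graph_b), dim=-1)
```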
arXiv Detail & Related papers (2020-02-20T10:18:37Z)