How the Misuse of a Dataset Harmed Semantic Clone Detection
- URL: http://arxiv.org/abs/2505.04311v1
- Date: Wed, 07 May 2025 10:52:28 GMT
- Title: How the Misuse of a Dataset Harmed Semantic Clone Detection
- Authors: Jens Krinke, Chaiyong Ragkhitwetsagul
- Abstract summary: This paper demonstrates that BigCloneBench is problematic to use as ground truth for learning or evaluating semantic code similarity. In a literature review of 179 papers that use BigCloneBench as a dataset, we found 139 papers that used BigCloneBench to evaluate semantic clone detection. We emphasise that using BigCloneBench remains valid for the intended purpose of evaluating syntactic or textual clone detection of Type-1, Type-2, and Type-3 clones.
- Score: 0.9361474110798144
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: BigCloneBench is a well-known and widely used large-scale dataset for the evaluation of recall of clone detection tools. It has been beneficial for research on clone detection and has become a standard in evaluating the performance of clone detection tools. More recently, it has also been widely used as a dataset to evaluate machine learning approaches to semantic clone detection or code similarity detection for functional or semantic similarity. This paper demonstrates that BigCloneBench is problematic to use as ground truth for learning or evaluating semantic code similarity, and highlights the aspects of BigCloneBench that affect the ground truth quality. A manual investigation of a statistically significant random sample of 406 Weak Type-3/Type-4 clone pairs revealed that 93% of them do not have a similar functionality and are therefore mislabelled. In a literature review of 179 papers that use BigCloneBench as a dataset, we found 139 papers that used BigCloneBench to evaluate semantic clone detection and where the results are threatened in their validity by the mislabelling. As such, these papers often report high F1 scores (e.g., above 0.9), which indicates overfitting to dataset-specific artefacts rather than genuine semantic similarity detection. We emphasise that using BigCloneBench remains valid for the intended purpose of evaluating syntactic or textual clone detection of Type-1, Type-2, and Type-3 clones. We acknowledge the important contributions of BigCloneBench to two decades of traditional clone detection research. However, the usage of BigCloneBench beyond the intended purpose without careful consideration of its limitations has led to misleading results and conclusions, and potentially harmed the field of semantic clone detection.
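To make the abstract's statistical claims concrete, here is a minimal Python sketch (an illustration, not part of the paper) that computes a normal-approximation 95% confidence interval for the reported 93% mislabelling rate in the sample of 406 pairs, and a back-of-the-envelope estimate of the F1 score that even an ideal semantic clone detector could achieve when scored against such noisy labels, under the simplifying assumption that only the labelled Weak Type-3/Type-4 positives are considered.

```python
from math import sqrt

# Normal-approximation 95% confidence interval for the manual-inspection result:
# 93% of a random sample of 406 Weak Type-3/Type-4 pairs were judged not
# functionally similar (i.e. mislabelled as semantic clones).
n, p_hat = 406, 0.93
se = sqrt(p_hat * (1 - p_hat) / n)
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"mislabelling rate: {p_hat:.2f}  95% CI: [{lo:.3f}, {hi:.3f}]")  # ~[0.905, 0.955]

# Back-of-the-envelope: if only ~7% of the labelled positives are genuine semantic
# clones, a hypothetical detector that flags exactly those genuine pairs (and
# nothing else) would be scored against the noisy labels as follows.
true_fraction = 1 - p_hat          # ~0.07 of labelled positives are correct
recall_vs_labels = true_fraction   # it "misses" the 93% mislabelled pairs
precision_vs_labels = 1.0          # everything it flags is a labelled positive
f1 = 2 * precision_vs_labels * recall_vs_labels / (precision_vs_labels + recall_vs_labels)
print(f"F1 of an ideal semantic detector vs. the labels: {f1:.2f}")  # ~0.13
```

Under these simplifying assumptions an ideal semantic detector tops out at an F1 of roughly 0.13 against the labels, which is why the abstract treats the F1 scores above 0.9 reported in the surveyed papers as a sign of overfitting to dataset-specific artefacts rather than genuine semantic similarity detection.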
Related papers
- CBW: Towards Dataset Ownership Verification for Speaker Verification via Clustering-based Backdoor Watermarking [85.68235482145091]
Large-scale speech datasets have become valuable intellectual property. We propose a novel dataset ownership verification method. Our approach introduces a clustering-based backdoor watermark (CBW). We conduct extensive experiments on benchmark datasets, verifying the effectiveness and robustness of our method against potential adaptive attacks.
arXiv Detail & Related papers (2025-03-02T02:02:57Z)
- Fuzzy Granule Density-Based Outlier Detection with Multi-Scale Granular Balls [65.44462297594308]
Outlier detection refers to the identification of anomalous samples that deviate significantly from the distribution of normal data. Most unsupervised outlier detection methods are carefully designed to detect specified outliers. We propose a fuzzy rough sets-based multi-scale outlier detection method to identify various types of outliers.
arXiv Detail & Related papers (2025-01-06T12:35:51Z)
- On the Use of Deep Learning Models for Semantic Clone Detection [4.796947520072581]
We propose a multi-step evaluation approach for five state-of-the-art clone detection models leveraging existing benchmark datasets. Specifically, we examine three highly-performing single-language models (ASTNN, GMN, CodeBERT) on BigCloneBench, SemanticCloneBench, and GPTCloneBench. While single-language models show high F1 scores for BigCloneBench, their performance on SemanticCloneBench varies (up to 20%). Interestingly, the cross-language model (C4) shows superior performance (around 7%) on SemanticCloneBench over other models.
arXiv Detail & Related papers (2024-12-19T11:15:02Z)
- C2P-CLIP: Injecting Category Common Prompt in CLIP to Enhance Generalization in Deepfake Detection [98.34703790782254]
We introduce Category Common Prompt CLIP, which integrates the category common prompt into the text encoder to inject category-related concepts into the image encoder. Our method achieves a 12.41% improvement in detection accuracy compared to the original CLIP, without introducing additional parameters during testing.
arXiv Detail & Related papers (2024-08-19T02:14:25Z)
- SimClone: Detecting Tabular Data Clones using Value Similarity [37.85935189975307]
The presence of data clones between datasets can cause issues when datasets containing clones are used to build AI software.
We propose a novel method called SimClone for data clone detection in tabular datasets without relying on structural information.
Our results show that our SimClone outperforms the current state-of-the-art method by at least 20% in terms of both F1-score and AUC.
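The abstract does not describe SimClone's algorithm; purely to illustrate what value-based (rather than structural) similarity between tabular datasets can look like, here is a hypothetical sketch using Jaccard overlap of column values with pandas. The function names and the 0.8 threshold are illustrative assumptions, not SimClone's published method.

```python
import pandas as pd

def value_similarity(col_a: pd.Series, col_b: pd.Series) -> float:
    """Jaccard overlap of the distinct (stringified) cell values of two columns."""
    a = set(col_a.dropna().astype(str))
    b = set(col_b.dropna().astype(str))
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def likely_column_clones(df1: pd.DataFrame, df2: pd.DataFrame, threshold: float = 0.8):
    """List column pairs whose value overlap exceeds the (illustrative) threshold."""
    matches = []
    for c1 in df1.columns:
        for c2 in df2.columns:
            sim = value_similarity(df1[c1], df2[c2])
            if sim >= threshold:
                matches.append((c1, c2, round(sim, 3)))
    return matches
```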
arXiv Detail & Related papers (2024-06-24T04:16:32Z)
- Using a Nearest-Neighbour, BERT-Based Approach for Scalable Clone Detection [0.0]
SSCD is a BERT-based clone detection approach that targets high recall of Type 3 and Type 4 clones at scale.
It does so by computing a representative embedding for each code fragment and finding similar fragments using a nearest neighbour search.
This paper details the approach and an empirical assessment towards configuring and evaluating that approach in an industrial setting.
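The summary above only names the ingredients; the following is a minimal sketch of the general embed-then-nearest-neighbour recipe it describes, assuming CodeBERT with mean pooling and scikit-learn's NearestNeighbors as stand-ins. This is an illustration of the idea, not SSCD's actual implementation or configuration.

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.neighbors import NearestNeighbors

# CodeBERT with mean pooling as a stand-in encoder (an assumption, not SSCD's exact setup).
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
encoder = AutoModel.from_pretrained("microsoft/codebert-base")

def embed(code: str) -> np.ndarray:
    """Mean-pool the token embeddings of one code fragment."""
    inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # shape (1, tokens, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()

fragments = [
    "int add(int a, int b) { return a + b; }",
    "int sum(int x, int y) { return x + y; }",
    "void log(String msg) { System.out.println(msg); }",
]
embeddings = np.stack([embed(f) for f in fragments])

# Nearest-neighbour search over the embeddings; small cosine distance -> clone candidate.
nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(embeddings)
distances, indices = nn.kneighbors(embeddings)
for i, frag in enumerate(fragments):
    j = indices[i][1]  # indices[i][0] is the fragment itself
    print(f"closest to fragment {i}: fragment {j} (cosine distance {distances[i][1]:.3f})")
```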
arXiv Detail & Related papers (2023-09-05T12:38:55Z)
- GPTCloneBench: A comprehensive benchmark of semantic clones and cross-language clones using GPT-3 model and SemanticCloneBench [1.8687918300580921]
We present a comprehensive semantic clone and cross-language clone benchmark, GPTCloneBench, by exploiting SemanticCloneBench and OpenAI's GPT-3 model.
From 79,928 clone pairs of GPT-3 output, we created a benchmark with 37,149 true semantic clone pairs, 19,288 false semantic pairs (Type-1/Type-2), and 20,770 cross-language clones across four languages (Java, C, C#, and Python).
arXiv Detail & Related papers (2023-08-26T21:50:34Z)
- Evaluation of Contrastive Learning with Various Code Representations for Code Clone Detection [3.699097874146491]
We evaluate contrastive learning for detecting semantic clones of code snippets.
We use CodeTransformator to create a dataset that mimics plagiarised code based on competitive programming solutions.
The results of our evaluation show that the proposed models perform diversely in each task; however, the performance of the graph-based models is generally above that of the others.
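The evaluated models differ, but they share a contrastive training objective; as a generic illustration (not the paper's exact setup), here is a minimal InfoNCE-style loss over paired code embeddings in PyTorch, where each original/transformed solution pair acts as a positive and the rest of the batch as negatives.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Generic InfoNCE loss: z_a[i] and z_b[i] are embeddings of two views of the
    same solution (e.g. original and plagiarised variant); every other pairing in
    the batch serves as a negative."""
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature                     # pairwise cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)   # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# Example with random embeddings standing in for encoder outputs.
batch, dim = 8, 256
loss = info_nce_loss(torch.randn(batch, dim), torch.randn(batch, dim))
print(float(loss))
```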
arXiv Detail & Related papers (2022-06-17T12:25:44Z)
- ORDisCo: Effective and Efficient Usage of Incremental Unlabeled Data for Semi-supervised Continual Learning [52.831894583501395]
Continual learning assumes the incoming data are fully labeled, which might not be applicable in real applications.
We propose deep Online Replay with Discriminator Consistency (ORDisCo) to interdependently learn a classifier with a conditional generative adversarial network (GAN).
We show ORDisCo achieves significant performance improvement on various semi-supervised learning benchmark datasets for SSCL.
arXiv Detail & Related papers (2021-01-02T09:04:14Z)
- Detection of Adversarial Supports in Few-shot Classifiers Using Feature Preserving Autoencoders and Self-Similarity [89.26308254637702]
We propose a detection strategy to highlight adversarial support sets.
We make use of feature preserving autoencoder filtering and also the concept of self-similarity of a support set to perform this detection.
Our method is attack-agnostic and, to the best of our knowledge, the first to explore detection for few-shot classifiers.
arXiv Detail & Related papers (2020-12-09T14:13:41Z)
- Semantic Clone Detection via Probabilistic Software Modeling [69.43451204725324]
This article contributes a semantic clone detection approach that detects clones that have 0% syntactic similarity.
We present SCD-PSM as a stable and precise solution to semantic clone detection.
arXiv Detail & Related papers (2020-08-11T17:54:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides (including all listed papers) and is not responsible for any consequences of its use.