StoneDetector: Conventional and versatile code clone detection for Java
- URL: http://arxiv.org/abs/2508.03435v1
- Date: Tue, 05 Aug 2025 13:23:13 GMT
- Title: StoneDetector: Conventional and versatile code clone detection for Java
- Authors: Thomas S. Heinze, André Schäfer, Wolfram Amme,
- Abstract summary: StoneDetector implements a conventional clone detection approach based upon the textual comparison of paths.<n>We show StoneDetector's performance and scalability in finding code clones in both, Java source and Bytecode.
- Score: 0.5480144998735542
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Copy & paste is a widespread practice when developing software and, thus, duplicated and subsequently modified code occurs frequently in software projects. Since such code clones, i.e., identical or similar fragments of code, can bloat software projects and cause issues like bug or vulnerability propagation, their identification is of importance. In this paper, we present the StoneDetector platform and its underlying method for finding code clones in Java source and Bytecode. StoneDetector implements a conventional clone detection approach based upon the textual comparison of paths derived from the code's representation by dominator trees. In this way, the tool does not only find exact and syntactically similar near-miss code clones, but also code clones that are harder to detect due to their larger variety in the syntax. We demonstrate StoneDetector's versatility as a conventional clone detection platform and analyze its various available configuration parameters, including the usage of different string metrics, hashing algorithms, etc. In our exhaustive evaluation with other conventional clone detectors on several state-of-the-art benchmarks, we can show StoneDetector's performance and scalability in finding code clones in both, Java source and Bytecode.
Related papers
- Industrial-Scale Neural Network Clone Detection with Disk-Based Similarity Search [0.24091079613649843]
Code clones are similar code fragments that often arise from copy-and-paste programming.<n>We extend existing neural network-based clone detection schemes to handle clones that far exceed available memory.<n>We demonstrate that our approach is around 2$times$ slower than the in-memory approach for a problem size that can fit within memory.
arXiv Detail & Related papers (2025-04-24T22:50:23Z) - CC2Vec: Combining Typed Tokens with Contrastive Learning for Effective Code Clone Detection [20.729032739935132]
CC2Vec is a novel code encoding method designed to swiftly identify simple code clones.
We evaluate CC2Vec on two widely used datasets (i.e., BigCloneBench and Google Code Jam)
arXiv Detail & Related papers (2024-05-01T10:18:31Z) - Gitor: Scalable Code Clone Detection by Building Global Sample Graph [11.041017540277558]
We propose Gitor to capture the underlying connections among different code samples.
Gitor has higher accuracy in terms of code clone detection and excellent execution time for inputs of various sizes.
arXiv Detail & Related papers (2023-11-15T08:48:50Z) - AdaCCD: Adaptive Semantic Contrasts Discovery Based Cross Lingual
Adaptation for Code Clone Detection [69.79627042058048]
AdaCCD is a novel cross-lingual adaptation method that can detect cloned codes in a new language without annotations in that language.
We evaluate the cross-lingual adaptation results of AdaCCD by constructing a multilingual code clone detection benchmark consisting of 5 programming languages.
arXiv Detail & Related papers (2023-11-13T12:20:48Z) - Zero-Shot Detection of Machine-Generated Codes [83.0342513054389]
This work proposes a training-free approach for the detection of LLMs-generated codes.
We find that existing training-based or zero-shot text detectors are ineffective in detecting code.
Our method exhibits robustness against revision attacks and generalizes well to Java codes.
arXiv Detail & Related papers (2023-10-08T10:08:21Z) - Who Made This Copy? An Empirical Analysis of Code Clone Authorship [1.1512593234650217]
We analyzed the authorship of code clones at the line-level granularity for Java files in 153 Apache projects stored on GitHub.
We found that there are a substantial number of clone lines across all projects.
One-third of clone sets are primarily contributed to by multiple leading authors.
arXiv Detail & Related papers (2023-09-03T08:24:32Z) - ZC3: Zero-Shot Cross-Language Code Clone Detection [79.53514630357876]
We propose a novel method named ZC3 for Zero-shot Cross-language Code Clone detection.
ZC3 designs the contrastive snippet prediction to form an isomorphic representation space among different programming languages.
Based on this, ZC3 exploits domain-aware learning and cycle consistency learning to generate representations that are aligned among different languages are diacritical for different types of clones.
arXiv Detail & Related papers (2023-08-26T03:48:10Z) - CONCORD: Clone-aware Contrastive Learning for Source Code [64.51161487524436]
Self-supervised pre-training has gained traction for learning generic code representations valuable for many downstream SE tasks.
We argue that it is also essential to factor in how developers code day-to-day for general-purpose representation learning.
In particular, we propose CONCORD, a self-supervised, contrastive learning strategy to place benign clones closer in the representation space while moving deviants further apart.
arXiv Detail & Related papers (2023-06-05T20:39:08Z) - Evaluation of Contrastive Learning with Various Code Representations for
Code Clone Detection [3.699097874146491]
We evaluate contrastive learning for detecting semantic clones of code snippets.
We use CodeTransformator to create a dataset that mimics plagiarised code based on competitive programming solutions.
The results of our evaluation show that proposed models perform diversely in each task, however the performance of the graph-based models is generally above the others.
arXiv Detail & Related papers (2022-06-17T12:25:44Z) - Nearest neighbor search with compact codes: A decoder perspective [77.60612610421101]
We re-interpret popular methods such as binary hashing or product quantizers as auto-encoders.
We design backward-compatible decoders that improve the reconstruction of the vectors from the same codes.
arXiv Detail & Related papers (2021-12-17T15:22:28Z) - Semantic Clone Detection via Probabilistic Software Modeling [69.43451204725324]
This article contributes a semantic clone detection approach that detects clones that have 0% syntactic similarity.
We present SCD-PSM as a stable and precise solution to semantic clone detection.
arXiv Detail & Related papers (2020-08-11T17:54:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.