An Empirical Analysis of Git Commit Logs for Potential Inconsistency in Code Clones
- URL: http://arxiv.org/abs/2409.08555v1
- Date: Fri, 13 Sep 2024 06:14:50 GMT
- Title: An Empirical Analysis of Git Commit Logs for Potential Inconsistency in Code Clones
- Authors: Reishi Yokomori, Katsuro Inoue
- Abstract summary: We analyzed 45 repositories owned by the Apache Software Foundation on GitHub.
On average, clone snippets are changed infrequently, typically only two or three times throughout their lifetime.
Co-changes account for about half of all clone changes.
- Score: 0.9745141082552166
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Code clones are code snippets that are identical or similar to other snippets within the same or different files. They are often created through copy-and-paste practices and modified during development and maintenance activities. Since a pair of code clones, known as a clone pair, has a possible logical coupling between them, it is expected that changes to each snippet are made simultaneously (co-changed) and consistently. There is extensive research on code clones, including studies related to the co-change of clones; however, detailed analysis of commit logs for code clone pairs has been limited. In this paper, we investigate the commit logs of code snippets from clone pairs, using the git-log command to extract changes to cloned code snippets. We analyzed 45 repositories owned by the Apache Software Foundation on GitHub and addressed three research questions regarding commit frequency, co-change ratio, and commit patterns. Our findings indicate that (1) on average, clone snippets are changed infrequently, typically only two or three times throughout their lifetime, (2) the ratio of co-changes is about half of all clone changes, with 10-20% of co-changed commits being concerning (potentially inconsistent), and (3) 35-65% of all clone pairs are classified as concerning clone pairs (potentially inconsistent clone pairs). These results suggest the need for a consistent management system through the commit timeline of clones.
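The extraction step described in the abstract, pulling the change history of each cloned snippet with the git-log command, can be illustrated with a small sketch. The repository path, file names, and line ranges below are hypothetical placeholders, not the authors' actual tooling; the snippet uses `git log -L` to list the commits that touched a given line range and then intersects the commit sets of a clone pair as a rough co-change check.

```python
import re
import subprocess

HASH_RE = re.compile(r"^[0-9a-f]{40}$")

def commits_touching(repo: str, path: str, start: int, end: int) -> set[str]:
    """Commits that modified the given line range, via `git log -L`."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "--format=%H", f"-L{start},{end}:{path}"],
        capture_output=True, text=True, check=True,
    ).stdout
    # `git log -L` also prints a patch per commit; keep only the lines that
    # are bare 40-character commit hashes produced by --format=%H.
    return {line for line in out.splitlines() if HASH_RE.match(line)}

# Hypothetical clone pair: two snippets assumed to be clones of each other.
a = commits_touching("some-apache-repo", "src/Foo.java", 120, 160)
b = commits_touching("some-apache-repo", "src/Bar.java", 40, 80)

co_changed = a & b              # commits in which both snippets changed together
one_sided = (a - b) | (b - a)   # one snippet changed without the other

print(f"snippet A changed {len(a)} times, snippet B changed {len(b)} times")
print(f"co-changed commits: {len(co_changed)}, one-sided commits: {len(one_sided)}")
```

In this framing, commits that appear in only one of the two sets are candidates for the "concerning" (potentially inconsistent) changes the paper counts.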
Related papers
- CC2Vec: Combining Typed Tokens with Contrastive Learning for Effective Code Clone Detection [20.729032739935132]
CC2Vec is a novel code encoding method designed to swiftly identify simple code clones.
We evaluate CC2Vec on two widely used datasets (i.e., BigCloneBench and Google Code Jam)
arXiv Detail & Related papers (2024-05-01T10:18:31Z) - Unraveling Code Clone Dynamics in Deep Learning Frameworks [0.7285835869818668]
Deep Learning (DL) frameworks play a critical role in advancing artificial intelligence, and their rapid growth underscores the need for a comprehensive understanding of software quality and maintainability.
Code clones refer to identical or highly similar source code fragments within the same project or even across different projects.
We empirically analyze code clones in nine popular DL frameworks, i.e. Paddle, PyTorch, Aesara, Ray, MXNet, Keras, Jax and BentoML.
arXiv Detail & Related papers (2024-04-25T21:12:35Z)
- Who Made This Copy? An Empirical Analysis of Code Clone Authorship [1.1512593234650217]
We analyzed the authorship of code clones at the line-level granularity for Java files in 153 Apache projects stored on GitHub.
We found that there are a substantial number of clone lines across all projects.
For one-third of clone sets, the primary contributions come from multiple leading authors.
arXiv Detail & Related papers (2023-09-03T08:24:32Z)
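The line-level authorship analysis described above can be approximated with `git blame`. A minimal sketch, assuming a hypothetical repository, file, and line range rather than the paper's actual pipeline: count how many distinct authors contributed the lines of a cloned snippet.

```python
import subprocess
from collections import Counter

def line_authors(repo: str, path: str, start: int, end: int) -> Counter:
    """Count how many lines in a range each author wrote, via `git blame`."""
    out = subprocess.run(
        ["git", "-C", repo, "blame", "--line-porcelain",
         "-L", f"{start},{end}", path],
        capture_output=True, text=True, check=True,
    ).stdout
    # --line-porcelain emits one "author <name>" header per blamed line.
    return Counter(
        line[len("author "):]
        for line in out.splitlines()
        if line.startswith("author ")
    )

# Hypothetical cloned snippet in a Java file.
authors = line_authors("some-apache-repo", "src/Foo.java", 120, 160)
print(authors.most_common())  # e.g. [('alice', 30), ('bob', 11)]
```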
- ZC3: Zero-Shot Cross-Language Code Clone Detection [79.53514630357876]
We propose a novel method named ZC3 for Zero-shot Cross-language Code Clone detection.
ZC3 designs the contrastive snippet prediction to form an isomorphic representation space among different programming languages.
Based on this, ZC3 exploits domain-aware learning and cycle-consistency learning to generate representations that are aligned across different languages and discriminative for different types of clones.
arXiv Detail & Related papers (2023-08-26T03:48:10Z)
- InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback [50.725076393314964]
We introduce InterCode, a lightweight, flexible, and easy-to-use framework of interactive coding as a standard reinforcement learning environment.
Our framework is language- and platform-agnostic and uses self-contained Docker environments to provide safe and reproducible execution.
We demonstrate InterCode's viability as a testbed by evaluating multiple state-of-the-art LLMs configured with different prompting strategies.
arXiv Detail & Related papers (2023-06-26T17:59:50Z)
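The idea of treating interactive coding as a standard reinforcement-learning environment can be sketched with a toy reset/step loop. This is not InterCode's actual API; the class, the reward check, and local execution (instead of the Docker containers the framework uses) are illustrative assumptions.

```python
import subprocess

class ToyCodingEnv:
    """Toy Gym-style environment: actions are shell commands, observations are
    their execution output, and the reward comes from a task-specific check.
    Not InterCode's real interface -- just the reset/step shape of the idea."""

    def __init__(self, task: str):
        self.task = task

    def reset(self) -> str:
        return f"Task: {self.task}"  # initial observation shown to the agent

    def step(self, action: str):
        # Run the agent's command and return its output as execution feedback.
        # (InterCode isolates execution in containers; here we run locally.)
        result = subprocess.run(action, shell=True, capture_output=True,
                                text=True, timeout=10)
        observation = (result.stdout + result.stderr).strip()
        reward = 1.0 if "hello" in observation else 0.0  # placeholder task check
        return observation, reward, reward == 1.0

env = ToyCodingEnv("make the shell print 'hello'")
obs = env.reset()
for action in ["echo hi", "echo hello"]:   # stand-in for an LLM agent's actions
    obs, reward, done = env.step(action)
    print(f"{action!r} -> {obs!r} (reward={reward})")
    if done:
        break
```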
- CONCORD: Clone-aware Contrastive Learning for Source Code [64.51161487524436]
Self-supervised pre-training has gained traction for learning generic code representations valuable for many downstream SE tasks.
We argue that it is also essential to factor in how developers code day-to-day for general-purpose representation learning.
In particular, we propose CONCORD, a self-supervised, contrastive learning strategy to place benign clones closer in the representation space while moving deviants further apart.
arXiv Detail & Related papers (2023-06-05T20:39:08Z)
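The contrastive objective described above, placing benign clones close in the representation space and pushing deviants apart, is commonly instantiated as an InfoNCE-style loss. The sketch below is a generic version of that idea over pre-computed embedding vectors, not CONCORD's exact training objective.

```python
import numpy as np

def info_nce(anchor: np.ndarray, positive: np.ndarray,
             negatives: np.ndarray, temperature: float = 0.07) -> float:
    """Generic InfoNCE loss for one anchor snippet embedding.

    `positive` is a benign clone of the anchor; `negatives` are deviant or
    unrelated snippets that should be pushed further away.
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    pos = np.exp(cos(anchor, positive) / temperature)
    neg = sum(np.exp(cos(anchor, n) / temperature) for n in negatives)
    return float(-np.log(pos / (pos + neg)))

rng = np.random.default_rng(0)
anchor = rng.normal(size=128)
clone = anchor + 0.05 * rng.normal(size=128)   # benign clone: nearly identical
deviants = rng.normal(size=(8, 128))           # buggy / unrelated snippets

print(info_nce(anchor, clone, deviants))       # small loss: the clone is already close
```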
- RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation [96.75695811963242]
RepoCoder is a framework to streamline the repository-level code completion process.
It incorporates a similarity-based retriever and a pre-trained code language model.
It consistently outperforms the vanilla retrieval-augmented code completion approach.
arXiv Detail & Related papers (2023-03-22T13:54:46Z)
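The iterative retrieval-and-generation loop summarized above can be sketched schematically. The `retrieve` and `generate` callables below are assumptions standing in for a similarity-based retriever and a pre-trained code language model; this is not RepoCoder's real interface.

```python
from typing import Callable, List

def iterative_completion(
    unfinished_code: str,
    retrieve: Callable[[str, int], List[str]],  # similarity-based retriever (assumed)
    generate: Callable[[str], str],             # pre-trained code LM (assumed)
    rounds: int = 2,
) -> str:
    """Schematic retrieval-augmented completion loop.

    Round 1 retrieves repository snippets similar to the unfinished code;
    later rounds retrieve using the previously generated completion, so the
    retrieved context better matches what the model is about to write.
    """
    query, completion = unfinished_code, ""
    for _ in range(rounds):
        context = "\n".join(retrieve(query, 5))   # top-5 similar snippets
        completion = generate(f"{context}\n{unfinished_code}")
        query = completion or unfinished_code     # refine the next round's query
    return completion
```

A caller would plug an embedding-based snippet retriever and an LLM wrapper into the two callables.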
- Generalizability of Code Clone Detection on CodeBERT [0.0]
Transformer networks such as CodeBERT already achieve outstanding results for code clone detection in benchmark datasets.
By evaluating two different subsets of Java code clones from BigCloneBench, we show that the generalizability of CodeBERT decreases.
arXiv Detail & Related papers (2022-08-26T11:24:20Z)
- Evaluation of Contrastive Learning with Various Code Representations for Code Clone Detection [3.699097874146491]
We evaluate contrastive learning for detecting semantic clones of code snippets.
We use CodeTransformator to create a dataset that mimics plagiarised code based on competitive programming solutions.
Our evaluation shows that the proposed models perform differently across tasks; however, the graph-based models generally outperform the others.
arXiv Detail & Related papers (2022-06-17T12:25:44Z)
- Faster Person Re-Identification [68.22203008760269]
We introduce a new solution for fast ReID by formulating a novel Coarse-to-Fine hashing code search strategy.
It uses shorter codes to coarsely rank broad matching similarities and longer codes to refine only a few top candidates for more accurate instance ReID.
Experimental results on 2 datasets show that our proposed method (CtF) is not only 8% more accurate but also 5x faster than contemporary hashing ReID methods.
arXiv Detail & Related papers (2020-08-16T03:02:49Z)
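The coarse-to-fine hashing search described above, where short codes cheaply shortlist candidates and long codes re-rank only that shortlist, can be sketched with Hamming distances over random binary codes. This is illustrative only, not the paper's exact CtF pipeline.

```python
import numpy as np

def hamming(query: np.ndarray, codes: np.ndarray) -> np.ndarray:
    """Hamming distances between one binary code and a matrix of binary codes."""
    return np.count_nonzero(query != codes, axis=1)

def coarse_to_fine(q_short, q_long, gallery_short, gallery_long, k=100):
    # Coarse stage: rank the whole gallery with cheap short (32-bit) codes.
    shortlist = np.argsort(hamming(q_short, gallery_short))[:k]
    # Fine stage: re-rank only the top-k shortlist with long (2048-bit) codes.
    order = np.argsort(hamming(q_long, gallery_long[shortlist]))
    return shortlist[order]

rng = np.random.default_rng(0)
gallery_short = rng.integers(0, 2, size=(10_000, 32), dtype=np.uint8)
gallery_long = rng.integers(0, 2, size=(10_000, 2048), dtype=np.uint8)

ranking = coarse_to_fine(gallery_short[42], gallery_long[42],
                         gallery_short, gallery_long)
print(ranking[:5])  # index 42 ranks first because it matches itself exactly
```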
- Semantic Clone Detection via Probabilistic Software Modeling [69.43451204725324]
This article contributes a semantic clone detection approach that detects clones that have 0% syntactic similarity.
We present SCD-PSM as a stable and precise solution to semantic clone detection.
arXiv Detail & Related papers (2020-08-11T17:54:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented (including any of its content) and is not responsible for any consequences of its use.