Related papers: A systematic literature review on source code similarity measurement and clone detection: techniques, applications, and challenges

URL: http://arxiv.org/abs/2306.16171v1
Date: Wed, 28 Jun 2023 12:49:22 GMT
Title: A systematic literature review on source code similarity measurement and clone detection: techniques, applications, and challenges
Authors: Morteza Zakeri-Nasrabadi and Saeed Parsa and Mohammad Ramezani and Chanchal Roy and Masoud Ekhtiarzadeh
Abstract summary: This paper proposes a systematic literature review and meta-analysis on code similarity measurement and evaluation techniques. A deep investigation reveals 80 software tools, working with eight different techniques on five application domains. The lack of reliable datasets, empirical evaluations, hybrid methods, and a focus on multi-paradigm languages are the main challenges in the field.
Score: 0.979963710164115
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Measuring and evaluating source code similarity is a fundamental software engineering activity that embraces a broad range of applications, including but not limited to code recommendation, duplicate code, plagiarism, malware, and smell detection. This paper proposes a systematic literature review and meta-analysis on code similarity measurement and evaluation techniques to shed light on the existing approaches and their characteristics in different applications. We initially found over 10000 articles by querying four digital libraries and ended up with 136 primary studies in the field. The studies were classified according to their methodology, programming languages, datasets, tools, and applications. A deep investigation reveals 80 software tools, working with eight different techniques on five application domains. Nearly 49% of the tools work on Java programs and 37% support C and C++, while there is no support for many programming languages. A noteworthy point was the existence of 12 datasets related to source code similarity measurement and duplicate codes, of which only eight datasets were publicly accessible. The lack of reliable datasets, empirical evaluations, hybrid methods, and focuses on multi-paradigm languages are the main challenges in the field. Emerging applications of code similarity measurement concentrate on the development phase in addition to the maintenance.

Related papers

Is Compression Really Linear with Code Intelligence? [60.123628177110206]
Format Annealing is a lightweight, transparent training methodology designed to assess the intrinsic capabilities of pre-trained models equitably. Our empirical results reveal a fundamental logarithmic relationship between measured code intelligence and bits-per-character (BPC). Our work provides a more nuanced understanding of compression's role in developing code intelligence and contributes a robust evaluation framework in the code domain.
arXiv Detail & Related papers (2025-05-16T16:59:14Z)
CoIR: A Comprehensive Benchmark for Code Information Retrieval Models [56.691926887209895]
We present CoIR (Code Information Retrieval Benchmark), a robust and comprehensive benchmark specifically designed to assess code retrieval capabilities. CoIR comprises ten meticulously curated code datasets, spanning eight distinctive retrieval tasks across seven diverse domains. We evaluate nine widely used retrieval models using CoIR, uncovering significant difficulties in performing code retrieval tasks even with state-of-the-art systems.
arXiv Detail & Related papers (2024-07-03T07:58:20Z)
A Comprehensive Library for Benchmarking Multi-class Visual Anomaly Detection [52.228708947607636]
This paper introduces a comprehensive visual anomaly detection benchmark, ADer, which is a modular framework for new methods. The benchmark includes multiple datasets from industrial and medical domains, implementing fifteen state-of-the-art methods and nine comprehensive metrics. We objectively reveal the strengths and weaknesses of different methods and provide insights into the challenges and future directions of multi-class visual anomaly detection.
arXiv Detail & Related papers (2024-06-05T13:40:07Z)
Source Code Clone Detection Using Unsupervised Similarity Measures [0.0]
This work presents a comparative analysis of unsupervised similarity measures for source code clone detection. The goal is to overview the current state-of-the-art techniques, their strengths, and their weaknesses.
arXiv Detail & Related papers (2024-01-18T10:56:27Z)
Deep Learning for Code Intelligence: Survey, Benchmark and Toolkit [63.82016263181941]
Code intelligence leverages machine learning techniques to extract knowledge from extensive code corpora. Currently, there is already a thriving research community focusing on code intelligence.
arXiv Detail & Related papers (2023-12-30T17:48:37Z)
Deep Learning Based Code Generation Methods: Literature Review [30.17038624027751]
This paper focuses on Code Generation task that aims at generating relevant code fragments according to given natural language descriptions. In this paper, we systematically review the current work on deep learning-based code generation methods.
arXiv Detail & Related papers (2023-03-02T08:25:42Z)
Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search. We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
Learning Program Semantics with Code Representations: An Empirical Study [22.953964699210296]
Program semantics learning is the core and fundamental for various code intelligent tasks. We categorize current mainstream code representation techniques into four categories. We evaluate its performance on three diverse and popular code intelligent tasks.
arXiv Detail & Related papers (2022-03-22T14:51:44Z)
Using Document Similarity Methods to create Parallel Datasets for Code Translation [60.36392618065203]
Translating source code from one programming language to another is a critical, time-consuming task. We propose to use document similarity methods to create noisy parallel datasets of code. We show that these models perform comparably to models trained on ground truth for reasonable levels of noise.
arXiv Detail & Related papers (2021-10-11T17:07:58Z)
Uncertainty Baselines: Benchmarks for Uncertainty & Robustness in Deep Learning [66.59455427102152]
We introduce Uncertainty Baselines: high-quality implementations of standard and state-of-the-art deep learning methods on a variety of tasks. Each baseline is a self-contained experiment pipeline with easily reusable and extendable components. We provide model checkpoints, experiment outputs as Python notebooks, and leaderboards for comparing results.
arXiv Detail & Related papers (2021-06-07T23:57:32Z)
Project CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks [11.10732802304274]
Project CodeNet consists of 14M code samples and about 500M lines of code in 55 different programming languages. Project CodeNet is not only unique in its scale, but also in the diversity of coding tasks it can help benchmark.
arXiv Detail & Related papers (2021-05-25T00:13:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.