Source Code Clone Detection Using Unsupervised Similarity Measures
- URL: http://arxiv.org/abs/2401.09885v3
- Date: Tue, 6 Feb 2024 15:09:13 GMT
- Title: Source Code Clone Detection Using Unsupervised Similarity Measures
- Authors: Jorge Martinez-Gil
- Abstract summary: This work presents a comparative analysis of unsupervised similarity measures for source code clone detection.
The goal is to overview the current state-of-the-art techniques, their strengths, and weaknesses.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Assessing similarity in source code has gained significant attention in
recent years due to its importance in software engineering tasks such as clone
detection and code search and recommendation. This work presents a comparative
analysis of unsupervised similarity measures for source code clone
detection. The goal is to overview the current state-of-the-art techniques,
their strengths, and weaknesses. To do that, we compile the existing
unsupervised strategies and evaluate their performance on a benchmark dataset
to guide software engineers in selecting appropriate methods for their specific
use cases. The source code of this study is available at
https://github.com/jorge-martinez-gil/codesim
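To make the idea of an unsupervised similarity measure concrete, the following is a minimal sketch of two generic measures applied to code snippets: Jaccard overlap of token sets and a sequence-matching ratio over token lists. These are illustrative choices, not necessarily the measures evaluated in the paper; the function names, the simple regex tokenizer, and the 0.7 threshold are all assumptions made for this example.

```python
import re
from difflib import SequenceMatcher


def tokenize(code: str) -> list:
    """Split source text into identifiers, integer runs, and single symbols.

    This is a deliberately naive, language-agnostic tokenizer (an assumption
    for this sketch), not a real lexer.
    """
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard index of the two snippets' token sets, in [0, 1]."""
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 1.0  # two empty snippets are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def sequence_similarity(a: str, b: str) -> float:
    """Order-sensitive similarity of the token sequences, in [0, 1]."""
    return SequenceMatcher(None, tokenize(a), tokenize(b)).ratio()


def is_clone(a: str, b: str, threshold: float = 0.7) -> bool:
    """Flag a pair as a clone candidate; the threshold is a hypothetical value."""
    return jaccard_similarity(a, b) >= threshold


if __name__ == "__main__":
    snippet_a = "int sum = a + b;"
    snippet_b = "int total = a + b;"
    print("jaccard :", jaccard_similarity(snippet_a, snippet_b))
    print("sequence:", sequence_similarity(snippet_a, snippet_b))
    print("clone?  :", is_clone(snippet_a, snippet_b))
```

Neither measure needs labeled training data, which is what makes them unsupervised; in practice the threshold would have to be tuned on a benchmark.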
Related papers
- Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? [60.84912551069379]
We present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework.
Codev-Agent is an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage.
arXiv Detail & Related papers (2024-10-02T09:11:10Z)
- DOCE: Finding the Sweet Spot for Execution-Based Code Generation [69.5305729627198]
We propose a comprehensive framework that includes candidate generation, $n$-best reranking, minimum Bayes risk (MBR) decoding, and self-debugging as the core components.
Our findings highlight the importance of execution-based methods and the gap between execution-based and execution-free methods.
arXiv Detail & Related papers (2024-08-25T07:10:36Z)
- Advanced Detection of Source Code Clones via an Ensemble of Unsupervised Similarity Measures [0.0]
This research introduces a novel ensemble learning approach for code similarity assessment.
The key idea is that the strengths of a diverse set of similarity measures can complement each other and mitigate individual weaknesses.
arXiv Detail & Related papers (2024-05-03T13:42:49Z)
- Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers [14.018844722021896]
We study the specific patterns that characterize machine- and human-authored code.
We propose DetectCodeGPT, a novel method for detecting machine-generated code.
arXiv Detail & Related papers (2024-01-12T09:15:20Z)
- Zero-Shot Detection of Machine-Generated Codes [83.0342513054389]
This work proposes a training-free approach for the detection of LLM-generated code.
We find that existing training-based or zero-shot text detectors are ineffective in detecting code.
Our method exhibits robustness against revision attacks and generalizes well to Java code.
arXiv Detail & Related papers (2023-10-08T10:08:21Z)
- CONCORD: Clone-aware Contrastive Learning for Source Code [64.51161487524436]
Self-supervised pre-training has gained traction for learning generic code representations valuable for many downstream SE tasks.
We argue that it is also essential to factor in how developers code day-to-day for general-purpose representation learning.
In particular, we propose CONCORD, a self-supervised, contrastive learning strategy to place benign clones closer in the representation space while moving deviants further apart.
arXiv Detail & Related papers (2023-06-05T20:39:08Z)
- Malicious Source Code Detection Using Transformer [0.0]
We introduce the Malicious Source code Detection using Transformers (MSDT) algorithm.
MSDT is a novel static analysis method based on deep learning that detects real-world cases of code injection into source code packages.
Our algorithm is capable of detecting functions that were injected with malicious code with precision@k values of up to 0.909.
arXiv Detail & Related papers (2022-09-16T14:16:50Z)
- Evaluation of Contrastive Learning with Various Code Representations for Code Clone Detection [3.699097874146491]
We evaluate contrastive learning for detecting semantic clones of code snippets.
We use CodeTransformator to create a dataset that mimics plagiarised code based on competitive programming solutions.
The results of our evaluation show that the proposed models perform differently on each task; however, the graph-based models generally outperform the others.
arXiv Detail & Related papers (2022-06-17T12:25:44Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- Code Clone Detection based on Event Embedding and Event Dependency [7.652540019496754]
We propose a code clone detection method based on semantic similarity.
By treating code as a series of interdependent events that occur continuously, we design a model, EDAM, to encode the semantic information of code.
Experimental results show that our EDAM model is superior to state-of-the-art open source models for code clone detection.
arXiv Detail & Related papers (2021-11-28T15:50:15Z)
- A Transformer-based Approach for Source Code Summarization [86.08359401867577]
We learn code representation for summarization by modeling the pairwise relationship between code tokens.
We show that although the approach is simple, it outperforms the state-of-the-art techniques by a significant margin.
arXiv Detail & Related papers (2020-05-01T23:29:36Z)