Semantic Clone Detection via Probabilistic Software Modeling
- URL: http://arxiv.org/abs/2008.04891v2
- Date: Sat, 21 May 2022 15:55:34 GMT
- Title: Semantic Clone Detection via Probabilistic Software Modeling
- Authors: Hannes Thaller, Lukas Linsbauer, and Alexander Egyed
- Abstract summary: This article contributes a semantic clone detection approach that detects clones that have 0% syntactic similarity.
We present SCD-PSM as a stable and precise solution to semantic clone detection.
- Score: 69.43451204725324
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semantic clone detection is the process of finding program elements with
similar or equal runtime behavior. An example is detecting the semantic equality
between the recursive and iterative implementations of the factorial
computation. Semantic clone detection is the de facto technical boundary of
clone detectors. In recent years, this boundary has been tested using
interesting new approaches. This article contributes a semantic clone detection
approach that detects clones that have 0% syntactic similarity. We present
Semantic Clone Detection via Probabilistic Software Modeling (SCD-PSM) as a
stable and precise solution to semantic clone detection. PSM builds a
probabilistic model of a program that is capable of evaluating and generating
runtime data. SCD-PSM leverages this model and its model elements for finding
behaviorally equal model elements. This behavioral equality is then generalized
to semantic equality of the original program elements. It uses the likelihood
between model elements as a distance metric. Then, it employs the likelihood
ratio significance test to decide whether this distance is significant, given a
pre-specified and controllable false-positive rate. The outputs of SCD-PSM are
pairs of program elements (i.e., methods), their distance, and a decision on
whether they are clones or not. SCD-PSM yields excellent results with a
Matthews Correlation Coefficient greater than 0.9. These results are obtained
on classical semantic clone detection problems such as detecting recursive and
iterative versions of an algorithm, but also on complex problems used in coding
competitions.
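The decision step described in the abstract (a likelihood-based distance followed by a likelihood ratio significance test at a chosen false-positive rate) can be illustrated with a minimal sketch. This is not the SCD-PSM implementation: it assumes, purely for illustration, that each method is modeled by a single univariate Gaussian fitted to the logarithm of its observed return values, and that the test statistic is approximately chi-square distributed with 2 degrees of freedom. All function names and the `alpha` parameter are hypothetical.

```python
import math


def factorial_recursive(n):
    return 1 if n <= 1 else n * factorial_recursive(n - 1)


def factorial_iterative(n):
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result


def triangular(n):
    # A behaviorally different method, used as a non-clone example.
    return n * (n + 1) // 2


def observe(method, inputs):
    # "Runtime data" stand-in: log of the return value for each input.
    return [math.log(method(n)) for n in inputs]


def gaussian_max_log_likelihood(sample):
    # Log-likelihood of the sample under its own maximum-likelihood Gaussian.
    n = len(sample)
    mean = sum(sample) / n
    # Variance is floored to avoid log(0) on constant samples.
    var = max(sum((x - mean) ** 2 for x in sample) / n, 1e-12)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)


def likelihood_ratio_test(a, b):
    # H0: both samples come from one shared Gaussian (pooled fit).
    # H1: each sample has its own Gaussian (two extra parameters).
    stat = 2 * (gaussian_max_log_likelihood(a)
                + gaussian_max_log_likelihood(b)
                - gaussian_max_log_likelihood(a + b))
    stat = max(stat, 0.0)  # guard against tiny negative rounding errors
    # Survival function of a chi-square variable with 2 degrees of freedom.
    p_value = math.exp(-stat / 2)
    return stat, p_value


def is_clone(f, g, inputs=range(1, 13), alpha=0.05):
    # alpha is the accepted false-positive rate of the significance test.
    _, p_value = likelihood_ratio_test(observe(f, inputs), observe(g, inputs))
    return p_value > alpha


if __name__ == "__main__":
    print(is_clone(factorial_recursive, factorial_iterative))
    print(is_clone(factorial_recursive, triangular))
```

Here the statistic plays the role of the distance between two methods and `alpha` controls how large that distance may be before the pair is rejected as a clone. The models in the paper are considerably richer than a single Gaussian, so this only conveys the shape of the decision rule.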
Related papers
- Using Ensemble Inference to Improve Recall of Clone Detection [0.0]
Large-scale source-code clone detection is a challenging task.
We employ four state-of-the-art neural network models and evaluate them individually/in combination.
The results, on an illustrative dataset of approximately 500K lines of C/C++ code, suggest ensemble inference outperforms individual models in all trialled cases.
arXiv Detail & Related papers (2024-02-12T09:44:59Z)
- Learning to Bound Counterfactual Inference in Structural Causal Models from Observational and Randomised Data [64.96984404868411]
We derive a likelihood characterisation for the overall data that leads us to extend a previous EM-based algorithm.
The new algorithm learns to approximate the (unidentifiability) region of model parameters from such mixed data sources.
It delivers interval approximations to counterfactual results, which collapse to points in the identifiable case.
arXiv Detail & Related papers (2022-12-06T12:42:11Z)
- Rapid Person Re-Identification via Sub-space Consistency Regularization [51.76876061721556]
Person Re-Identification (ReID) matches pedestrians across disjoint cameras.
Existing ReID methods adopting real-value feature descriptors have achieved high accuracy, but they are low in efficiency due to the slow Euclidean distance computation.
We propose a novel Sub-space Consistency Regularization (SCR) algorithm that can speed up the ReID procedure by 0.25 times.
arXiv Detail & Related papers (2022-07-13T02:44:05Z)
- Evaluation of Contrastive Learning with Various Code Representations for Code Clone Detection [3.699097874146491]
We evaluate contrastive learning for detecting semantic clones of code snippets.
We use CodeTransformator to create a dataset that mimics plagiarised code based on competitive programming solutions.
The results of our evaluation show that the proposed models perform diversely in each task; however, the performance of the graph-based models is generally above the others.
arXiv Detail & Related papers (2022-06-17T12:25:44Z)
- Alternating Mahalanobis Distance Minimization for Stable and Accurate CP Decomposition [4.847980206213335]
We introduce a new formulation for deriving singular values and vectors of a tensor by considering the critical points of a function different from what is used in the previous work.
We show that a subsweep of this algorithm can achieve a superlinear convergence rate for exact CPD with known rank.
We then view the algorithm as optimizing a Mahalanobis distance with respect to each factor with ground metric dependent on the other factors.
arXiv Detail & Related papers (2022-04-14T19:56:36Z)
- Sublinear Time Approximation of Text Similarity Matrices [50.73398637380375]
We introduce a generalization of the popular Nyström method to the indefinite setting.
Our algorithm can be applied to any similarity matrix and runs in sublinear time in the size of the matrix.
We show that our method, along with a simple variant of CUR decomposition, performs very well in approximating a variety of similarity matrices.
arXiv Detail & Related papers (2021-12-17T17:04:34Z)
- Code Clone Detection based on Event Embedding and Event Dependency [7.652540019496754]
We propose a code clone detection method based on semantic similarity.
By treating code as a series of interdependent events that occur continuously, we design a model, namely EDAM, to encode code semantic information.
Experimental results show that our EDAM model is superior to state-of-the-art open source models for code clone detection.
arXiv Detail & Related papers (2021-11-28T15:50:15Z)
- A greedy reconstruction algorithm for the identification of spin distribution [0.0]
We show that the identifiability of a piecewise constant approximation of the probability distribution is related to the invertibility of a matrix.
The algorithm aims to design specific controls which ensure that this matrix is as far as possible from a singular matrix.
arXiv Detail & Related papers (2021-08-26T12:40:52Z)
- Sparse PCA via $l_{2,p}$-Norm Regularization for Unsupervised Feature Selection [138.97647716793333]
We propose a simple and efficient unsupervised feature selection method, by combining reconstruction error with $l_{2,p}$-norm regularization.
We present an efficient optimization algorithm to solve the proposed unsupervised model, and analyse the convergence and computational complexity of the algorithm theoretically.
arXiv Detail & Related papers (2020-12-29T04:08:38Z)
- Consistency of a Recurrent Language Model With Respect to Incomplete Decoding [67.54760086239514]
We study the issue of receiving infinite-length sequences from a recurrent language model.
We propose two remedies which address inconsistency: consistent variants of top-k and nucleus sampling, and a self-terminating recurrent language model.
arXiv Detail & Related papers (2020-02-06T19:56:15Z)