Related papers: Using a Nearest-Neighbour, BERT-Based Approach for Scalable Clone Detection

URL: http://arxiv.org/abs/2309.02182v1
Date: Tue, 5 Sep 2023 12:38:55 GMT
Title: Using a Nearest-Neighbour, BERT-Based Approach for Scalable Clone Detection
Authors: Muslim Chochlov (1), Gul Aftab Ahmed (2), James Vincent Patten (1), Guoxian Lu (3), Wei Hou (4), David Gregg (2), Jim Buckley (1) ((1) Department of Computer Science and Information Systems, University of Limerick, Ireland, (2) Department of Computer Science, Trinity College Dublin, Ireland, (3) WN Digital IPD and Trustworthiness Enabling, Huawei Technologies Co., Ltd., Shanghai, China, (4) Huawei Vulnerability Management Center, Huawei Technologies Co., Ltd., Shenzhen, Guangdong, China)
Abstract summary: SSCD is a BERT-based clone detection approach that targets high recall of Type 3 and Type 4 clones at scale. It does so by computing a representative embedding for each code fragment and finding similar fragments using a nearest neighbour search. This paper details the approach and an empirical assessment towards configuring and evaluating that approach in an industrial setting.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Code clones can detrimentally impact software maintenance, and manually detecting them in very large codebases is impractical. Additionally, automated approaches find detection of Type 3 and Type 4 (inexact) clones very challenging. While the most recent artificial deep neural networks (for example BERT-based artificial neural networks) seem to be highly effective in detecting such clones, their pairwise comparison of every code pair in the target system(s) is inefficient and scales poorly on large codebases. We therefore introduce SSCD, a BERT-based clone detection approach that targets high recall of Type 3 and Type 4 clones at scale (in line with our industrial partner's requirements). It does so by computing a representative embedding for each code fragment and finding similar fragments using a nearest neighbour search. SSCD thus avoids the pairwise-comparison bottleneck of other neural network approaches while also using parallel, GPU-accelerated search to tackle scalability. This paper details the approach and an empirical assessment towards configuring and evaluating that approach in an industrial setting. The configuration analysis suggests that shorter input lengths and text-only neural network models demonstrate better efficiency in SSCD, while only slightly decreasing effectiveness. The evaluation results suggest that SSCD is more effective than state-of-the-art approaches like SAGA and SourcererCC. It is also highly efficient: in its optimal setting, SSCD effectively locates clones in the entire 320 million LOC BigCloneBench (a standard clone detection benchmark) in just under three hours.
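
The embed-then-search idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `embed` is a hypothetical toy stand-in (a hashed bag of tokens) for the BERT encoder, the exhaustive cosine-similarity scan stands in for the parallel, GPU-accelerated search, and the names `DIM`, `nearest_neighbours`, `find_clone_candidates`, `k` and `threshold` are assumptions made for illustration only.

```python
import math
import re
from collections import Counter

DIM = 64  # hypothetical embedding size for this toy example


def embed(fragment: str) -> list[float]:
    """Toy stand-in for a BERT encoder: hashed bag of tokens, L2-normalised."""
    vec = [0.0] * DIM
    for token, count in Counter(re.findall(r"\w+", fragment)).items():
        # Deterministic bucket: sum of the token's bytes modulo DIM.
        vec[sum(token.encode()) % DIM] += count
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def nearest_neighbours(index: dict[str, list[float]], query: list[float], k: int = 1):
    """Return the k (similarity, id) pairs with the highest cosine similarity."""
    scored = [
        (sum(a * b for a, b in zip(vec, query)), frag_id)
        for frag_id, vec in index.items()
    ]
    return sorted(scored, reverse=True)[:k]


def find_clone_candidates(fragments: dict[str, str], k: int = 1, threshold: float = 0.8):
    """Embed each fragment once, then report neighbours above the threshold."""
    index = {fid: embed(src) for fid, src in fragments.items()}
    pairs = set()
    for fid, vec in index.items():
        others = {o: v for o, v in index.items() if o != fid}
        for sim, other in nearest_neighbours(others, vec, k):
            if sim >= threshold:
                pairs.add(tuple(sorted((fid, other))))
    return pairs


if __name__ == "__main__":
    fragments = {
        "A": "int add(int a, int b) { return a + b; }",
        "B": "int add(int a, int b) { int r = a + b; return r; }",
        "C": "void log(String msg) { System.out.println(msg); }",
    }
    print(find_clone_candidates(fragments))
```

The point of the design is that the expensive encoder runs once per fragment rather than once per pair of fragments; only cheap vector comparisons remain pairwise, and in a real system those would be delegated to an accelerated nearest-neighbour index rather than the plain loop used here.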

Related papers

Industrial-Scale Neural Network Clone Detection with Disk-Based Similarity Search [0.24091079613649843]
Code clones are similar code fragments that often arise from copy-and-paste programming. We extend existing neural network-based clone detection schemes to handle clones that far exceed available memory. We demonstrate that our approach is around 2$\times$ slower than the in-memory approach for a problem size that can fit within memory.
arXiv Detail & Related papers (2025-04-24T22:50:23Z)
Flow-based Detection of Botnets through Bio-inspired Optimisation of Machine Learning [0.5735035463793009]
Botnets could autonomously infect, propagate, communicate and coordinate with other members in the botnet. Traditional detection methods are becoming increasingly unsuitable against various network-based detection evasion methods. This research explores the application of network flow-based behavioural modelling to facilitate the binary classification of bot network activity.
arXiv Detail & Related papers (2024-12-07T15:55:49Z)
CEBin: A Cost-Effective Framework for Large-Scale Binary Code Similarity Detection [23.8834126695488]
Binary code similarity detection (BCSD) is a fundamental technique for various applications. We propose a cost-effective BCSD framework, CEBin, which fuses embedding-based and comparison-based approaches.
arXiv Detail & Related papers (2024-02-29T03:02:07Z)
Using Ensemble Inference to Improve Recall of Clone Detection [0.0]
Large-scale source-code clone detection is a challenging task. We employ four state-of-the-art neural network models and evaluate them individually/in combination. The results, on an illustrative dataset of approximately 500K lines of C/C++ code, suggest ensemble inference outperforms individual models in all trialled cases.
arXiv Detail & Related papers (2024-02-12T09:44:59Z)
KECOR: Kernel Coding Rate Maximization for Active 3D Object Detection [48.66703222700795]
We resort to a novel kernel strategy to identify the most informative point clouds to acquire labels. To accommodate both one-stage (i.e., SECOND) and two-stage detectors, we incorporate the classification entropy tangent and strike a good trade-off between detection performance and the total number of bounding boxes selected for annotation. Our results show that approximately 44% box-level annotation costs and 26% computational time are reduced compared to the state-of-the-art method.
arXiv Detail & Related papers (2023-07-16T04:27:03Z)
UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks as well as efficiency in terms of parameters, compute cost, and inference speed. The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features. Our evaluations on five benchmarks, Synapse, BTCV, ACDC, BRaTs, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
arXiv Detail & Related papers (2022-12-08T18:59:57Z)
Efficient Nearest Neighbor Search for Cross-Encoder Models using Matrix Factorization [60.91600465922932]
We present an approach that avoids the use of a dual-encoder for retrieval, relying solely on the cross-encoder. Our approach provides test-time recall-vs-computational cost trade-offs superior to the current widely-used methods.
arXiv Detail & Related papers (2022-10-23T00:32:04Z)
ASTRO: An AST-Assisted Approach for Generalizable Neural Clone Detection [12.794933981621941]
Most neural clone detection methods do not generalize beyond the scope of clones that appear in the training dataset. We present an Abstract Syntax Tree (AST) assisted approach for generalizable neural clone detection, or ASTRO. Our experimental results show that ASTRO improves state-of-the-art neural clone detection approaches in both recall and F-1 scores.
arXiv Detail & Related papers (2022-08-17T04:50:51Z)
Evaluation of Contrastive Learning with Various Code Representations for Code Clone Detection [3.699097874146491]
We evaluate contrastive learning for detecting semantic clones of code snippets. We use CodeTransformator to create a dataset that mimics plagiarised code based on competitive programming solutions. The results of our evaluation show that the proposed models perform differently in each task; however, the performance of the graph-based models is generally above the others.
arXiv Detail & Related papers (2022-06-17T12:25:44Z)
Spotting adversarial samples for speaker verification by neural vocoders [102.1486475058963]
We adopt neural vocoders to spot adversarial samples for automatic speaker verification (ASV). We find that the difference between the ASV scores for the original and re-synthesized audio is a good indicator for discrimination between genuine and adversarial samples. Our code will be made open-source for future works to do comparison.
arXiv Detail & Related papers (2021-07-01T08:58:16Z)
Beta-CROWN: Efficient Bound Propagation with Per-neuron Split Constraints for Complete and Incomplete Neural Network Verification [151.62491805851107]
We develop $\beta$-CROWN, a bound propagation based verifier that can fully encode per-neuron splits. $\beta$-CROWN is close to three orders of magnitude faster than LP-based BaB methods for robustness verification. By terminating BaB early, our method can also be used for incomplete verification.
arXiv Detail & Related papers (2021-03-11T11:56:54Z)
SADet: Learning An Efficient and Accurate Pedestrian Detector [68.66857832440897]
This paper proposes a series of systematic optimization strategies for the detection pipeline of a one-stage detector. It forms a single shot anchor-based detector (SADet) for efficient and accurate pedestrian detection. Though structurally simple, it presents state-of-the-art results and a real-time speed of 20 FPS for VGA-resolution images.
arXiv Detail & Related papers (2020-07-26T12:32:38Z)
