Using a Nearest-Neighbour, BERT-Based Approach for Scalable Clone
Detection
- URL: http://arxiv.org/abs/2309.02182v1
- Date: Tue, 5 Sep 2023 12:38:55 GMT
- Authors: Muslim Chochlov (1), Gul Aftab Ahmed (2), James Vincent Patten (1),
Guoxian Lu (3), Wei Hou (4), David Gregg (2), Jim Buckley (1) ((1) Department
of Computer Science and Information Systems, University of Limerick, Ireland,
(2) Department of Computer Science, Trinity College Dublin, Ireland, (3) WN
Digital IPD and Trustworthiness Enabling, Huawei Technologies Co., Ltd.,
Shanghai, China, (4) Huawei Vulnerability Management Center, Huawei
Technologies Co., Ltd., Shenzhen, Guangdong, China)
- Abstract summary: SSCD is a BERT-based clone detection approach that targets high recall of Type 3 and Type 4 clones at scale.
It does so by computing a representative embedding for each code fragment and finding similar fragments using a nearest neighbour search.
This paper details the approach and an empirical assessment aimed at configuring and evaluating that approach in an industrial setting.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code clones can detrimentally impact software maintenance and manually
detecting them in very large codebases is impractical. Additionally, automated
approaches find detection of Type 3 and Type 4 (inexact) clones very
challenging. While the most recent artificial deep neural networks (for example
BERT-based artificial neural networks) seem to be highly effective in detecting
such clones, their pairwise comparison of every code pair in the target
system(s) is inefficient and scales poorly on large codebases.
We therefore introduce SSCD, a BERT-based clone detection approach that
targets high recall of Type 3 and Type 4 clones at scale (in line with our
industrial partner's requirements). It does so by computing a representative
embedding for each code fragment and finding similar fragments using a nearest
neighbour search. SSCD thus avoids the pairwise-comparison bottleneck of other
Neural Network approaches while also using parallel, GPU-accelerated search to
tackle scalability.
This paper details the approach and an empirical assessment aimed at
configuring and evaluating that approach in an industrial setting. The
configuration analysis suggests that shorter input lengths and text-only based
neural network models demonstrate better efficiency in SSCD, while only
slightly decreasing effectiveness. The evaluation results suggest that SSCD is
more effective than state-of-the-art approaches like SAGA and SourcererCC. It
is also highly efficient: in its optimal setting, SSCD effectively locates
clones in the entire 320 million LOC BigCloneBench (a standard clone detection
benchmark) in just under three hours.
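The core idea in the abstract (embed each fragment once, then search for nearest neighbours instead of running a model on every pair) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the bag-of-tokens `embed_all` function is a hypothetical stand-in for a BERT encoder, the search is brute-force NumPy rather than the GPU-accelerated search the paper describes, and the names `tokenize`, `embed_all`, `find_clone_candidates`, `k` and `threshold` are all assumptions made for this example.

```python
import re

import numpy as np


def tokenize(fragment):
    """Split a code fragment into identifiers, numbers and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", fragment)


def embed_all(fragments):
    """Hypothetical stand-in for a BERT encoder: one L2-normalised
    bag-of-tokens vector per fragment (a real system would use a model)."""
    vocab = {}
    rows = []
    for frag in fragments:
        counts = {}
        for tok in tokenize(frag):
            idx = vocab.setdefault(tok, len(vocab))
            counts[idx] = counts.get(idx, 0) + 1
        rows.append(counts)
    emb = np.zeros((len(fragments), len(vocab)), dtype=np.float32)
    for i, counts in enumerate(rows):
        for idx, c in counts.items():
            emb[i, idx] = c
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    return emb / np.maximum(norms, 1e-12)


def find_clone_candidates(fragments, k=1, threshold=0.5):
    """For each fragment, return up to k (index, similarity) pairs for its
    most similar other fragments whose cosine similarity is >= threshold."""
    emb = embed_all(fragments)      # each fragment is embedded exactly once
    sims = emb @ emb.T              # cosine similarity via one matrix product
    np.fill_diagonal(sims, -1.0)    # a fragment is not its own clone
    top = np.argsort(-sims, axis=1)[:, :k]
    return [
        [(int(j), float(sims[i, j])) for j in top[i] if sims[i, j] >= threshold]
        for i in range(len(fragments))
    ]


fragments = [
    "int add(int a, int b) { return a + b; }",
    "int sum(int x, int y) { return x + y; }",
    "void log(String msg) { System.out.println(msg); }",
]
candidates = find_clone_candidates(fragments)
```

Here the two renamed arithmetic functions are reported as each other's nearest neighbour, while the logging function has no neighbour above the threshold. The point of the design is that the expensive step (embedding) is linear in the number of fragments; only the cheap vector comparison touches pairs, and at scale that step would typically be delegated to an optimised nearest-neighbour index.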
Related papers
- CEBin: A Cost-Effective Framework for Large-Scale Binary Code Similarity
Detection [23.8834126695488]
Binary code similarity detection (BCSD) is a fundamental technique for various applications.
We propose a cost-effective BCSD framework, CEBin, which fuses embedding-based and comparison-based approaches.
arXiv Detail & Related papers (2024-02-29T03:02:07Z)
- Using Ensemble Inference to Improve Recall of Clone Detection [0.0]
Large-scale source-code clone detection is a challenging task.
We employ four state-of-the-art neural network models and evaluate them individually/in combination.
The results, on an illustrative dataset of approximately 500K lines of C/C++ code, suggest ensemble inference outperforms individual models in all trialled cases.
arXiv Detail & Related papers (2024-02-12T09:44:59Z)
- KECOR: Kernel Coding Rate Maximization for Active 3D Object Detection [48.66703222700795]
We resort to a novel kernel strategy to identify the most informative point clouds to acquire labels.
To accommodate both one-stage (i.e., SECOND) and two-stage detectors, we incorporate the classification entropy tangent and strike a good trade-off between detection performance and the total number of bounding boxes selected for annotation.
Our results show that approximately 44% box-level annotation costs and 26% computational time are reduced compared to the state-of-the-art method.
arXiv Detail & Related papers (2023-07-16T04:27:03Z)
- UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks as well as efficiency in terms of parameters, compute cost, and inference speed.
The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features.
Our evaluations on five benchmarks, Synapse, BTCV, ACDC, BRaTs, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
arXiv Detail & Related papers (2022-12-08T18:59:57Z)
- Efficient Nearest Neighbor Search for Cross-Encoder Models using Matrix
Factorization [60.91600465922932]
We present an approach that avoids the use of a dual-encoder for retrieval, relying solely on the cross-encoder.
Our approach provides test-time recall-vs-computational cost trade-offs superior to the current widely-used methods.
arXiv Detail & Related papers (2022-10-23T00:32:04Z)
- ASTRO: An AST-Assisted Approach for Generalizable Neural Clone Detection [12.794933981621941]
Most neural clone detection methods do not generalize beyond the scope of clones that appear in the training dataset.
We present an Abstract Syntax Tree (AST) assisted approach for generalizable neural clone detection, or ASTRO.
Our experimental results show that ASTRO improves state-of-the-art neural clone detection approaches in both recall and F-1 scores.
arXiv Detail & Related papers (2022-08-17T04:50:51Z)
- Evaluation of Contrastive Learning with Various Code Representations for
Code Clone Detection [3.699097874146491]
We evaluate contrastive learning for detecting semantic clones of code snippets.
We use CodeTransformator to create a dataset that mimics plagiarised code based on competitive programming solutions.
The results of our evaluation show that the proposed models perform differently on each task; however, the performance of the graph-based models is generally above the others.
arXiv Detail & Related papers (2022-06-17T12:25:44Z)
- Spotting adversarial samples for speaker verification by neural vocoders [102.1486475058963]
We adopt neural vocoders to spot adversarial samples for automatic speaker verification (ASV).
We find that the difference between the ASV scores for the original and re-synthesized audio is a good indicator for discriminating between genuine and adversarial samples.
Our code will be made open source so that future work can compare against it.
arXiv Detail & Related papers (2021-07-01T08:58:16Z)
- Beta-CROWN: Efficient Bound Propagation with Per-neuron Split
Constraints for Complete and Incomplete Neural Network Verification [151.62491805851107]
We develop $\beta$-CROWN, a bound propagation based verifier that can fully encode per-neuron splits.
$\beta$-CROWN is close to three orders of magnitude faster than LP-based BaB methods for robustness verification.
By terminating BaB early, our method can also be used for incomplete verification.
arXiv Detail & Related papers (2021-03-11T11:56:54Z)
- SADet: Learning An Efficient and Accurate Pedestrian Detector [68.66857832440897]
This paper proposes a series of systematic optimization strategies for the detection pipeline of a one-stage detector.
Together they form a single-shot anchor-based detector (SADet) for efficient and accurate pedestrian detection.
Though structurally simple, it achieves state-of-the-art results and a real-time speed of $20$ FPS for VGA-resolution images.
arXiv Detail & Related papers (2020-07-26T12:32:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.