Using Ensemble Inference to Improve Recall of Clone Detection
- URL: http://arxiv.org/abs/2402.07523v1
- Date: Mon, 12 Feb 2024 09:44:59 GMT
- Title: Using Ensemble Inference to Improve Recall of Clone Detection
- Authors: Gul Aftab Ahmed, James Vincent Patten, Yuanhua Han, Guoxian Lu, David
Gregg, Jim Buckley, Muslim Chochlov
- Abstract summary: Large-scale source-code clone detection is a challenging task.
We employ four state-of-the-art neural network models and evaluate them individually and in combination.
The results, on an illustrative dataset of approximately 500K lines of C/C++ code, suggest that ensemble inference outperforms individual models in all trialled cases.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale source-code clone detection is a challenging task. In our
previous work, we proposed an approach (SSCD) that leverages artificial neural
networks and approximate nearest neighbour search to locate clones in
large-scale bodies of code both effectively and time-efficiently. However, our
literature review suggests that the relative efficacy of
differing neural network models has not been assessed in the context of
large-scale clone detection approaches. In this work, we aim to assess several
such models individually, in terms of their potential to maximize recall, while
preserving a high level of precision during clone detection. We investigate whether
ensemble inference (in this case, using the results of more than one of these
neural network models in combination) can further assist in this task.
To assess this, we employed four state-of-the-art neural network models and
evaluated them individually and in combination. The results, on an illustrative
dataset of approximately 500K lines of C/C++ code, suggest that ensemble
inference outperforms individual models in all trialled cases where recall is
concerned. Of the individual models, the ADA model (belonging to the ChatGPT
family of models) performs best. However, commercial companies may not be
prepared to hand their proprietary source code over to the cloud, as required
by that approach. Consequently, they may be more interested in an
ensemble combination of CodeBERT-based and CodeT5 models, which yields similar
(if slightly lower) recall and precision results.
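
As a rough sketch of the ensemble-inference step described in the abstract (not the authors' implementation), the combination can be as simple as taking the union of the clone pairs reported by each model and then scoring recall and precision against a labelled reference set; the detector outputs and fragment identifiers below are hypothetical.

```python
# Sketch: ensemble inference by taking the union of clone pairs reported by
# several detectors, then scoring recall/precision against labelled pairs.
# Detector outputs and fragment ids below are hypothetical.

def normalise(pair):
    """Treat (a, b) and (b, a) as the same clone pair."""
    return tuple(sorted(pair))

def ensemble_union(*pair_sets):
    """Union of the clone pairs found by the individual models."""
    combined = set()
    for pairs in pair_sets:
        combined |= {normalise(p) for p in pairs}
    return combined

def recall_precision(predicted, ground_truth):
    predicted = {normalise(p) for p in predicted}
    ground_truth = {normalise(p) for p in ground_truth}
    true_positives = len(predicted & ground_truth)
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    precision = true_positives / len(predicted) if predicted else 0.0
    return recall, precision

codebert_pairs = {("f1", "f2"), ("f3", "f7")}          # hypothetical output, model 1
codet5_pairs = {("f2", "f1"), ("f4", "f9")}            # hypothetical output, model 2
labelled = {("f1", "f2"), ("f3", "f7"), ("f4", "f9"), ("f5", "f6")}

combined = ensemble_union(codebert_pairs, codet5_pairs)
print(recall_precision(codebert_pairs, labelled))      # (0.5, 1.0)
print(recall_precision(combined, labelled))            # (0.75, 1.0)
```

Because the union can only add pairs, recall never decreases as models are added; precision depends on how many spurious pairs each additional model contributes.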
Related papers
- Variational autoencoder-based neural network model compression [4.992476489874941]
Variational Autoencoders (VAEs), as a form of deep generative model, have been widely used in recent years.
This paper aims to explore a neural network model compression method based on VAEs.
arXiv Detail & Related papers (2024-08-25T09:06:22Z)
- Latent Semantic Consensus For Deterministic Geometric Model Fitting [109.44565542031384]
We propose an effective method called Latent Semantic Consensus (LSC).
LSC formulates the model fitting problem into two latent semantic spaces based on data points and model hypotheses.
LSC is able to provide consistent and reliable solutions within only a few milliseconds for general multi-structural model fitting.
arXiv Detail & Related papers (2024-03-11T05:35:38Z)
- Using a Nearest-Neighbour, BERT-Based Approach for Scalable Clone Detection [0.0]
SSCD is a BERT-based clone detection approach that targets high recall of Type 3 and Type 4 clones at scale.
It does so by computing a representative embedding for each code fragment and finding similar fragments using a nearest neighbour search.
This paper details the approach and an empirical assessment towards configuring and evaluating that approach in an industrial setting.
arXiv Detail & Related papers (2023-09-05T12:38:55Z)
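
A rough illustration of the embedding-plus-nearest-neighbour retrieval summarised in the SSCD entry above, not the authors' actual pipeline: each fragment is encoded with an off-the-shelf code encoder and similar fragments are retrieved from a vector index. The encoder choice, the index type, and the similarity threshold are placeholder assumptions.

```python
# Sketch (assumptions): embed C/C++ fragments with a CodeBERT-style encoder and
# retrieve candidate clones with a FAISS nearest-neighbour index.
import faiss
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "microsoft/codebert-base"   # placeholder encoder choice
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

def embed(fragments):
    """One vector per fragment: mean-pool the encoder's last hidden state."""
    batch = tokenizer(fragments, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state     # (n, seq_len, dim)
    vectors = hidden.mean(dim=1).numpy().astype("float32")
    faiss.normalize_L2(vectors)                         # cosine similarity via inner product
    return vectors

fragments = [
    "int add(int a, int b) { return a + b; }",
    "int sum(int x, int y) { return x + y; }",
    "void log_msg(const char *msg) { puts(msg); }",
]
vectors = embed(fragments)

index = faiss.IndexFlatIP(vectors.shape[1])   # exact index; swap for an approximate one at scale
index.add(vectors)
similarities, neighbours = index.search(vectors, 2)     # 2 nearest neighbours per fragment

THRESHOLD = 0.95                                        # placeholder similarity cut-off
for i in range(len(fragments)):
    for sim, j in zip(similarities[i], neighbours[i]):
        if j != i and sim >= THRESHOLD:
            print(f"candidate clone pair: ({i}, {j}), similarity {sim:.3f}")
```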
- Multilayer Multiset Neuronal Networks -- MMNNs [55.2480439325792]
The present work describes multilayer multiset neuronal networks incorporating two or more layers of coincidence similarity neurons.
The work also explores the utilization of counter-prototype points, which are assigned to the image regions to be avoided.
arXiv Detail & Related papers (2023-08-28T12:55:13Z)
- Revisiting the Evaluation of Image Synthesis with GANs [55.72247435112475]
This study presents an empirical investigation into the evaluation of synthesis performance, with generative adversarial networks (GANs) as a representative of generative models.
In particular, we make in-depth analyses of various factors, including how to represent a data point in the representation space, how to calculate a fair distance using selected samples, and how many instances to use from each set.
arXiv Detail & Related papers (2023-04-04T17:54:32Z)
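
One concrete instance of the "fair distance between two sample sets in a representation space" question examined in that paper is the Fréchet distance between Gaussians fitted to real and generated features (the quantity behind FID-style scores); a minimal sketch, with stand-in feature arrays in place of a real encoder:

```python
# Sketch: Frechet distance between Gaussians fitted to two feature sets,
# the quantity behind FID-style scores. Feature extraction is omitted;
# the arrays below are stand-ins for encoder features of real/generated images.
import numpy as np
from scipy import linalg

def frechet_distance(feats_a, feats_b):
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    cov_sqrt, _ = linalg.sqrtm(cov_a @ cov_b, disp=False)
    if np.iscomplexobj(cov_sqrt):        # drop tiny imaginary parts from numerical noise
        cov_sqrt = cov_sqrt.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * cov_sqrt))

rng = np.random.default_rng(0)
real_feats = rng.normal(0.0, 1.0, size=(512, 64))
fake_feats = rng.normal(0.1, 1.1, size=(512, 64))
print(frechet_distance(real_feats, fake_feats))
```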
- Window-Based Early-Exit Cascades for Uncertainty Estimation: When Deep Ensembles are More Efficient than Single Models [5.0401589279256065]
We show that ensembles can be more computationally efficient (at inference) than scaling single models within an architecture family.
In this work, we investigate extending these efficiency gains to tasks related to uncertainty estimation.
Experiments on ImageNet-scale data across a number of network architectures and uncertainty tasks show that the proposed window-based early-exit approach is able to achieve a superior uncertainty-computation trade-off.
arXiv Detail & Related papers (2023-03-14T15:57:54Z)
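
A toy sketch of the window-based early-exit idea summarised above, operating on precomputed softmax outputs: only samples whose small-model confidence falls inside an uncertainty window are passed to the larger model. The window bounds and probability arrays are made up for illustration.

```python
# Toy sketch of a confidence-window cascade over precomputed softmax outputs:
# the larger model is consulted only when the small model's confidence falls
# inside an uncertainty window. Window bounds and probabilities are made up.
import numpy as np

def cascade_predict(probs_small, probs_large, low=0.6, high=0.9):
    preds, large_calls = [], 0
    for p_small, p_large in zip(probs_small, probs_large):
        confidence = p_small.max()
        if low <= confidence < high:      # uncertain region: pay for the large model
            large_calls += 1
            p = (p_small + p_large) / 2.0
        else:                             # confident enough (or clearly uncertain): exit early
            p = p_small
        preds.append(int(p.argmax()))
    return np.array(preds), large_calls

probs_small = np.array([[0.95, 0.03, 0.02],
                        [0.70, 0.20, 0.10],
                        [0.40, 0.35, 0.25],
                        [0.10, 0.85, 0.05]])
probs_large = np.array([[0.90, 0.05, 0.05],
                        [0.30, 0.60, 0.10],
                        [0.20, 0.70, 0.10],
                        [0.05, 0.90, 0.05]])
preds, calls = cascade_predict(probs_small, probs_large)
print(preds, calls)   # the large model is queried for 2 of the 4 samples
```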
- Part-Based Models Improve Adversarial Robustness [57.699029966800644]
We show that combining human prior knowledge with end-to-end learning can improve the robustness of deep neural networks.
Our model combines a part segmentation model with a tiny classifier and is trained end-to-end to simultaneously segment objects into parts.
Our experiments indicate that these models also reduce texture bias and yield better robustness against common corruptions and spurious correlations.
arXiv Detail & Related papers (2022-09-15T15:41:47Z)
- Deep Generative model with Hierarchical Latent Factors for Time Series Anomaly Detection [40.21502451136054]
This work presents DGHL, a new family of generative models for time series anomaly detection.
A top-down Convolution Network maps a novel hierarchical latent space to time series windows, exploiting temporal dynamics to encode information efficiently.
Our method outperformed current state-of-the-art models on four popular benchmark datasets.
arXiv Detail & Related papers (2022-02-15T17:19:44Z)
- Multi-fidelity regression using artificial neural networks: efficient approximation of parameter-dependent output quantities [0.17499351967216337]
We present the use of artificial neural networks applied to multi-fidelity regression problems.
The introduced models are compared against a traditional multi-fidelity scheme, co-kriging.
We also show an application of multi-fidelity regression to an engineering problem.
arXiv Detail & Related papers (2021-02-26T11:29:00Z)
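
As a rough illustration of the multi-fidelity regression setting summarised above (not the paper's specific architectures), one common pattern is to fit a network on plentiful low-fidelity data and a second network that maps (input, low-fidelity prediction) to the scarce high-fidelity outputs; the toy functions and sample sizes below are placeholders.

```python
# Sketch (assumptions): a common two-network multi-fidelity pattern -- fit one
# regressor on plentiful low-fidelity data, then fit a second regressor that maps
# (x, low-fidelity prediction) to the few available high-fidelity outputs.
import numpy as np
from sklearn.neural_network import MLPRegressor

def y_high(x): return (np.sin(8.0 * x) * x).ravel()                  # placeholder expensive model
def y_low(x):  return (0.8 * np.sin(8.0 * x) * x + 0.3 * x).ravel()  # cheap, biased version

rng = np.random.default_rng(0)
x_lo = rng.uniform(0.0, 1.0, size=(200, 1))   # many cheap samples
x_hi = rng.uniform(0.0, 1.0, size=(12, 1))    # few expensive samples

net_lo = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=5000, random_state=0)
net_lo.fit(x_lo, y_low(x_lo))

features_hi = np.hstack([x_hi, net_lo.predict(x_hi).reshape(-1, 1)])
net_hi = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=5000, random_state=0)
net_hi.fit(features_hi, y_high(x_hi))

x_test = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
features_test = np.hstack([x_test, net_lo.predict(x_test).reshape(-1, 1)])
print(net_hi.predict(features_test))   # multi-fidelity prediction
print(y_high(x_test))                  # reference values
```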
- Firearm Detection via Convolutional Neural Networks: Comparing a Semantic Segmentation Model Against End-to-End Solutions [68.8204255655161]
Threat detection of weapons and aggressive behavior from live video can be used for rapid detection and prevention of potentially deadly incidents.
One way of achieving this is through the use of artificial intelligence and, in particular, machine learning for image analysis.
We compare a traditional monolithic end-to-end deep learning model and a previously proposed model based on an ensemble of simpler neural networks detecting fire-weapons via semantic segmentation.
arXiv Detail & Related papers (2020-12-17T15:19:29Z)
- Model Fusion via Optimal Transport [64.13185244219353]
We present a layer-wise model fusion algorithm for neural networks.
We show that this can successfully yield "one-shot" knowledge transfer between neural networks trained on heterogeneous non-i.i.d. data.
arXiv Detail & Related papers (2019-10-12T22:07:15Z)
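
A heavily simplified sketch of layer-wise fusion for two one-hidden-layer MLPs: hidden units of the second model are matched to those of the first via a linear assignment over weight distances and then averaged. The paper's algorithm uses optimal transport more generally; this permutation-only variant and the toy layer shapes are illustrative assumptions.

```python
# Heavily simplified sketch: layer-wise fusion of two one-hidden-layer MLPs.
# Hidden units of model B are matched to model A with a linear assignment over
# incoming-weight distances, permuted, then averaged. (The paper's algorithm
# uses optimal transport more generally; this is a permutation-only toy version.)
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)

def random_mlp(n_in=4, n_hidden=6, n_out=3):
    return {"W1": rng.normal(size=(n_hidden, n_in)),
            "W2": rng.normal(size=(n_out, n_hidden))}

def fuse(model_a, model_b):
    # Cost: squared distance between incoming weight vectors of hidden units.
    cost = ((model_a["W1"][:, None, :] - model_b["W1"][None, :, :]) ** 2).sum(axis=-1)
    _, perm = linear_sum_assignment(cost)    # unit i of A is matched to unit perm[i] of B
    b_w1 = model_b["W1"][perm, :]            # reorder B's hidden units in layer 1...
    b_w2 = model_b["W2"][:, perm]            # ...and the matching columns of layer 2
    return {"W1": (model_a["W1"] + b_w1) / 2.0,
            "W2": (model_a["W2"] + b_w2) / 2.0}

model_a, model_b = random_mlp(), random_mlp()
fused = fuse(model_a, model_b)
print(fused["W1"].shape, fused["W2"].shape)   # (6, 4) (3, 6)
```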