Using Ensemble Inference to Improve Recall of Clone Detection
- URL: http://arxiv.org/abs/2402.07523v1
- Date: Mon, 12 Feb 2024 09:44:59 GMT
- Title: Using Ensemble Inference to Improve Recall of Clone Detection
- Authors: Gul Aftab Ahmed, James Vincent Patten, Yuanhua Han, Guoxian Lu, David
Gregg, Jim Buckley, Muslim Chochlov
- Abstract summary: Large-scale source-code clone detection is a challenging task.
We employ four state-of-the-art neural network models and evaluate them individually and in combination.
The results, on an illustrative dataset of approximately 500K lines of C/C++ code, suggest that ensemble inference outperforms individual models in all trialled cases.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale source-code clone detection is a challenging task. In our
previous work, we proposed an approach (SSCD) that leverages artificial neural
networks and approximate nearest neighbour search to locate clones in
large-scale bodies of code both effectively and time-efficiently. However, our
literature review suggests that the relative efficacy of
differing neural network models has not been assessed in the context of
large-scale clone detection approaches. In this work, we aim to assess several
such models individually, in terms of their potential to maximize recall, while
preserving a high level of precision during clone detection. We investigate whether
ensemble inference (in this case, using the results of more than one of these
neural network models in combination) can further assist in this task.
To assess this, we employed four state-of-the-art neural network models and
evaluated them individually and in combination. The results, on an illustrative
dataset of approximately 500K lines of C/C++ code, suggest that ensemble
inference outperforms individual models in all trialled cases where recall is
concerned. Of the individual models, the ADA model (belonging to the ChatGPT
family of models) performs best. However, commercial companies may not be
prepared to hand their proprietary source code over to the cloud, as required
by that approach. Consequently, they may be more interested in an
ensemble combination of CodeBERT-based and CodeT5 models, which yields similar
(if slightly lower) recall and precision results.
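
As a rough sketch of the ensemble-inference step described in the abstract (not the authors' implementation), the combination can be as simple as taking the union of the clone pairs reported by each model and then scoring recall and precision against a labelled reference set; the detector outputs and fragment identifiers below are hypothetical.

```python
# Sketch: ensemble inference by taking the union of clone pairs reported by
# several detectors, then scoring recall/precision against labelled pairs.
# Detector outputs and fragment ids below are hypothetical.

def normalise(pair):
    """Treat (a, b) and (b, a) as the same clone pair."""
    return tuple(sorted(pair))

def ensemble_union(*pair_sets):
    """Union of the clone pairs found by the individual models."""
    combined = set()
    for pairs in pair_sets:
        combined |= {normalise(p) for p in pairs}
    return combined

def recall_precision(predicted, ground_truth):
    predicted = {normalise(p) for p in predicted}
    ground_truth = {normalise(p) for p in ground_truth}
    true_positives = len(predicted & ground_truth)
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    precision = true_positives / len(predicted) if predicted else 0.0
    return recall, precision

codebert_pairs = {("f1", "f2"), ("f3", "f7")}          # hypothetical output, model 1
codet5_pairs = {("f2", "f1"), ("f4", "f9")}            # hypothetical output, model 2
labelled = {("f1", "f2"), ("f3", "f7"), ("f4", "f9"), ("f5", "f6")}

combined = ensemble_union(codebert_pairs, codet5_pairs)
print(recall_precision(codebert_pairs, labelled))      # (0.5, 1.0)
print(recall_precision(combined, labelled))            # (0.75, 1.0)
```

Because the union can only add pairs, recall never decreases as models are added; precision depends on how many spurious pairs each additional model contributes.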
Related papers
- Variational autoencoder-based neural network model compression [4.992476489874941]
Variational Autoencoders (VAEs), as a form of deep generative model, have been widely used in recent years.
This paper aims to explore a neural network model compression method based on VAEs.
arXiv Detail & Related papers (2024-08-25T09:06:22Z)
- Latent Semantic Consensus For Deterministic Geometric Model Fitting [109.44565542031384]
We propose an effective method called Latent Semantic Consensus (LSC).
LSC formulates the model fitting problem into two latent semantic spaces based on data points and model hypotheses.
LSC is able to provide consistent and reliable solutions within only a few milliseconds for general multi-structural model fitting.
arXiv Detail & Related papers (2024-03-11T05:35:38Z)
- Using a Nearest-Neighbour, BERT-Based Approach for Scalable Clone Detection [0.0]
SSCD is a BERT-based clone detection approach that targets high recall of Type 3 and Type 4 clones at scale.
It does so by computing a representative embedding for each code fragment and finding similar fragments using a nearest neighbour search.
This paper details the approach and an empirical assessment towards configuring and evaluating that approach in an industrial setting.
arXiv Detail & Related papers (2023-09-05T12:38:55Z)
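
A rough illustration of the embedding-plus-nearest-neighbour retrieval summarised in the SSCD entry above, not the authors' actual pipeline: each fragment is encoded with an off-the-shelf code encoder and similar fragments are retrieved from a vector index. The encoder choice, the index type, and the similarity threshold are placeholder assumptions.

```python
# Sketch (assumptions): embed C/C++ fragments with a CodeBERT-style encoder and
# retrieve candidate clones with a FAISS nearest-neighbour index.
import faiss
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "microsoft/codebert-base"   # placeholder encoder choice
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

def embed(fragments):
    """One vector per fragment: mean-pool the encoder's last hidden state."""
    batch = tokenizer(fragments, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state     # (n, seq_len, dim)
    vectors = hidden.mean(dim=1).numpy().astype("float32")
    faiss.normalize_L2(vectors)                         # cosine similarity via inner product
    return vectors

fragments = [
    "int add(int a, int b) { return a + b; }",
    "int sum(int x, int y) { return x + y; }",
    "void log_msg(const char *msg) { puts(msg); }",
]
vectors = embed(fragments)

index = faiss.IndexFlatIP(vectors.shape[1])   # exact index; swap for an approximate one at scale
index.add(vectors)
similarities, neighbours = index.search(vectors, 2)     # 2 nearest neighbours per fragment

THRESHOLD = 0.95                                        # placeholder similarity cut-off
for i in range(len(fragments)):
    for sim, j in zip(similarities[i], neighbours[i]):
        if j != i and sim >= THRESHOLD:
            print(f"candidate clone pair: ({i}, {j}), similarity {sim:.3f}")
```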
- Multilayer Multiset Neuronal Networks -- MMNNs [55.2480439325792]
The present work describes multilayer multiset neuronal networks incorporating two or more layers of coincidence similarity neurons.
The work also explores the utilization of counter-prototype points, which are assigned to the image regions to be avoided.
arXiv Detail & Related papers (2023-08-28T12:55:13Z)
- Revisiting the Evaluation of Image Synthesis with GANs [55.72247435112475]
This study presents an empirical investigation into the evaluation of synthesis performance, with generative adversarial networks (GANs) as a representative of generative models.
In particular, we make in-depth analyses of various factors, including how to represent a data point in the representation space, how to calculate a fair distance using selected samples, and how many instances to use from each set.
arXiv Detail & Related papers (2023-04-04T17:54:32Z)
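
One concrete instance of the "fair distance between two sample sets in a representation space" question examined in that paper is the Fréchet distance between Gaussians fitted to real and generated features (the quantity behind FID-style scores); a minimal sketch, with stand-in feature arrays in place of a real encoder:

```python
# Sketch: Frechet distance between Gaussians fitted to two feature sets,
# the quantity behind FID-style scores. Feature extraction is omitted;
# the arrays below are stand-ins for encoder features of real/generated images.
import numpy as np
from scipy import linalg

def frechet_distance(feats_a, feats_b):
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    cov_sqrt, _ = linalg.sqrtm(cov_a @ cov_b, disp=False)
    if np.iscomplexobj(cov_sqrt):        # drop tiny imaginary parts from numerical noise
        cov_sqrt = cov_sqrt.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * cov_sqrt))

rng = np.random.default_rng(0)
real_feats = rng.normal(0.0, 1.0, size=(512, 64))
fake_feats = rng.normal(0.1, 1.1, size=(512, 64))
print(frechet_distance(real_feats, fake_feats))
```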
- Window-Based Early-Exit Cascades for Uncertainty Estimation: When Deep Ensembles are More Efficient than Single Models [5.0401589279256065]
We show that ensembles can be more computationally efficient (at inference) than scaling single models within an architecture family.
In this work, we investigate extending these efficiency gains to tasks related to uncertainty estimation.
Experiments on ImageNet-scale data across a number of network architectures and uncertainty tasks show that the proposed window-based early-exit approach is able to achieve a superior uncertainty-computation trade-off.
arXiv Detail & Related papers (2023-03-14T15:57:54Z)
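
A toy sketch of the window-based early-exit idea summarised above, operating on precomputed softmax outputs: only samples whose small-model confidence falls inside an uncertainty window are passed to the larger model. The window bounds and probability arrays are made up for illustration.

```python
# Toy sketch of a confidence-window cascade over precomputed softmax outputs:
# the larger model is consulted only when the small model's confidence falls
# inside an uncertainty window. Window bounds and probabilities are made up.
import numpy as np

def cascade_predict(probs_small, probs_large, low=0.6, high=0.9):
    preds, large_calls = [], 0
    for p_small, p_large in zip(probs_small, probs_large):
        confidence = p_small.max()
        if low <= confidence < high:      # uncertain region: pay for the large model
            large_calls += 1
            p = (p_small + p_large) / 2.0
        else:                             # confident enough (or clearly uncertain): exit early
            p = p_small
        preds.append(int(p.argmax()))
    return np.array(preds), large_calls

probs_small = np.array([[0.95, 0.03, 0.02],
                        [0.70, 0.20, 0.10],
                        [0.40, 0.35, 0.25],
                        [0.10, 0.85, 0.05]])
probs_large = np.array([[0.90, 0.05, 0.05],
                        [0.30, 0.60, 0.10],
                        [0.20, 0.70, 0.10],
                        [0.05, 0.90, 0.05]])
preds, calls = cascade_predict(probs_small, probs_large)
print(preds, calls)   # the large model is queried for 2 of the 4 samples
```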
- Part-Based Models Improve Adversarial Robustness [57.699029966800644]
We show that combining human prior knowledge with end-to-end learning can improve the robustness of deep neural networks.
Our model combines a part segmentation model with a tiny classifier and is trained end-to-end to simultaneously segment objects into parts.
Our experiments indicate that these models also reduce texture bias and yield better robustness against common corruptions and spurious correlations.
arXiv Detail & Related papers (2022-09-15T15:41:47Z)
- Deep Generative model with Hierarchical Latent Factors for Time Series Anomaly Detection [40.21502451136054]
This work presents DGHL, a new family of generative models for time series anomaly detection.
A top-down Convolution Network maps a novel hierarchical latent space to time series windows, exploiting temporal dynamics to encode information efficiently.
Our method outperformed current state-of-the-art models on four popular benchmark datasets.
arXiv Detail & Related papers (2022-02-15T17:19:44Z)
- Multi-fidelity regression using artificial neural networks: efficient approximation of parameter-dependent output quantities [0.17499351967216337]
We present the use of artificial neural networks applied to multi-fidelity regression problems.
The introduced models are compared against a traditional multi-fidelity scheme, co-kriging.
We also show an application of multi-fidelity regression to an engineering problem.
arXiv Detail & Related papers (2021-02-26T11:29:00Z)
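
As a rough illustration of the multi-fidelity regression setting summarised above (not the paper's specific architectures), one common pattern is to fit a network on plentiful low-fidelity data and a second network that maps (input, low-fidelity prediction) to the scarce high-fidelity outputs; the toy functions and sample sizes below are placeholders.

```python
# Sketch (assumptions): a common two-network multi-fidelity pattern -- fit one
# regressor on plentiful low-fidelity data, then fit a second regressor that maps
# (x, low-fidelity prediction) to the few available high-fidelity outputs.
import numpy as np
from sklearn.neural_network import MLPRegressor

def y_high(x): return (np.sin(8.0 * x) * x).ravel()                  # placeholder expensive model
def y_low(x):  return (0.8 * np.sin(8.0 * x) * x + 0.3 * x).ravel()  # cheap, biased version

rng = np.random.default_rng(0)
x_lo = rng.uniform(0.0, 1.0, size=(200, 1))   # many cheap samples
x_hi = rng.uniform(0.0, 1.0, size=(12, 1))    # few expensive samples

net_lo = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=5000, random_state=0)
net_lo.fit(x_lo, y_low(x_lo))

features_hi = np.hstack([x_hi, net_lo.predict(x_hi).reshape(-1, 1)])
net_hi = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=5000, random_state=0)
net_hi.fit(features_hi, y_high(x_hi))

x_test = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
features_test = np.hstack([x_test, net_lo.predict(x_test).reshape(-1, 1)])
print(net_hi.predict(features_test))   # multi-fidelity prediction
print(y_high(x_test))                  # reference values
```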
- Firearm Detection via Convolutional Neural Networks: Comparing a Semantic Segmentation Model Against End-to-End Solutions [68.8204255655161]
Threat detection of weapons and aggressive behavior from live video can be used for rapid detection and prevention of potentially deadly incidents.
One way of achieving this is through the use of artificial intelligence and, in particular, machine learning for image analysis.
We compare a traditional monolithic end-to-end deep learning model and a previously proposed model based on an ensemble of simpler neural networks detecting fire-weapons via semantic segmentation.
arXiv Detail & Related papers (2020-12-17T15:19:29Z)
- Model Fusion via Optimal Transport [64.13185244219353]
We present a layer-wise model fusion algorithm for neural networks.
We show that this can successfully yield "one-shot" knowledge transfer between neural networks trained on heterogeneous non-i.i.d. data.
arXiv Detail & Related papers (2019-10-12T22:07:15Z)
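
A heavily simplified sketch of layer-wise fusion for two one-hidden-layer MLPs: hidden units of the second model are matched to those of the first via a linear assignment over weight distances and then averaged. The paper's algorithm uses optimal transport more generally; this permutation-only variant and the toy layer shapes are illustrative assumptions.

```python
# Heavily simplified sketch: layer-wise fusion of two one-hidden-layer MLPs.
# Hidden units of model B are matched to model A with a linear assignment over
# incoming-weight distances, permuted, then averaged. (The paper's algorithm
# uses optimal transport more generally; this is a permutation-only toy version.)
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)

def random_mlp(n_in=4, n_hidden=6, n_out=3):
    return {"W1": rng.normal(size=(n_hidden, n_in)),
            "W2": rng.normal(size=(n_out, n_hidden))}

def fuse(model_a, model_b):
    # Cost: squared distance between incoming weight vectors of hidden units.
    cost = ((model_a["W1"][:, None, :] - model_b["W1"][None, :, :]) ** 2).sum(axis=-1)
    _, perm = linear_sum_assignment(cost)    # unit i of A is matched to unit perm[i] of B
    b_w1 = model_b["W1"][perm, :]            # reorder B's hidden units in layer 1...
    b_w2 = model_b["W2"][:, perm]            # ...and the matching columns of layer 2
    return {"W1": (model_a["W1"] + b_w1) / 2.0,
            "W2": (model_a["W2"] + b_w2) / 2.0}

model_a, model_b = random_mlp(), random_mlp()
fused = fuse(model_a, model_b)
print(fused["W1"].shape, fused["W2"].shape)   # (6, 4) (3, 6)
```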