ASTRO: An AST-Assisted Approach for Generalizable Neural Clone Detection
- URL: http://arxiv.org/abs/2208.08067v1
- Date: Wed, 17 Aug 2022 04:50:51 GMT
- Title: ASTRO: An AST-Assisted Approach for Generalizable Neural Clone Detection
- Authors: Yifan Zhang, Junwen Yang, Haoyu Dong, Qingchen Wang, Huajie Shao,
Kevin Leach, Yu Huang
- Abstract summary: Most neural clone detection methods do not generalize beyond the scope of clones that appear in the training dataset.
We present an Abstract Syntax Tree (AST) assisted approach for generalizable neural clone detection, or ASTRO.
Our experimental results show that ASTRO improves state-of-the-art neural clone detection approaches in both recall and F-1 scores.
- Score: 12.794933981621941
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Neural clone detection has attracted the attention of software engineering
researchers and practitioners. However, most neural clone detection methods do
not generalize beyond the scope of clones that appear in the training dataset.
This results in poor model performance, especially in terms of model recall. In
this paper, we present an Abstract Syntax Tree (AST) assisted approach for
generalizable neural clone detection, or ASTRO, a framework for finding clones
in codebases reflecting industry practices. We present three main components:
(1) an AST-inspired representation for source code that leverages program
structure and semantics, (2) a global graph representation that captures the
context of an AST among a corpus of programs, and (3) a graph embedding for
programs that, in combination with extant large-scale language models, improves
state-of-the-art code clone detection. Our experimental results show that ASTRO
improves state-of-the-art neural clone detection approaches in both recall and
F-1 scores.
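The three components named in the abstract can be illustrated with a small sketch. This is not the ASTRO implementation: it assumes Python source as input, uses Python's standard `ast` module, and swaps the paper's learned graph embedding for a plain bag-of-edges count vector compared by cosine similarity. The function names (`ast_edges`, `build_global_graph`, `program_vector`, `cosine`) and the toy corpus are hypothetical and chosen only for illustration.

```python
import ast
from collections import Counter


def ast_edges(source: str) -> list[tuple[str, str]]:
    """Return (parent_type, child_type) pairs for every edge in the program's AST."""
    tree = ast.parse(source)
    edges = []
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            edges.append((type(parent).__name__, type(child).__name__))
    return edges


def build_global_graph(corpus: dict[str, str]) -> Counter:
    """Merge the AST edges of all programs into one weighted graph over node types."""
    graph = Counter()
    for source in corpus.values():
        graph.update(ast_edges(source))
    return graph


def program_vector(source: str, graph: Counter) -> list[int]:
    """Count vector of a program's edges over the global graph's edge vocabulary.

    Assumption: a simple stand-in for a learned graph embedding.
    """
    vocab = sorted(graph)
    counts = Counter(ast_edges(source))
    return [counts[edge] for edge in vocab]


def cosine(u: list[int], v: list[int]) -> float:
    """Cosine similarity between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy corpus: two renamed copies of one function and one unrelated function.
corpus = {
    "sum_loop": "def f(xs):\n    t = 0\n    for x in xs:\n        t += x\n    return t\n",
    "sum_loop_renamed": "def g(items):\n    acc = 0\n    for i in items:\n        acc += i\n    return acc\n",
    "greet": "def h(name):\n    return 'hi ' + name\n",
}

graph = build_global_graph(corpus)
vectors = {name: program_vector(src, graph) for name, src in corpus.items()}

sim_clone = cosine(vectors["sum_loop"], vectors["sum_loop_renamed"])
sim_other = cosine(vectors["sum_loop"], vectors["greet"])
print(f"renamed clone: {sim_clone:.3f}, unrelated: {sim_other:.3f}")
```

Because the representation is built from AST node types rather than identifier names, the renamed copy yields the same vector as the original, while the unrelated function scores lower. According to the abstract, ASTRO instead combines its graph embedding with large-scale language models.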
Related papers
- Optimizing OOD Detection in Molecular Graphs: A Novel Approach with Diffusion Models [71.39421638547164]
We propose to detect OOD molecules by adopting an auxiliary diffusion model-based framework, which compares similarities between input molecules and reconstructed graphs.
Due to the generative bias towards reconstructing ID training samples, the similarity scores of OOD molecules will be much lower to facilitate detection.
Our research pioneers an approach of Prototypical Graph Reconstruction for Molecular OOD Detection, dubbed PGR-MOOD, which hinges on three innovations.
arXiv Detail & Related papers (2024-04-24T03:25:53Z)
- Fusing Dictionary Learning and Support Vector Machines for Unsupervised Anomaly Detection [1.5999407512883508]
We introduce a new anomaly detection model that unifies the OC-SVM and DL residual functions into a single composite objective.
We extend both objectives to the more general setting that allows the use of kernel functions.
arXiv Detail & Related papers (2024-04-05T12:41:53Z)
- Using Ensemble Inference to Improve Recall of Clone Detection [0.0]
Large-scale source-code clone detection is a challenging task.
We employ four state-of-the-art neural network models and evaluate them individually/in combination.
The results, on an illustrative dataset of approximately 500K lines of C/C++ code, suggest ensemble inference outperforms individual models in all trialled cases.
arXiv Detail & Related papers (2024-02-12T09:44:59Z)
- Understanding the Natural Language of DNA using Encoder-Decoder Foundation Models with Byte-level Precision [26.107996342704915]
This paper presents the Ensemble Nucleotide Byte-level-Decoder (ENBED) foundation model, analyzing DNA sequences at byte-level precision with an encoder-decoder Transformer architecture.
We use Masked Language Modeling to pre-train the foundation model using reference genome sequences and apply it in the following downstream tasks.
In each of these tasks, we demonstrate significant improvement as compared to the existing state-of-the-art results.
arXiv Detail & Related papers (2023-11-04T06:00:56Z)
- Using a Nearest-Neighbour, BERT-Based Approach for Scalable Clone Detection [0.0]
SSCD is a BERT-based clone detection approach that targets high recall of Type 3 and Type 4 clones at scale.
It does so by computing a representative embedding for each code fragment and finding similar fragments using a nearest neighbour search.
This paper details the approach and an empirical assessment towards configuring and evaluating that approach in an industrial setting.
arXiv Detail & Related papers (2023-09-05T12:38:55Z)
- Multilayer Multiset Neuronal Networks -- MMNNs [55.2480439325792]
The present work describes multilayer multiset neuronal networks incorporating two or more layers of coincidence similarity neurons.
The work also explores the utilization of counter-prototype points, which are assigned to the image regions to be avoided.
arXiv Detail & Related papers (2023-08-28T12:55:13Z)
- Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
- CASTLE: Regularization via Auxiliary Causal Graph Discovery [89.74800176981842]
We introduce Causal Structure Learning (CASTLE) regularization and propose to regularize a neural network by jointly learning the causal relationships between variables.
CASTLE efficiently reconstructs only the features in the causal DAG that have a causal neighbor, whereas reconstruction-based regularizers suboptimally reconstruct all input features.
arXiv Detail & Related papers (2020-09-28T09:49:38Z)
- A Self-Supervised Gait Encoding Approach with Locality-Awareness for 3D Skeleton Based Person Re-Identification [65.18004601366066]
Person re-identification (Re-ID) via gait features within 3D skeleton sequences is a newly-emerging topic with several advantages.
This paper proposes a self-supervised gait encoding approach that can leverage unlabeled skeleton data to learn gait representations for person Re-ID.
arXiv Detail & Related papers (2020-09-05T16:06:04Z)
- Improved Code Summarization via a Graph Neural Network [96.03715569092523]
In general, source code summarization techniques take source code as input and output a natural language description.
We present an approach that uses a graph-based neural architecture that better matches the default structure of the AST to generate these summaries.
arXiv Detail & Related papers (2020-04-06T17:36:42Z)
- Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree [30.484662671342935]
We build a graph representation of programs called flow-augmented abstract syntax tree (FA-AST).
We apply two different types of graph neural networks on FA-AST to measure the similarity of code pairs.
Our approach outperforms the state-of-the-art approaches on both Google Code Jam and BigCloneBench tasks.
arXiv Detail & Related papers (2020-02-20T10:18:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.