Related papers: AST-Enhanced or AST-Overloaded? The Surprising Impact of Hybrid Graph Representations on Code Clone Detection

URL: http://arxiv.org/abs/2506.14470v1
Date: Tue, 17 Jun 2025 12:35:17 GMT
Title: AST-Enhanced or AST-Overloaded? The Surprising Impact of Hybrid Graph Representations on Code Clone Detection
Authors: Zixian Zhang, Takfarinas Saber
Abstract summary: Code clones significantly increase software maintenance costs and heighten vulnerability risks. ASTs dominate deep learning-based code clone detection due to their precise syntactic structure representation. Recent studies address this by enriching AST-based representations with semantic graphs.
Score: 0.0
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: As one of the most detrimental code smells, code clones significantly increase software maintenance costs and heighten vulnerability risks, making their detection a critical challenge in software engineering. Abstract Syntax Trees (ASTs) dominate deep learning-based code clone detection due to their precise syntactic structure representation, but they inherently lack semantic depth. Recent studies address this by enriching AST-based representations with semantic graphs, such as Control Flow Graphs (CFGs) and Data Flow Graphs (DFGs). However, the effectiveness of various enriched AST-based representations and their compatibility with different graph-based machine learning techniques remain an open question, warranting further investigation to unlock their full potential in addressing the complexities of code clone detection. In this paper, we present a comprehensive empirical study to rigorously evaluate the effectiveness of AST-based hybrid graph representations in Graph Neural Network (GNN)-based code clone detection. We systematically compare various hybrid representations (CFG, DFG, and Flow-Augmented ASTs (FA-AST)) across multiple GNN architectures. Our experiments reveal that hybrid representations impact GNNs differently: while AST+CFG+DFG consistently enhances accuracy for convolution- and attention-based models (Graph Convolutional Networks (GCN), Graph Attention Networks (GAT)), FA-AST frequently introduces structural complexity that harms performance. Notably, the Graph Matching Network (GMN) outperforms others even with standard AST representations, highlighting its superior cross-code similarity detection and reducing the need for enriched structures.
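
To make the idea of a hybrid representation concrete, here is a minimal sketch of an AST enriched with data-flow edges. It is an illustration only, not the paper's pipeline: it uses Python's standard `ast` module, handles only straight-line top-level statements, and the function name `hybrid_graph` and the edge labels `"ast"` and `"dfg"` are assumptions made for this example.

```python
import ast

# Node kinds kept in the graph; operator and context helper nodes are skipped.
KEEP = (ast.mod, ast.stmt, ast.expr)


def hybrid_graph(source):
    """Build a toy AST + data-flow graph for straight-line Python code.

    Returns (nodes, edges), where each edge is (src_index, dst_index, kind)
    and kind is "ast" (parent to child) or "dfg" (definition to later use).
    """
    tree = ast.parse(source)
    nodes = [n for n in ast.walk(tree) if isinstance(n, KEEP)]
    index = {id(n): i for i, n in enumerate(nodes)}
    edges = []

    # Syntactic edges: every kept parent points to each kept child.
    for parent in nodes:
        for child in ast.iter_child_nodes(parent):
            if id(child) in index:
                edges.append((index[id(parent)], index[id(child)], "ast"))

    # Naive data-flow edges: link the latest assignment of a name to each
    # later read. Reads in a statement are resolved before its writes.
    last_def = {}
    for stmt in tree.body:
        names = [n for n in ast.walk(stmt) if isinstance(n, ast.Name)]
        for n in names:
            if isinstance(n.ctx, ast.Load) and n.id in last_def:
                edges.append((last_def[n.id], index[id(n)], "dfg"))
        for n in names:
            if isinstance(n.ctx, ast.Store):
                last_def[n.id] = index[id(n)]
    return nodes, edges


if __name__ == "__main__":
    nodes, edges = hybrid_graph("x = 1\ny = x + 2\nx = y\nz = x\n")
    for src, dst, kind in edges:
        if kind == "dfg":
            print(nodes[src].id, "line", nodes[src].lineno,
                  "->", "line", nodes[dst].lineno)
```

The resulting edge list could be fed to a GNN library as a typed graph; a real study would also need control-flow edges and a proper data-flow analysis that accounts for branches and loops.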

Related papers

GNN-Coder: Boosting Semantic Code Retrieval with Combined GNNs and Transformer [15.991615273248804]
We introduce GNN-Coder, a novel framework based on Graph Neural Network (GNN) to utilize Abstract Syntax Tree (AST). GNN-Coder significantly boosts retrieval performance, with a 1%-10% improvement in MRR on the CSN dataset, and a notable 20% gain in zero-shot performance on the CosQA dataset.
arXiv Detail & Related papers (2025-02-21T04:29:53Z)
Adaptive Homophily Clustering: Structure Homophily Graph Learning with Adaptive Filter for Hyperspectral Image [21.709368882043897]
Hyperspectral image (HSI) clustering has been a fundamental but challenging task with zero training labels. In this paper, a homophily structure graph learning with an adaptive filter clustering method (AHSGC) for HSI is proposed. Our AHSGC achieves high clustering accuracy, low computational complexity, and strong robustness.
arXiv Detail & Related papers (2025-01-03T01:54:16Z)
Heterogeneous Directed Hypergraph Neural Network over abstract syntax tree (AST) for Code Classification [9.01892294402701]
We propose to represent AST as a heterogeneous directed hypergraph (HDHG) and process the graph by heterogeneous directed hypergraph neural network (HDHGN) for code classification. Our method improves code understanding and can represent high-order data correlations beyond paired interactions.
arXiv Detail & Related papers (2023-05-07T09:28:16Z)
Resisting Graph Adversarial Attack via Cooperative Homophilous Augmentation [60.50994154879244]
Recent studies show that Graph Neural Networks are vulnerable and easily fooled by small perturbations. In this work, we focus on the emerging but critical attack, namely, Graph Injection Attack. We propose a general defense framework CHAGNN against GIA through cooperative homophilous augmentation of graph data and model.
arXiv Detail & Related papers (2022-11-15T11:44:31Z)
Simple and Efficient Heterogeneous Graph Neural Network [55.56564522532328]
Heterogeneous graph neural networks (HGNNs) have powerful capability to embed rich structural and semantic information of a heterogeneous graph into node representations. Existing HGNNs inherit many mechanisms from graph neural networks (GNNs) over homogeneous graphs, especially the attention mechanism and the multi-layer structure. This paper conducts an in-depth and detailed study of these mechanisms and proposes Simple and Efficient Heterogeneous Graph Neural Network (SeHGNN).
arXiv Detail & Related papers (2022-07-06T10:01:46Z)
SCGC : Self-Supervised Contrastive Graph Clustering [1.1470070927586016]
Graph clustering discovers groups or communities within networks. Deep learning methods such as autoencoders cannot incorporate rich structural information. We propose Self-Supervised Contrastive Graph Clustering (SCGC).
arXiv Detail & Related papers (2022-04-27T01:38:46Z)
GN-Transformer: Fusing Sequence and Graph Representation for Improved Code Summarization [0.0]
We propose a novel method, GN-Transformer, to learn end-to-end on a fused sequence and graph modality. The proposed methods achieve state-of-the-art performance in two code summarization datasets and across three automatic code summarization metrics.
arXiv Detail & Related papers (2021-11-17T02:51:37Z)
Software Vulnerability Detection via Deep Learning over Disaggregated Code Graph Representation [57.92972327649165]
This work explores a deep learning approach to automatically learn the insecure patterns from code corpora. Because code naturally admits graph structures with parsing, we develop a novel graph neural network (GNN) to exploit both the semantic context and structural regularity of a program.
arXiv Detail & Related papers (2021-09-07T21:24:36Z)
Improving Graph Neural Network Expressivity via Subgraph Isomorphism Counting [63.04999833264299]
"Graph Substructure Networks" (GSN) is a topologically-aware message passing scheme based on substructure encoding. We show that it is strictly more expressive than the Weisfeiler-Leman (WL) graph isomorphism test. We perform an extensive evaluation on graph classification and regression tasks and obtain state-of-the-art results in diverse real-world settings.
arXiv Detail & Related papers (2020-06-16T15:30:31Z)
Learning to Hash with Graph Neural Networks for Recommender Systems [103.82479899868191]
Graph representation learning has attracted much attention in supporting high quality candidate search at scale. Despite its effectiveness in learning embedding vectors for objects in the user-item interaction network, the computational costs to infer users' preferences in continuous embedding space are tremendous. We propose a simple yet effective discrete representation learning framework to jointly learn continuous and discrete codes.
arXiv Detail & Related papers (2020-03-04T06:59:56Z)
Efficient and Stable Graph Scattering Transforms via Pruning [86.76336979318681]
Graph scattering transforms (GSTs) offer training-free deep GCN models that extract features from graph data. The price paid by GSTs is exponential complexity in space and time that increases with the number of layers. The present work addresses the complexity limitation of GSTs by introducing an efficient so-termed pruned (p) GST approach.
arXiv Detail & Related papers (2020-01-27T16:05:56Z)

This list is automatically generated from the titles and abstracts of the papers on this site.