Breaking Obfuscation: Cluster-Aware Graph with LLM-Aided Recovery for Malicious JavaScript Detection
- URL: http://arxiv.org/abs/2507.22447v1
- Date: Wed, 30 Jul 2025 07:46:49 GMT
- Title: Breaking Obfuscation: Cluster-Aware Graph with LLM-Aided Recovery for Malicious JavaScript Detection
- Authors: Zhihong Liang, Xin Wang, Zhenhuang Hu, Liangliang Song, Lin Chen, Jingjing Guo, Yanbin Wang, Ye Tian,
- Abstract summary: Malicious JavaScript code poses significant threats to user privacy, system integrity, and enterprise security.<n>We propose DeCoda, a hybrid defense framework that combines large language model (LLM)-based deobfuscation with code graph learning.
- Score: 9.83040332336481
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rapid expansion of web-based applications and cloud services, malicious JavaScript code continues to pose significant threats to user privacy, system integrity, and enterprise security. But, detecting such threats remains challenging due to sophisticated code obfuscation techniques and JavaScript's inherent language characteristics, particularly its nested closure structures and syntactic flexibility. In this work, we propose DeCoda, a hybrid defense framework that combines large language model (LLM)-based deobfuscation with code graph learning: (1) We first construct a sophisticated prompt-learning pipeline with multi-stage refinement, where the LLM progressively reconstructs the original code structure from obfuscated inputs and then generates normalized Abstract Syntax Tree (AST) representations; (2) In JavaScript ASTs, dynamic typing scatters semantically similar nodes while deeply nested functions fracture scope capturing, introducing structural noise and semantic ambiguity. To address these challenges, we then propose to learn hierarchical code graph representations via a Cluster-wise Graph that synergistically integrates graph transformer network, node clustering, and node-to-cluster attention to simultaneously capture both local node-level semantics and global cluster-induced structural relationships from AST graph. Experimental results demonstrate that our method achieves F1-scores of 94.64% and 97.71% on two benchmark datasets, demonstrating absolute improvements of 10.74% and 13.85% over state-of-the-art baselines. In false-positive control evaluation at fixed FPR levels (0.0001, 0.001, 0.01), our approach delivers 4.82, 5.91, and 2.53 higher TPR respectively compared to the best-performing baseline. These results highlight the effectiveness of LLM-based deobfuscation and underscore the importance of modeling cluster-level relationships in detecting malicious code.
Related papers
- AST-Enhanced or AST-Overloaded? The Surprising Impact of Hybrid Graph Representations on Code Clone Detection [0.0]
Code clones significantly increase software maintenance costs and heighten vulnerability risks.<n>ASTs dominate deep learning-based code clone detection due to their precise syntactic structure representation.<n>Recent studies address this by enriching AST-based representations with semantic graphs.
arXiv Detail & Related papers (2025-06-17T12:35:17Z) - Obfuscation-Resilient Binary Code Similarity Analysis using Dominance Enhanced Semantic Graph [11.549110908614873]
ORCAS is an Obfuscation-Resilient BCSA model based on Dominance Enhanced Semantic Graph (DESG)<n>We develop ORCAS, an Obfuscation-Resilient BCSA model based on Dominance Enhanced Semantic Graph (DESG)
arXiv Detail & Related papers (2025-06-06T15:26:53Z) - Secret Breach Detection in Source Code with Large Language Models [2.5484785866796833]
Leaking sensitive information in source code remains a persistent security threat.<n>This work aims to enhance secret detection in source code using large language models (LLMs)<n>We evaluate the feasibility of using fine-tuned, smaller models for local deployment.
arXiv Detail & Related papers (2025-04-26T03:33:14Z) - GNN-Coder: Boosting Semantic Code Retrieval with Combined GNNs and Transformer [15.991615273248804]
We introduce GNN-Coder, a novel framework based on Graph Neural Network (GNN) to utilize Abstract Syntax Tree (AST)<n>GNN-Coder significantly boosts retrieval performance, with a 1%-10% improvement in MRR on the CSN dataset, and a notable 20% gain in zero-shot performance on the CosQA dataset.
arXiv Detail & Related papers (2025-02-21T04:29:53Z) - Leveraging Joint Predictive Embedding and Bayesian Inference in Graph Self Supervised Learning [0.0]
Graph representation learning has emerged as a cornerstone for tasks like node classification and link prediction.<n>Current self-supervised learning (SSL) methods face challenges such as computational inefficiency, reliance on contrastive objectives, and representation collapse.<n>We propose a novel joint embedding predictive framework for graph SSL that eliminates contrastive objectives and negative sampling while preserving semantic and structural information.
arXiv Detail & Related papers (2025-02-02T07:42:45Z) - Bridge the Points: Graph-based Few-shot Segment Anything Semantically [79.1519244940518]
Recent advancements in pre-training techniques have enhanced the capabilities of vision foundation models.
Recent studies extend the SAM to Few-shot Semantic segmentation (FSS)
We propose a simple yet effective approach based on graph analysis.
arXiv Detail & Related papers (2024-10-09T15:02:28Z) - FoC: Figure out the Cryptographic Functions in Stripped Binaries with LLMs [51.898805184427545]
We propose a novel framework called FoC to Figure out the Cryptographic functions in stripped binaries.<n>We first build a binary large language model (FoC-BinLLM) to summarize the semantics of cryptographic functions in natural language.<n>We then build a binary code similarity model (FoC-Sim) upon the FoC-BinLLM to create change-sensitive representations and use it to retrieve similar implementations of unknown cryptographic functions in a database.
arXiv Detail & Related papers (2024-03-27T09:45:33Z) - UGMAE: A Unified Framework for Graph Masked Autoencoders [67.75493040186859]
We propose UGMAE, a unified framework for graph masked autoencoders.
We first develop an adaptive feature mask generator to account for the unique significance of nodes.
We then design a ranking-based structure reconstruction objective joint with feature reconstruction to capture holistic graph information.
arXiv Detail & Related papers (2024-02-12T19:39:26Z) - Semantics-Aware Dynamic Localization and Refinement for Referring Image
Segmentation [102.25240608024063]
Referring image segments an image from a language expression.
We develop an algorithm that shifts from being localization-centric to segmentation-language.
Compared to its counterparts, our method is more versatile yet effective.
arXiv Detail & Related papers (2023-03-11T08:42:40Z) - VELVET: a noVel Ensemble Learning approach to automatically locate
VulnErable sTatements [62.93814803258067]
This paper presents VELVET, a novel ensemble learning approach to locate vulnerable statements in source code.
Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph.
VELVET achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively.
arXiv Detail & Related papers (2021-12-20T22:45:27Z) - Software Vulnerability Detection via Deep Learning over Disaggregated
Code Graph Representation [57.92972327649165]
This work explores a deep learning approach to automatically learn the insecure patterns from code corpora.
Because code naturally admits graph structures with parsing, we develop a novel graph neural network (GNN) to exploit both the semantic context and structural regularity of a program.
arXiv Detail & Related papers (2021-09-07T21:24:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.