K-ASTRO: Structure-Aware Adaptation of LLMs for Code Vulnerability Detection
- URL: http://arxiv.org/abs/2208.08067v3
- Date: Mon, 13 Oct 2025 22:24:55 GMT
- Title: K-ASTRO: Structure-Aware Adaptation of LLMs for Code Vulnerability Detection
- Authors: Yifan Zhang, Michael Sandborn, Stefan Larson, Yu Huang, Kevin Leach,
- Abstract summary: K-ASTRO is a lightweight Transformer model that combines semantic embeddings from Large Language Models with structural features of Abstract Syntax Trees (ASTs) to improve efficiency and accuracy in code vulnerability detection.<n>Our approach introduces an AST-based augmentation technique inspired by mutation testing, a structure-aware attention mechanism that incorporates augmented AST features, and a joint adaptation pipeline to unify code semantics and syntax.
- Score: 12.458619777971956
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are transforming software engineering tasks, including code vulnerability detection-a critical area of software security. However, existing methods often rely on resource-intensive models or graph-based techniques, limiting their accessibility and practicality. This paper introduces K-ASTRO, a lightweight Transformer model that combines semantic embeddings from LLMs with structural features of Abstract Syntax Trees (ASTs) to improve both efficiency and accuracy in code vulnerability detection. Our approach introduces an AST-based augmentation technique inspired by mutation testing, a structure-aware attention mechanism that incorporates augmented AST features, and a joint adaptation pipeline to unify code semantics and syntax. Experimental results on three large-scale datasets, including BigVul, DiverseVul, and PrimeVul-demonstrate state-of-the-art performance while enabling rapid inference on CPUs with minimal training time. By offering a scalable, interpretable, and efficient solution, K-ASTRO bridges the gap between LLM advancements and practical software vulnerability detection, providing open-sourced tools to foster further research.
Related papers
- RooflineBench: A Benchmarking Framework for On-Device LLMs via Roofline Analysis [53.90240071275054]
The transition toward localized intelligence through Small Language Models (SLMs) has intensified the need for rigorous performance characterization on resource-constrained edge hardware.<n>We propose a systematic framework that unifies architectural primitives and hardware constraints through the lens of operational intensity (OI)<n>By defining an inference-potential region, we introduce the Relative Inference Potential as a novel metric to compare efficiency differences between Large Language Models (LLMs) on the same hardware substrate.
arXiv Detail & Related papers (2026-02-12T03:02:22Z) - Ensembling Large Language Models for Code Vulnerability Detection: An Empirical Evaluation [69.8237598448941]
This study investigates the potential of ensemble learning to enhance the performance of Large Language Models (LLMs) in source code vulnerability detection.<n>We propose Dynamic Gated Stacking (DGS), a Stacking variant tailored for vulnerability detection.
arXiv Detail & Related papers (2025-09-16T03:48:22Z) - Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs [78.09559830840595]
We present the first systematic study on quantizing diffusion-based language models.<n>We identify the presence of activation outliers, characterized by abnormally large activation values.<n>We implement state-of-the-art PTQ methods and conduct a comprehensive evaluation.
arXiv Detail & Related papers (2025-08-20T17:59:51Z) - Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality.<n>We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
arXiv Detail & Related papers (2025-05-28T17:57:47Z) - Enhancing Large Language Models with Faster Code Preprocessing for Vulnerability Detection [0.0]
We build on the existing SCoPE framework and introduce SCoPE2, an enhanced version with improved performance.<n>Our results show a 97.3% reduction in processing time with SCoPE2, along with an improved F1-score for the Large Language Model (LLM) for vulnerability detection.
arXiv Detail & Related papers (2025-05-08T19:00:11Z) - EnStack: An Ensemble Stacking Framework of Large Language Models for Enhanced Vulnerability Detection in Source Code [1.9374282535132379]
We introduce EnStack, a novel ensemble stacking framework that enhances vulnerability detection using natural language processing (NLP) techniques.
Our approach synergizes multiple pre-trained large language models (LLMs) specialized in code understanding.
meta-classifiers consolidate the strengths of each LLM, resulting in a comprehensive model that excels in detecting subtle and complex vulnerabilities.
arXiv Detail & Related papers (2024-11-25T16:47:10Z) - Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities.
In-Context Learning (ICL) and.
Efficient Fine-Tuning (PEFT) are currently two mainstream methods for augmenting.
LLMs to downstream tasks.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z) - M2CVD: Enhancing Vulnerability Semantic through Multi-Model Collaboration for Code Vulnerability Detection [52.4455893010468]
Large Language Models (LLMs) have strong capabilities in code comprehension, but fine-tuning costs and semantic alignment issues limit their project-specific optimization.
Code models such CodeBERT are easy to fine-tune, but it is often difficult to learn vulnerability semantics from complex code languages.
This paper introduces the Multi-Model Collaborative Vulnerability Detection approach (M2CVD) to improve the detection accuracy of code models.
arXiv Detail & Related papers (2024-06-10T00:05:49Z) - Security Vulnerability Detection with Multitask Self-Instructed Fine-Tuning of Large Language Models [8.167614500821223]
We introduce MSIVD, multitask self-instructed fine-tuning for vulnerability detection, inspired by chain-of-thought prompting and LLM self-instruction.
Our experiments demonstrate that MSIVD achieves superior performance, outperforming the highest LLM-based vulnerability detector baseline (LineVul) with a F1 score of 0.92 on the BigVul dataset, and 0.48 on the PreciseBugs dataset.
arXiv Detail & Related papers (2024-06-09T19:18:05Z) - Optimizing OOD Detection in Molecular Graphs: A Novel Approach with Diffusion Models [71.39421638547164]
We propose to detect OOD molecules by adopting an auxiliary diffusion model-based framework, which compares similarities between input molecules and reconstructed graphs.
Due to the generative bias towards reconstructing ID training samples, the similarity scores of OOD molecules will be much lower to facilitate detection.
Our research pioneers an approach of Prototypical Graph Reconstruction for Molecular OOD Detection, dubbed as PGR-MOOD and hinges on three innovations.
arXiv Detail & Related papers (2024-04-24T03:25:53Z) - Fusing Dictionary Learning and Support Vector Machines for Unsupervised Anomaly Detection [1.5999407512883508]
We introduce a new anomaly detection model that unifies the OC-SVM and DL residual functions into a single composite objective.
We extend both objectives to the more general setting that allows the use of kernel functions.
arXiv Detail & Related papers (2024-04-05T12:41:53Z) - An Empirical Study of Automated Vulnerability Localization with Large Language Models [21.84971967029474]
Large Language Models (LLMs) have shown potential in various domains, yet their effectiveness in vulnerability localization remains underexplored.
Our investigation encompasses 10+ leading LLMs suitable for code analysis, including ChatGPT and various open-source models.
We explore the efficacy of these LLMs using 4 distinct paradigms: zero-shot learning, one-shot learning, discriminative fine-tuning, and generative fine-tuning.
arXiv Detail & Related papers (2024-03-30T08:42:10Z) - Using Ensemble Inference to Improve Recall of Clone Detection [0.0]
Large-scale source-code clone detection is a challenging task.
We employ four state-of-the-art neural network models and evaluate them individually/in combination.
The results, on an illustrative dataset of approximately 500K lines of C/C++ code, suggest ensemble inference outperforms individual models in all trialled cases.
arXiv Detail & Related papers (2024-02-12T09:44:59Z) - Understanding the Natural Language of DNA using Encoder-Decoder Foundation Models with Byte-level Precision [26.107996342704915]
This paper presents the Ensemble Nucleotide Byte-level-Decoder (ENBED) foundation model, analyzing DNA sequences at byte-level precision with an encoder-decoder Transformer architecture.
We use Masked Language Modeling to pre-train the foundation model using reference genome sequences and apply it in the following downstream tasks.
In each of these tasks, we demonstrate significant improvement as compared to the existing state-of-the-art results.
arXiv Detail & Related papers (2023-11-04T06:00:56Z) - Using a Nearest-Neighbour, BERT-Based Approach for Scalable Clone
Detection [0.0]
SSCD is a BERT-based clone detection approach that targets high recall of Type 3 and Type 4 clones at scale.
It does so by computing a representative embedding for each code fragment and finding similar fragments using a nearest neighbour search.
This paper details the approach and an empirical assessment towards configuring and evaluating that approach in industrial setting.
arXiv Detail & Related papers (2023-09-05T12:38:55Z) - Multilayer Multiset Neuronal Networks -- MMNNs [55.2480439325792]
The present work describes multilayer multiset neuronal networks incorporating two or more layers of coincidence similarity neurons.
The work also explores the utilization of counter-prototype points, which are assigned to the image regions to be avoided.
arXiv Detail & Related papers (2023-08-28T12:55:13Z) - DCDetector: An IoT terminal vulnerability mining system based on
distributed deep ensemble learning under source code representation [2.561778620560749]
The goal of the research is to intelligently detect vulnerabilities in source codes of high-level languages such as C/C++.
This enables us to propose a code representation of sensitive sentence-related slices of source code, and to detect vulnerabilities by designing a distributed deep ensemble learning model.
Experiments show that this method can reduce the false positive rate of traditional static analysis and improve the performance and accuracy of machine learning.
arXiv Detail & Related papers (2022-11-29T14:19:14Z) - Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z) - CASTLE: Regularization via Auxiliary Causal Graph Discovery [89.74800176981842]
We introduce Causal Structure Learning (CASTLE) regularization and propose to regularize a neural network by jointly learning the causal relationships between variables.
CASTLE efficiently reconstructs only the features in the causal DAG that have a causal neighbor, whereas reconstruction-based regularizers suboptimally reconstruct all input features.
arXiv Detail & Related papers (2020-09-28T09:49:38Z) - A Self-Supervised Gait Encoding Approach with Locality-Awareness for 3D
Skeleton Based Person Re-Identification [65.18004601366066]
Person re-identification (Re-ID) via gait features within 3D skeleton sequences is a newly-emerging topic with several advantages.
This paper proposes a self-supervised gait encoding approach that can leverage unlabeled skeleton data to learn gait representations for person Re-ID.
arXiv Detail & Related papers (2020-09-05T16:06:04Z) - Improved Code Summarization via a Graph Neural Network [96.03715569092523]
In general, source code summarization techniques use the source code as input and outputs a natural language description.
We present an approach that uses a graph-based neural architecture that better matches the default structure of the AST to generate these summaries.
arXiv Detail & Related papers (2020-04-06T17:36:42Z) - Detecting Code Clones with Graph Neural Networkand Flow-Augmented
Abstract Syntax Tree [30.484662671342935]
We build a graph representation of programs called flow-augmented abstract syntax tree (FA-AST)
We apply two different types of graph neural networks on FA-AST to measure the similarity of code pairs.
Our approach outperforms the state-of-the-art approaches on both Google Code Jam and BigCloneBench tasks.
arXiv Detail & Related papers (2020-02-20T10:18:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.