Code Structure-Aware through Line-level Semantic Learning for Code Vulnerability Detection
- URL: http://arxiv.org/abs/2407.18877v1
- Date: Fri, 26 Jul 2024 17:15:58 GMT
- Title: Code Structure-Aware through Line-level Semantic Learning for Code Vulnerability Detection
- Authors: Ziliang Wang, Ge Li, Jia Li, Yihong Dong, Yingfei Xiong, Zhi Jin
- Abstract summary: We propose a novel network architecture based on pre-trained code models, which incorporates structural information awareness.
We introduce a new network architecture, the Code Structure-Aware Network through Line-level Semantic Learning (CSLS), which integrates three key components: global vulnerability awareness, line-structural awareness, and sensitive-line awareness.
- Score: 44.29771620061153
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Unlike the flowing semantics of natural language, programming languages are inherently rigid in structure and grammar. Existing fine-tuning methodologies for code vulnerability detection generally treat code as a long text sequence, stripping away structural elements such as newlines ('\n') and whitespace. This inadvertently discards crucial structural information, diminishing the distinct characteristics of code and impairing the accuracy of vulnerability detection. To address these challenges, we propose a novel network architecture based on pre-trained code models that incorporates structural information awareness. First, we propose an enhanced code text processing workflow that retains structural elements prior to modeling, allowing the model to preserve and exploit line-level structural and semantic information. Second, we introduce a new network architecture, the Code Structure-Aware Network through Line-level Semantic Learning (CSLS), which integrates three key components: global vulnerability awareness, line-structural awareness, and sensitive-line awareness. We conducted extensive experiments on vulnerability detection datasets derived from real-world projects. The results demonstrate that the new code pre-processing flow significantly improves existing baselines (e.g., a 3% accuracy improvement on the Devign dataset when applied to popular models such as CodeBERT and UniXcoder). The proposed network architecture also achieves superior accuracy in detecting vulnerabilities, surpassing recently established baselines. These findings underscore the importance of structural information in enhancing the efficacy of code vulnerability detection models.
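A minimal sketch of the two ideas the abstract describes — line-preserving preprocessing and attention over per-line vectors — assuming a generic PyTorch setup (names are illustrative, not the authors' CSLS implementation):

```python
import torch
import torch.nn as nn

def split_into_lines(code: str) -> list[str]:
    """Keep newlines and indentation as structure: each source line stays a
    distinct unit instead of flattening code into one whitespace-stripped string."""
    return code.replace("\r\n", "\n").split("\n")

class LineAwareHead(nn.Module):
    """Sketch of line-structural + sensitive-line awareness: attention over
    per-line vectors yields a weighted global code vector for classification."""
    def __init__(self, hidden: int = 768, num_classes: int = 2):
        super().__init__()
        self.line_scorer = nn.Linear(hidden, 1)   # scores how "sensitive" a line is
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, line_vecs: torch.Tensor) -> torch.Tensor:
        # line_vecs: (num_lines, hidden), e.g. mean-pooled token embeddings per line
        weights = torch.softmax(self.line_scorer(line_vecs), dim=0)
        code_vec = (weights * line_vecs).sum(dim=0)   # global awareness vector
        return self.classifier(code_vec)

# Usage: per-line vectors for a 12-line function from any pre-trained code encoder.
logits = LineAwareHead()(torch.randn(12, 768))
```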
Related papers
- Post-Incorporating Code Structural Knowledge into LLMs via In-Context Learning for Code Translation [10.77747590700758]
Large language models (LLMs) have achieved significant advances in software mining, but handling the syntactic structure of source code remains a challenge.
This paper employs in-context learning (ICL) to integrate code structural knowledge into pre-trained LLMs.
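As a rough illustration of the idea (not the paper's actual prompts), coarse AST facts can be serialized next to the source inside an in-context-learning prompt; the names below are hypothetical:

```python
import ast
import textwrap

def structure_summary(code: str) -> str:
    """Serialize coarse syntactic structure (node type per line) so it can sit
    alongside the source code in an ICL prompt handed to an LLM."""
    facts = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, (ast.FunctionDef, ast.If, ast.For, ast.While)):
            facts.append(f"line {node.lineno}: {type(node).__name__}")
    return "\n".join(facts)

code = textwrap.dedent("""\
    def first_negative(xs):
        for x in xs:
            if x < 0:
                return x
""")
# Hypothetical prompt layout: structure facts, then source, then the request.
prompt = f"Structure:\n{structure_summary(code)}\n\nSource:\n{code}\nTranslate to Java:"
```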
arXiv Detail & Related papers (2025-03-28T10:59:42Z) - A test-free semantic mistakes localization framework in Neural Code Translation [32.5036379897325]
We present EISP, a static analysis framework based on large language models (LLMs).
The framework generates a semantic mapping between source code and translated code.
EISP connects each pair of sub-code fragments with fine-grained knowledge hints through an AI chain.
arXiv Detail & Related papers (2024-10-30T08:53:33Z) - Source Code Vulnerability Detection: Combining Code Language Models and Code Property Graphs [5.953617559607503]
Vul-LMGNN is a unified model that combines pre-trained code language models with code property graphs.
Vul-LMGNN constructs a code property graph that integrates various code attributes into a unified graph structure.
To effectively retain dependency information among various attributes, we introduce a gated code Graph Neural Network.
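The gated code Graph Neural Network is not spelled out in the summary; a generic gated-GNN update step in that spirit (illustrative, not Vul-LMGNN's code) looks like:

```python
import torch
import torch.nn as nn

class GatedGraphLayer(nn.Module):
    """One gated message-passing step: aggregate neighbor states along the code
    property graph's edges, then let a GRU cell gate how much of the message
    overwrites each node's state, retaining dependency information."""
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.message = nn.Linear(hidden, hidden)
        self.gru = nn.GRUCell(hidden, hidden)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: (num_nodes, hidden) node states; adj: (num_nodes, num_nodes) edges
        msg = adj @ self.message(h)   # sum of transformed neighbor states
        return self.gru(msg, h)       # gated update of each node's state

# Usage: five AST/CFG/DFG nodes with a toy adjacency matrix.
h = GatedGraphLayer()(torch.randn(5, 128), torch.eye(5))
```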
arXiv Detail & Related papers (2024-04-23T03:48:18Z) - SCALE: Constructing Structured Natural Language Comment Trees for Software Vulnerability Detection [36.37244302912536]
We propose a Structured Natural Language Comment tree-based vulnerAbiLity dEtection framework built on pre-trained models.
The proposed Structured Natural Language Comment Tree (SCT) integrates the semantics of code statements with code execution sequences.
arXiv Detail & Related papers (2024-03-28T02:20:03Z) - CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model [55.321010757641524]
We introduce CLIP4STR, a simple yet effective STR method built upon image and text encoders of CLIP.
We scale CLIP4STR in terms of the model size, pre-training data, and training data, achieving state-of-the-art performance on 11 STR benchmarks.
arXiv Detail & Related papers (2023-05-23T12:51:20Z) - Physics of Language Models: Part 1, Learning Hierarchical Language Structures [51.68385617116854]
Transformer-based language models are effective but complex, and understanding their inner workings is a significant challenge.
We introduce a family of synthetic context-free grammars (CFGs) with hierarchical rules, capable of generating lengthy sentences.
We demonstrate that generative models like GPT can accurately learn this CFG language and generate sentences based on it.
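The paper's grammar families are its own; a toy generator showing how hierarchical CFG rules expand into sentences (made-up productions) might look like:

```python
import random

# Made-up hierarchical productions; the recursion through "S" lets
# derivations grow into lengthy sentences.
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["det", "noun"], ["det", "adj", "noun"]],
    "VP": [["verb", "NP"], ["verb", "S"]],
}

def derive(symbol: str, depth: int = 0, max_depth: int = 6) -> list[str]:
    """Expand a nonterminal by recursively applying random productions;
    anything without a rule (or past the depth cap) is emitted as-is."""
    if symbol not in RULES or depth > max_depth:
        return [symbol]
    production = random.choice(RULES[symbol])
    return [tok for part in production for tok in derive(part, depth + 1, max_depth)]

print(" ".join(derive("S")))   # e.g. "det noun verb det adj noun"
```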
arXiv Detail & Related papers (2023-05-23T04:28:16Z) - Benchmarking Language Models for Code Syntax Understanding [79.11525961219591]
Pre-trained language models have demonstrated impressive performance in both natural language processing and program understanding.
In this work, we perform the first thorough benchmarking of the state-of-the-art pre-trained models for identifying the syntactic structures of programs.
Our findings point out key limitations of existing pre-training methods for programming languages, and suggest the importance of modeling code syntactic structures.
arXiv Detail & Related papers (2022-10-26T04:47:18Z) - CSSAM: Code Search via Attention Matching of Code Semantics and Structures [8.547332796736107]
This paper introduces a code search model named CSSAM (Code Semantics and Structures Attention Matching).
By introducing semantic and structural matching mechanisms, CSSAM effectively extracts and fuses multidimensional code features.
A matching module leverages residual interaction to preserve more code semantics and descriptive features.
arXiv Detail & Related papers (2022-08-08T05:45:40Z) - Contrastive Learning for Source Code with Structural and Functional Properties [66.10710134948478]
We present BOOST, a novel self-supervised model whose pre-training focuses on the characteristics of source code.
We employ automated, structure-guided code transformation algorithms that generate functionally equivalent code that looks drastically different from the original.
We train the model with a contrastive learning objective that pulls functionally equivalent code closer together and pushes distinct code further apart.
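BOOST's exact loss is in the paper; a standard InfoNCE-style batch objective of the kind the summary describes (illustrative tensors and names) is:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor: torch.Tensor, positive: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Pull each code embedding toward its functionally equivalent transform
    (the matching row) and push it away from the other snippets in the batch."""
    a = F.normalize(anchor, dim=-1)      # (batch, dim) original snippets
    p = F.normalize(positive, dim=-1)    # (batch, dim) equivalent transforms
    logits = a @ p.T / temperature       # pairwise cosine similarities
    targets = torch.arange(a.size(0))    # diagonal entries are the true pairs
    return F.cross_entropy(logits, targets)

loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```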
arXiv Detail & Related papers (2021-10-08T02:56:43Z) - GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code.
We use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables.
We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
arXiv Detail & Related papers (2020-09-17T15:25:56Z)
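GraphCodeBERT extracts these edges with its own tooling; an intentionally naive sketch of the "where-the-value-comes-from" relation, linking each variable read to its most recent assignment, could be:

```python
import ast

def value_sources(code: str) -> list[tuple[str, int, int]]:
    """Link each variable read back to the line that last assigned it.
    Real data-flow analysis handles scopes, branches, loops, and statement
    order precisely; this sketch deliberately does not."""
    last_def: dict[str, int] = {}
    edges = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                last_def[node.id] = node.lineno
            elif isinstance(node.ctx, ast.Load) and node.id in last_def:
                edges.append((node.id, last_def[node.id], node.lineno))
    return edges

print(value_sources("x = 1\ny = x + 2\nz = y * x\n"))
# [('x', 1, 2), ('y', 2, 3), ('x', 1, 3)]
```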