Pre-training by Predicting Program Dependencies for Vulnerability
Analysis Tasks
- URL: http://arxiv.org/abs/2402.00657v1
- Date: Thu, 1 Feb 2024 15:18:19 GMT
- Title: Pre-training by Predicting Program Dependencies for Vulnerability
Analysis Tasks
- Authors: Zhongxin Liu, Zhijie Tang, Junwei Zhang, Xin Xia, and Xiaohu Yang
- Abstract summary: This work proposes two novel pre-training objectives, namely Control Dependency Prediction (CDP) and Data Dependency Prediction (DDP).
CDP and DDP aim to predict the statement-level control dependencies and token-level data dependencies, respectively, in a code snippet based only on its source code.
After pre-training, the resulting model can boost the understanding of vulnerable code during fine-tuning and can be used directly to perform dependence analysis for both partial and complete functions.
- Score: 12.016029378106131
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vulnerability analysis is crucial for software security. This work focuses on
using pre-training techniques to enhance the understanding of vulnerable code
and boost vulnerability analysis. The code understanding ability of a
pre-trained model is highly related to its pre-training objectives. The
semantic structure, e.g., control and data dependencies, of code is important
for vulnerability analysis. However, existing pre-training objectives either
ignore such structure or focus on learning to use it. The feasibility and
benefits of learning the knowledge of analyzing semantic structure have not
been investigated. To this end, this work proposes two novel pre-training
objectives, namely Control Dependency Prediction (CDP) and Data Dependency
Prediction (DDP), which aim to predict the statement-level control dependencies
and token-level data dependencies, respectively, in a code snippet based only
on its source code. During pre-training, CDP and DDP can guide the model to
learn the knowledge required for analyzing fine-grained dependencies in code.
After pre-training, the pre-trained model can boost the understanding of
vulnerable code during fine-tuning and can directly be used to perform
dependence analysis for both partial and complete functions. To demonstrate the
benefits of our pre-training objectives, we pre-train a Transformer model named
PDBERT with CDP and DDP, fine-tune it on three vulnerability analysis tasks,
i.e., vulnerability detection, vulnerability classification, and vulnerability
assessment, and also evaluate it on program dependence analysis. Experimental
results show that PDBERT benefits from CDP and DDP, leading to state-of-the-art
performance on the three downstream tasks. Also, PDBERT achieves F1-scores of
over 99% and 94% for predicting control and data dependencies, respectively, in
partial and complete functions.
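The abstract does not spell out how dependency prediction is posed as a learning objective. One plausible framing, purely as an illustration and not PDBERT's actual architecture (the bilinear scorer, names, and toy shapes below are all assumptions), is to treat token-level data dependency prediction as binary classification over every ordered pair of token representations produced by the encoder:

```python
import numpy as np

def pairwise_dependency_logits(token_states, W):
    """Score every ordered token pair (i, j) for a data dependency.

    token_states: (n, d) array of encoder outputs for n code tokens.
    W: (d, d) bilinear weight matrix learned during pre-training.
    Returns an (n, n) matrix of logits; entry [i, j] scores whether
    token j data-depends on token i.
    """
    return token_states @ W @ token_states.T

rng = np.random.default_rng(0)
n, d = 6, 8                          # 6 tokens, hidden size 8 (toy sizes)
h = rng.normal(size=(n, d))          # stand-in for Transformer outputs
W = rng.normal(size=(d, d))
logits = pairwise_dependency_logits(h, W)
probs = 1 / (1 + np.exp(-logits))    # independent sigmoid per token pair
print(probs.shape)                   # (6, 6)
```

During pre-training, each pair's sigmoid output would be supervised against ground-truth dependency edges extracted by a static analyzer, so the model learns to recover dependencies from source text alone; statement-level control dependency prediction could be framed the same way over statement representations.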
Related papers
- DFEPT: Data Flow Embedding for Enhancing Pre-Trained Model Based Vulnerability Detection [7.802093464108404]
We propose a data flow embedding technique to enhance the performance of pre-trained models in vulnerability detection tasks.
Specifically, we parse data flow graphs from function-level source code, and use the data type of the variable as the node characteristics of the DFG.
Our research shows that DFEPT can provide effective vulnerability semantic information to pre-trained models, achieving an accuracy of 64.97% on the Devign dataset and an F1-Score of 47.9% on the Reveal dataset.
arXiv Detail & Related papers (2024-10-24T07:05:07Z)
- In2Core: Leveraging Influence Functions for Coreset Selection in Instruction Finetuning of Large Language Models [37.45103473809928]
We propose the In2Core algorithm, which selects a coreset by analyzing the correlation between training and evaluation samples with a trained model.
By applying our algorithm to instruction fine-tuning data of LLMs, we can achieve similar performance with just 50% of the training data.
arXiv Detail & Related papers (2024-08-07T05:48:05Z)
- Understanding Programmatic Weak Supervision via Source-aware Influence Function [76.74549130841383]
Programmatic Weak Supervision (PWS) aggregates the source votes of multiple weak supervision sources into probabilistic training labels.
We build on the Influence Function (IF) to decompose the end model's training objective and then calculate the influence associated with each (data, source, class) tuple.
These primitive influence scores can then be used to estimate the influence of individual PWS components, such as source votes, supervision sources, and training data.
arXiv Detail & Related papers (2022-05-25T15:57:24Z)
- Unified Instance and Knowledge Alignment Pretraining for Aspect-based Sentiment Analysis [96.53859361560505]
Aspect-based Sentiment Analysis (ABSA) aims to determine the sentiment polarity towards an aspect.
There always exists a severe domain shift between the pretraining and downstream ABSA datasets.
We introduce a unified alignment pretraining framework into the vanilla pretrain-finetune pipeline.
arXiv Detail & Related papers (2021-10-26T04:03:45Z)
- Identifying Non-Control Security-Critical Data through Program Dependence Learning [9.764831771725952]
In data-oriented attacks, a fundamental step is to identify non-control, security-critical data.
We propose a novel approach that combines traditional program analysis with deep learning.
The toolchain uncovers 80 potential critical variables in Google FuzzBench.
arXiv Detail & Related papers (2021-08-27T00:28:06Z)
- Federated Learning with Unreliable Clients: Performance Analysis and Mechanism Design [76.29738151117583]
Federated Learning (FL) has become a promising tool for training effective machine learning models among distributed clients.
However, low-quality models could be uploaded to the aggregator server by unreliable clients, leading to a degradation or even a collapse of training.
We model these unreliable behaviors of clients and propose a defensive mechanism to mitigate such a security risk.
arXiv Detail & Related papers (2021-05-10T08:02:27Z)
- Relate and Predict: Structure-Aware Prediction with Jointly Optimized Neural DAG [13.636680313054631]
We propose a deep neural network framework, dGAP, to learn a neural dependency Graph and optimize structure-Aware target Prediction.
dGAP trains towards a structure self-supervision loss and a target prediction loss jointly.
We empirically evaluate dGAP on multiple simulated and real datasets.
arXiv Detail & Related papers (2021-03-03T13:55:12Z)
- Robust Pre-Training by Adversarial Contrastive Learning [120.33706897927391]
Recent work has shown that, when integrated with adversarial training, self-supervised pre-training can lead to state-of-the-art robustness.
We improve robustness-aware self-supervised pre-training by learning representations consistent under both data augmentations and adversarial perturbations.
arXiv Detail & Related papers (2020-10-26T04:44:43Z)
- Accurate and Robust Feature Importance Estimation under Distribution Shifts [49.58991359544005]
PRoFILE is a novel feature importance estimation method.
We show significant improvements over state-of-the-art approaches, both in terms of fidelity and robustness.
arXiv Detail & Related papers (2020-09-30T05:29:01Z)
- Estimating Structural Target Functions using Machine Learning and Influence Functions [103.47897241856603]
We propose a new framework for statistical machine learning of target functions arising as identifiable functionals from statistical models.
This framework is problem- and model-agnostic and can be used to estimate a broad variety of target parameters of interest in applied statistics.
We put particular focus on so-called coarsening at random/doubly robust problems with partially unobserved information.
arXiv Detail & Related papers (2020-08-14T16:48:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.